[00:00:40] 10ops-codfw, 10serviceops, 10Patch-For-Review: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10RLazarus) a:05RLazarus→03Papaul [00:00:49] 10ops-codfw, 10serviceops, 10Patch-For-Review: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10RLazarus) @Papaul All yours! [00:18:30] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:35:30] (03CR) 10Krinkle: [C: 03+1] Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [00:41:18] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:42:34] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:43] (03PS1) 10Tim Starling: Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/817860 (https://phabricator.wikimedia.org/T313578) [00:49:10] (03PS1) 10Tim Starling: Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817861 (https://phabricator.wikimedia.org/T313578) [00:50:09] (03CR) 10Tim Starling: [C: 03+2] Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/817860 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [00:50:12] (03CR) 10Tim 
Starling: [C: 03+2] Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817861 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [00:56:12] (03Merged) 10jenkins-bot: Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/817860 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [00:58:46] (03Merged) 10jenkins-bot: Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817861 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [01:11:17] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/OAuth: New config var for T313578, not yet used (duration: 03m 39s) [01:11:23] T313578: Make OAuth work in Multi-DC active/active mode - https://phabricator.wikimedia.org/T313578 [01:17:32] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Danielgblack) Are you sure db1111 wasn't affected? While it didn't loose grafana plots there are some correlations (and some anti-c... 
[01:18:32] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/OAuth: New config var for T313578, not yet used (duration: 03m 23s) [01:18:37] T313578: Make OAuth work in Multi-DC active/active mode - https://phabricator.wikimedia.org/T313578 [01:20:42] (03PS3) 10Tim Starling: Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578) [01:21:49] (03CR) 10Tim Starling: [C: 03+2] Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [01:22:57] (03Merged) 10jenkins-bot: Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [01:25:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [01:26:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [01:26:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [01:27:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [01:28:04] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: move OAuth token storage T313578 (duration: 03m 04s) [01:28:08] T313578: Make OAuth work in Multi-DC active/active mode - https://phabricator.wikimedia.org/T313578 [01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:40] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:17:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:06] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:30:07] (03PS1) 10Krinkle: docs: Remove outdated github/travis badges [debs/pybal] - 10https://gerrit.wikimedia.org/r/817918 [02:37:28] (03CR) 10Krinkle: monitoring: Fix broken grafana URLs that include unencoded space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812945 (owner: 10Krinkle) [02:37:31] (03Abandoned) 10Krinkle: monitoring: Fix broken grafana URLs that include unencoded space [puppet] - 10https://gerrit.wikimedia.org/r/812945 (owner: 10Krinkle) [02:58:24] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [03:55:02] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:23:30] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 3 (netmon1003, ...), Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:43:38] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) @fgiunchedi and I spoke about this today. Some notes: #### Work queue When Swift receives an object with an expiration, the... [05:06:58] (03PS1) 10Stang: ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) [05:19:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811 [05:19:45] T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811 [05:19:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811 [05:24:50] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 121 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:37:31] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) db1111 wasn't affected by this issue. It was only affecting s1 (english wikipedia) and db1111 lives in s8 (wikidata). W... 
[05:42:50] (03PS2) 10Marostegui: site.pp: Promote db2144 to x2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811) [05:45:50] (03CR) 10Marostegui: [C: 03+2] site.pp: Promote db2144 to x2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811) (owner: 10Marostegui) [05:50:07] (03PS2) 10KartikMistry: Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817758 (https://phabricator.wikimedia.org/T313300) [06:00:05] kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T0600) [06:00:14] !log Starting x2 codfw failover from db2142 to db2144 - T313811 [06:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:19] T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811 [06:00:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2144 to x2 primary T313811', diff saved to https://phabricator.wikimedia.org/P32025 and previous config saved to /var/cache/conftool/dbconfig/20220728-060057-marostegui.json [06:01:52] (03PS1) 10KartikMistry: Update cxserver to 2022-07-27-220330-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817931 (https://phabricator.wikimedia.org/T308248) [06:07:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2142 T313811', diff saved to https://phabricator.wikimedia.org/P32026 and previous config saved to /var/cache/conftool/dbconfig/20220728-060757-root.json [06:08:03] T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811 [06:12:23] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Marostegui) [06:55:12] marostegui: I want to update cxserver. 
Is it OK to go if Switchover is done. [06:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [06:59:35] OK, will update cxserver after backport window. [07:00:04] Amir1, apergos, jnuche, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T0700). [07:00:04] koi and kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:21] good morning! [07:00:38] there are two patches in the window today. I'll check on trainees momentarily. [07:01:43] hi! [07:02:11] we do have one trainee! [07:02:14] kart_: yes! [07:02:46] they are not yet here so we should wait for them. [07:03:12] koi and kart_: would you both ordinarily be self-deploy or no? [07:03:36] apergos: I'll self-deploy. [07:03:50] Need to go to Lunch after deployment :~ [07:04:02] apergos: I have no access to shell so.. [07:04:04] so wait a little, if our trainee doesn't show up in about 10 minutes, I'll ask you to go ahead [07:04:14] kart_: [07:04:14] (Although: Do not leave town rule applies, so I'm in the town :D) [07:04:49] Should I go ahead, apergos? [07:05:14] the reason is that I will lean on kart to screenshare while typing, for the benefit of our trainee, as I prefer not to try to train and deploy at the same time [07:05:41] kart_: go ahed and merge but then let's wait 10 min and see if our trainee arrives. [07:05:49] OK! [07:06:06] koi: I'll handle your deploy after kart_ is settled, with or without a trainee. [07:06:31] apergos: Sorry for that. I'll have other deployments next week, so we can plan for screenshare/training. 
[07:06:50] morning 👋 [07:06:59] oh hey jnuche, awesome [07:07:03] apergos: do you want me to deploy so you don't need to do both things? [07:07:20] oh! that would be dandy, but note that our trainee is not yet here [07:07:41] and yes, I'll task you with deploying today if you don't mind, more experience is better :-) [07:08:02] okie dokes [07:09:00] kart_: you should have +2 your config and be waiting for merge by now :-P [07:10:12] (03CR) 10KartikMistry: [C: 03+2] Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817758 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry) [07:10:23] good good [07:10:35] apergos: :D [07:11:13] (03Merged) 10jenkins-bot: Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817758 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry) [07:13:02] we'll give it 5 more minutes and then deployment by someone, heh [07:17:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:17:35] kart_: the requisite time has passed. please go ahead and self-deploy. [07:18:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:18:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:19:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:19:11] apergos: Deploying. Was confused about messages from mwdebug canaries. [07:19:24] ah! [07:19:29] 2022-07-28 07:12:50,306 [INFO] The server is depooled from all services. Restarting the service directly [07:19:33] This ^ [07:19:57] I think it is fixed already. 
[07:19:58] there have been changes over the lst two days to the scap restart of php-fpm iirc [07:20:29] yeah, that's correct, we had a bug where we weren't correctly restarting php-fpm on canaries [07:21:44] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817758|Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default (T313300)]] (duration: 03m 16s) [07:21:50] T313300: Enable Section Translation on 10 more Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T313300 [07:21:59] OK. I'm done apergos. [07:22:08] great! [07:22:42] koi: you couldn't self-deploy right? [07:22:45] jnuche, you're up for +2 and deployment of koi 's patch, after doing all the usual due diligence [07:22:57] alright [07:23:49] (03CR) 10Jaime Nuche: [C: 03+2] "Backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang) [07:26:20] ..maybe need a rebase? [07:26:48] the rebase is part of the deployment process [07:28:26] yeah, will be done on the deployment server [07:30:12] (03PS2) 10Jaime Nuche: ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang) [07:30:59] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36460/console" [puppet] - 10https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [07:33:21] apergos: the patch is not merging on its own, should a submit manually? 
[07:34:48] I would remove the +2, and redo it [07:35:06] (03CR) 10Jaime Nuche: ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang) [07:35:35] (03CR) 10Jaime Nuche: [C: 03+2] ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang) [07:36:27] (03Merged) 10jenkins-bot: ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang) [07:36:37] there we go [07:38:31] koi: the change is now on mwdebug1001, please check [07:38:38] looking [07:39:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:40:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:40:12] so a thing I didn't think to mention that I often do is to reload the patch just before I am about to +2 it, and if I see that there is a merge conflict I will ask the patch owner or the person asking for deployment to resolve that first. [07:40:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:40:32] usually that's a matter of a simple rebase [07:41:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:41:14] gotcha [07:44:53] !log update HAProxy to version 2.4.18 on apt.wm.o thirdparty/haproxy24 [07:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:25] I thought it should be ok but not pretty sure [07:47:39] anyway, please sync it [07:48:06] do you need more time to check? 
[07:48:11] (03PS1) 10Kevin Bazira: ml-services: Add ko, sr & uk wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818055 (https://phabricator.wikimedia.org/T313307) [07:48:24] I tested using WikimediaDebug on mwdebug1001, and koi's patch doesn't seem to work [07:49:00] If what the author of the Phab task is saying is correct, there should be no need for special site configuration, and the bug lies elsewhere [07:49:27] do you want to roll this back? [07:49:57] thanks for point it out, I don't know this is a kind of bug [07:50:31] previous I thought maybe need to run namespaceDupes.php for it to take affect, but yeah [07:50:59] apergos: please revert it, I will look into it later [07:51:07] ok! [07:51:15] jnuche: you got this? [07:51:24] yep, rolling back [07:51:33] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:51:50] 👍 [07:54:02] (03PS1) 10Jaime Nuche: Revert "ja(wiki|wikivoyage): Add "Module talk" as alias of NS829" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818056 [07:54:55] (03CR) 10Stang: "Reverted in I3a5460b2627a600d53e13ca39c26819e45429d44" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang) [07:55:08] (03CR) 10AikoChou: [C: 03+1] ml-services: Add ko, sr & uk wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818055 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [07:55:53] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) So more 
info: At the time of the event, nothing logged apart from: ` Jul 27 20:38:32 db1132 mysqld[3344701]: 2022-07-27... [07:56:19] (03CR) 10Jaime Nuche: [C: 03+2] Revert "ja(wiki|wikivoyage): Add "Module talk" as alias of NS829" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818056 (owner: 10Jaime Nuche) [07:57:33] (03Merged) 10jenkins-bot: Revert "ja(wiki|wikivoyage): Add "Module talk" as alias of NS829" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818056 (owner: 10Jaime Nuche) [07:58:20] koi, apergos: commit reverted and mwdebug1001 rolled back [07:58:39] awesome [07:58:51] got it, thanks [07:59:16] sure thing, sorry the fix didn't work! [07:59:58] (03CR) 10Elukey: [C: 03+2] ml-services: Add ko, sr & uk wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818055 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [08:01:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:01:50] !log UTC morning backport and config training [08:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:13] oops [08:02:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:02:21] !log UTC morning backport and config training done [08:02:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:03:48] thanks for doing all the things, jnuche! [08:04:09] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:04:24] and that concludes today's somewhat more eventful session, see everyone here next time! I'll leave a note on the trainee's task and see what happened there. 
[08:04:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/817883 (owner: 10Andrea Denisse) [08:05:52] (03PS1) 10Marostegui: db2172: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818058 [08:07:01] (03CR) 10Marostegui: [C: 03+2] db2172: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818058 (owner: 10Marostegui) [08:09:40] (03PS1) 10Marostegui: instances.yaml: Add db2172 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818060 (https://phabricator.wikimedia.org/T311493) [08:11:32] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2172 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818060 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [08:12:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2172 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P32028 and previous config saved to /var/cache/conftool/dbconfig/20220728-081252-marostegui.json [08:12:57] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [08:14:21] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:17:15] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::webserver: fallback php version for webrequests [puppet] - 10https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [08:19:20] (03CR) 10Gergő Tisza: [C: 03+1] Multi-DC routing special cases for OAuth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [08:19:38] (03CR) 10Gergő Tisza: Multi-DC routing special cases for OAuth [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [08:24:33] (03PS1) 10Volans: admin: add sre-admins to the check for ops 
[puppet] - 10https://gerrit.wikimedia.org/r/818061 [08:25:24] (03CR) 10CI reject: [V: 04-1] admin: add sre-admins to the check for ops [puppet] - 10https://gerrit.wikimedia.org/r/818061 (owner: 10Volans) [08:27:18] (03PS2) 10Volans: admin: add sre-admins to the check for ops [puppet] - 10https://gerrit.wikimedia.org/r/818061 [08:28:11] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [08:32:58] (03PS1) 10Marostegui: db2174: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818062 [08:34:45] (03CR) 10Marostegui: [C: 03+2] db2174: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818062 (owner: 10Marostegui) [08:36:10] (03PS11) 10Filippo Giunchedi: sre: Port swift alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [08:36:35] !log update HAProxy to version 2.4.18 in cp4021 and cp4027 [08:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:26] (03CR) 10Filippo Giunchedi: "I know Mark is busy with other commitments so I BOLDed and went ahead to fix the issues in previous patchset. I've also temporarily downgr" [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [08:38:12] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] "Applying now" [puppet] - 10https://gerrit.wikimedia.org/r/817701 (https://phabricator.wikimedia.org/T312638) (owner: 10Giuseppe Lavagetto) [08:38:14] * kart_ updating cxserver.. 
[08:38:28] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-07-27-220330-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817931 (https://phabricator.wikimedia.org/T308248) (owner: 10KartikMistry) [08:42:42] (03Merged) 10jenkins-bot: Update cxserver to 2022-07-27-220330-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817931 (https://phabricator.wikimedia.org/T308248) (owner: 10KartikMistry) [08:43:50] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [08:44:16] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [08:48:16] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [08:48:35] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:48:58] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [08:49:15] PROBLEM - puppet last run on lvs6001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:19] PROBLEM - puppet last run on lvs2008 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:33] PROBLEM - puppet last run on lvs4006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:33] PROBLEM - puppet last run on lvs3006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:49:51] PROBLEM - puppet last run on lvs5001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:50:35] (03CR) 10Vgutierrez: [C: 03+1] "Looks good from a traffic point of view" [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [08:51:06] (03PS3) 10Vgutierrez: varnish: enable query-sorting in production via X-Wikimedia-Debug [puppet] - 10https://gerrit.wikimedia.org/r/816206 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [08:51:22] <_joe_> sigh [08:51:26] <_joe_> no escape heh [08:51:45] uh? [08:51:52] <_joe_> the puppet alerts above [08:51:53] akosiaris: ^^ is that related to your confd change? [08:51:57] <_joe_> I tried to avoid them [08:52:07] <_joe_> vgutierrez: no it's related to our puppet alerting :P [08:52:10] LOL [08:52:14] <_joe_> it alerts once you actually run puppet [08:52:18] oh lovely [08:52:33] * vgutierrez keeps the axe away [08:53:36] !log disable puppet on cp hosts to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/816206 [08:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:22] (03PS1) 10Giuseppe Lavagetto: mediawiki::webserver: fix non-default routing [puppet] - 10https://gerrit.wikimedia.org/r/818063 [08:55:43] RECOVERY - puppet last run on lvs6001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:55:47] RECOVERY - puppet last run on lvs2008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:55:59] RECOVERY - puppet last run on lvs4006 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:55:59] RECOVERY - puppet last run on lvs3006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:56:12] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [08:56:17] RECOVERY - puppet last run on lvs5001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:56:35] (03CR) 10Vgutierrez: [C: 03+2] varnish: enable query-sorting in production via X-Wikimedia-Debug [puppet] - 10https://gerrit.wikimedia.org/r/816206 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori) [08:56:46] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36461/console" [puppet] - 10https://gerrit.wikimedia.org/r/818063 (owner: 10Giuseppe Lavagetto) [08:56:54] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [08:57:31] !log Updated cxserver to 2022-07-27-220330-production (T308248) [08:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:35] T308248: Newly supported languages in Google Translate - https://phabricator.wikimedia.org/T308248 [08:57:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2142', diff saved to https://phabricator.wikimedia.org/P32030 and previous config saved to /var/cache/conftool/dbconfig/20220728-085737-marostegui.json [09:00:13] (03PS2) 10Giuseppe Lavagetto: mediawiki::webserver: fix non-default routing [puppet] - 10https://gerrit.wikimedia.org/r/818063 [09:00:53] (03CR) 10Jbond: [C: 03+1] admin: add sstefanova user and to WMCS groups [puppet] - 10https://gerrit.wikimedia.org/r/817845 (https://phabricator.wikimedia.org/T313934) (owner: 10Volans) [09:01:38] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36462/console" [puppet] - 10https://gerrit.wikimedia.org/r/818063 (owner: 10Giuseppe Lavagetto) 
[09:02:17] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki::webserver: fix non-default routing [puppet] - 10https://gerrit.wikimedia.org/r/818063 (owner: 10Giuseppe Lavagetto) [09:06:31] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10Vgutierrez) @ori I've just deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/816206 and naive tested it against cp4027: ` vg... [09:07:29] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (10fgiunchedi) [09:17:25] !log set thanos ring replicas to 3.95 T311690 [09:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:32] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [09:19:38] (03CR) 10MVernon: [C: 03+2] swift: stop flinging thumbnails at other DC in rewrite.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102) (owner: 10MVernon) [09:24:55] !log rolling restart of swift proxies to apply wmf/rewrite update T313102 [09:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:00] T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 [09:28:52] 10SRE-swift-storage, 10Patch-For-Review: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I've deployed this fix now, so closing this issue. [09:31:22] (03Abandoned) 10Jbond: do not merge! [puppet] - 10https://gerrit.wikimedia.org/r/817788 (owner: 10Jbond) [09:33:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[09:33:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:36:22] (03PS1) 10Volans: raid: convert get-raid-status-megacli to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952) [09:36:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:38:23] (03CR) 10CI reject: [V: 04-1] raid: convert get-raid-status-megacli to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans) [09:40:32] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:40:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [09:41:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:48:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:50:06] (03PS1) 10Jbond: C:raid convert to python3 [puppet] - 10https://gerrit.wikimedia.org/r/818072 (https://phabricator.wikimedia.org/T313952) [09:50:50] (03Abandoned) 10Jbond: C:raid convert to python3 [puppet] - 10https://gerrit.wikimedia.org/r/818072 (https://phabricator.wikimedia.org/T313952) (owner: 10Jbond) [09:55:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:58:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:00:04] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1000). [10:00:10] (03PS2) 10Volans: raid: convert get-raid-status-megacli to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952) [10:03:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:05:18] !log update gitlab1004 to 15.0.4-ce.0 [10:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:27] (03CR) 10Zfilipin: "I guess this commit can be abandoned since the related task is resolved." 
[puppet] - 10https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794) (owner: 10Pwangai) [10:08:34] (03Abandoned) 10Pwangai: admin: Add pwangai to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794) (owner: 10Pwangai) [10:12:44] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin2002" [10:13:16] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin2002" [10:13:50] (03PS2) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 [10:13:56] (03PS3) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 [10:17:32] (03PS1) 10Majavah: P:prometheus::openstack_exporter: fix executable permissions [puppet] - 10https://gerrit.wikimedia.org/r/818075 (https://phabricator.wikimedia.org/T314016) [10:17:40] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) Some updates. Along with @Danielgblack @VicentiuCiorbaru (**MariaDB Foundation**) we have spent quite some fun time de... 
[10:19:07] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin2002" [10:19:29] !log jbond@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin2002" [10:20:49] (03PS1) 10AikoChou: ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888) [10:20:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32032 and previous config saved to /var/cache/conftool/dbconfig/20220728-102051-root.json [10:20:55] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10LSobanski) @Jclark-ctr it will be another few weeks. The process to get the host ready has been started but it is a lengthy one. 
[10:21:02] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: standardize php pool names (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/818077 [10:21:04] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: switch apache configurations to use the new pool [puppet] - 10https://gerrit.wikimedia.org/r/818078 [10:21:06] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: cleanup the legacy pool name [puppet] - 10https://gerrit.wikimedia.org/r/818079 [10:22:27] (03PS2) 10AikoChou: ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888) [10:24:04] (03CR) 10CI reject: [V: 04-1] mediawiki::php: standardize php pool names (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/818077 (owner: 10Giuseppe Lavagetto) [10:24:56] (03CR) 10CI reject: [V: 04-1] mediawiki::php: switch apache configurations to use the new pool [puppet] - 10https://gerrit.wikimedia.org/r/818078 (owner: 10Giuseppe Lavagetto) [10:27:53] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 (owner: 10Jbond) [10:28:58] (03PS4) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 [10:35:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32034 and previous config saved to /var/cache/conftool/dbconfig/20220728-103555-root.json [10:37:02] (03CR) 10Klausman: [C: 03+1] ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou) [10:39:53] (03CR) 10Klausman: [C: 03+2] ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou) [10:40:02] 10SRE, 10LDAP-Access-Requests, 
10User-zeljkofilipin: +2 for pwangai - https://phabricator.wikimedia.org/T313794 (10Peachey88) >>! In T313794#8109751, @pwangai wrote: > I am closing this request to follow instructions stipulated at https://www.mediawiki.org/wiki/Gerrit/Privilege_policy/en#Requesting_Gerrit_pr... [10:43:14] (03Merged) 10jenkins-bot: ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou) [10:46:30] (03PS1) 10Jbond: fix exception handleing [cookbooks] - 10https://gerrit.wikimedia.org/r/818085 [10:46:49] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/818078 [10:46:51] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/818079 [10:48:18] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:49:52] (03CR) 10CI reject: [V: 04-1] mediawiki::php: standardize pool names (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/818078 (owner: 10Giuseppe Lavagetto) [10:50:56] (03PS2) 10Jbond: cookbook sre.puppet.sync-netbox-hiera: Fix exception handling [cookbooks] - 10https://gerrit.wikimedia.org/r/818085 [10:51:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32035 and previous config saved to /var/cache/conftool/dbconfig/20220728-105100-root.json [10:53:00] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[10:54:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: Add a missing f on an f-string [software/conftool] - 10https://gerrit.wikimedia.org/r/817910 (owner: 10RLazarus) [10:56:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/818085 (owner: 10Jbond) [10:57:04] (03Merged) 10jenkins-bot: requestctl: Add a missing f on an f-string [software/conftool] - 10https://gerrit.wikimedia.org/r/817910 (owner: 10RLazarus) [10:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [11:00:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans) [11:00:51] (03PS3) 10Jbond: cookbook sre.puppet.sync-netbox-hiera: Fix exception handling [cookbooks] - 10https://gerrit.wikimedia.org/r/818085 [11:00:54] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: standardize php pool names (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/818077 [11:00:56] (03PS3) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/818078 [11:00:58] (03PS3) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/818079 [11:01:01] (03CR) 10Jbond: [C: 03+2] "fixed thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/818085 (owner: 10Jbond) [11:04:52] (03CR) 10Jbond: "As said on irc im not sure about this. We have spoken about giving sre-admins different ldap permissions e.g. 
read-only orchestrator acce" [puppet] - 10https://gerrit.wikimedia.org/r/818061 (owner: 10Volans) [11:05:24] (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/debmonitor] - 10https://gerrit.wikimedia.org/r/817722 (owner: 10Volans) [11:05:35] (03CR) 10Volans: [C: 03+2] raid: convert get-raid-status-megacli to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans) [11:05:39] (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/cumin] - 10https://gerrit.wikimedia.org/r/817726 (owner: 10Volans) [11:05:46] (03PS3) 10Jcrespo: Initial commit [software/pampinus] - 10https://gerrit.wikimedia.org/r/817294 (https://phabricator.wikimedia.org/T283017) [11:06:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32036 and previous config saved to /var/cache/conftool/dbconfig/20220728-110604-root.json [11:06:22] (03Merged) 10jenkins-bot: cookbook sre.puppet.sync-netbox-hiera: Fix exception handling [cookbooks] - 10https://gerrit.wikimedia.org/r/818085 (owner: 10Jbond) [11:08:06] (03PS1) 10Jcrespo: Adapt mysql prometheus script to new zarcillo schema [puppet] - 10https://gerrit.wikimedia.org/r/818088 (https://phabricator.wikimedia.org/T283017) [11:08:38] RECOVERY - MegaRAID on ms-be2067 is OK: manual re-trigger of the critical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:09:07] this is me [11:09:27] to test the re-trigger of the raid handler that will call get-raid-status-megacli that should work now [11:16:40] (03PS2) 10Jcrespo: Adapt mysql prometheus script to new zarcillo schema [puppet] - 10https://gerrit.wikimedia.org/r/818088 (https://phabricator.wikimedia.org/T283017) [11:17:29] (03PS12) 10Jbond: P:ssh::client: use more modern functions for collecting sskey [puppet] - 10https://gerrit.wikimedia.org/r/816775 [11:21:11] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'db1111 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32037 and previous config saved to /var/cache/conftool/dbconfig/20220728-112109-root.json [11:22:17] (03PS13) 10Jbond: P:ssh::client: use more modern functions for collecting sskey [puppet] - 10https://gerrit.wikimedia.org/r/816775 [11:23:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36463/console" [puppet] - 10https://gerrit.wikimedia.org/r/816775 (owner: 10Jbond) [11:24:28] (03Abandoned) 10Jbond: sshkey: move the sort to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/816847 (owner: 10Jbond) [11:24:36] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:52] (03Abandoned) 10Jbond: never merge, test doing the reduce in ruby [puppet] - 10https://gerrit.wikimedia.org/r/816852 (owner: 10Jbond) [11:27:24] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:ssh::client: use more modern functions for collecting sskey [puppet] - 10https://gerrit.wikimedia.org/r/816775 (owner: 10Jbond) [11:30:00] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [11:32:18] PROBLEM - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:32:19] ACKNOWLEDGEMENT - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T314039 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:32:22] 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314039 (10ops-monitoring-bot) [11:34:06] (03CR) 10Vgutierrez: "please let's ensure that 
this CR is also backwards compatible or it's going to break the current deployment-prep environment" [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [11:35:27] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: 9.x upgrade: remove wmf-tls log format [puppet] - 10https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [11:35:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [11:35:46] (03PS2) 10Alexandros Kosiaris: Switch etcd clients to use conf100[789] [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) [11:35:49] (03CR) 10Alexandros Kosiaris: [V: 03+2] Switch etcd clients to use conf100[789] [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [11:36:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32038 and previous config saved to /var/cache/conftool/dbconfig/20220728-113615-root.json [11:41:22] !log slow (10minutes interval) rolling restart of all pybals to pick up new conf hosts config. 
T311407 [11:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:28] T311407: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407 [11:43:34] PROBLEM - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [11:43:40] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [11:45:24] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:45:36] PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [11:45:56] 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314039 (10Volans) Actual output, I'll check why it didn't work ` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components in optimal state) name: Adapter #0 Virtual Drive: 2 (Ta... [11:46:20] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [11:46:34] PROBLEM - PyBal connections to etcd on lvs3006 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [11:46:46] akosiaris: is this you? [11:47:14] he logged "slow (10minutes interval) rolling restart of all pybals to pick up new conf hosts config. 
T311407" 5 minutes ago [11:47:15] T311407: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407 [11:47:16] PROBLEM - PyBal connections to etcd on lvs6003 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [11:47:18] I saw him commenting about a pyball work, and those match the new hosts [11:48:08] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal [11:48:08] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [11:48:22] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [11:48:32] PROBLEM - PyBal connections to etcd on lvs3007 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [11:48:56] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:13] yep.. 
that's akosiaris work [11:49:44] ok, ignoring :) [11:50:01] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test 818085 - jbond@cumin2002" [11:50:20] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test 818085 - jbond@cumin2002" [11:53:03] (03CR) 10Klausman: [C: 03+1] Puppetize spark3 installation and configs using conda-analytics env (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [11:56:28] (03PS1) 10Filippo Giunchedi: sre: port Kafka alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) [11:56:59] (03PS2) 10Filippo Giunchedi: sre: port Kafka alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847) [12:03:16] PROBLEM - SSH on wtp1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:07:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:22] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:09:42] yeah, alerts will slowly recover as I am restarting pybals [12:09:53] monitoring has diverged by the actual state of things right now [12:10:07] I don't dare to converge any faster than that though ;-) [12:14:01] ack, thx [12:16:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:36] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write 
cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:21:39] (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 [12:22:31] (03PS11) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [12:24:12] RECOVERY - PyBal connections to etcd on lvs6001 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [12:24:19] (03CR) 10Aqu: "Typo fixed." [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [12:25:20] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:49] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond) [12:27:48] (03PS2) 10Sbisson: Register Wikistories streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) [12:27:58] (03CR) 10Sbisson: Register Wikistories streams (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) (owner: 10Sbisson) [12:31:14] RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [12:31:18] (03CR) 10Volans: [C: 04-1] "I don't think that this approach will work at all. 
You have to implement part of the logic in spicerack IMHO to allow for a check only, do" [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond) [12:33:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32039 and previous config saved to /var/cache/conftool/dbconfig/20220728-123304-root.json [12:34:24] (03CR) 10Jelto: [C: 04-1] "There are two typos (gitlab and not gerrit), left comments in line." [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [12:37:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1132.eqiad.wmnet with reason: Maintenance [12:37:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1132.eqiad.wmnet with reason: Maintenance [12:38:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:38:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:38:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1169.eqiad.wmnet with reason: Maintenance [12:38:37] (03PS1) 10Jbond: reposync: don't ask for confirmation in dry run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114 [12:38:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1169.eqiad.wmnet with reason: Maintenance [12:38:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T312990)', diff saved to https://phabricator.wikimedia.org/P32040 and previous config saved to /var/cache/conftool/dbconfig/20220728-123854-marostegui.json [12:38:59] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [12:39:07] 
(03CR) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond) [12:39:48] (03PS3) 10Ssingh: trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) [12:39:53] (03PS2) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 [12:40:56] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36464/console" [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [12:42:28] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:43:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T312990)', diff saved to https://phabricator.wikimedia.org/P32041 and previous config saved to /var/cache/conftool/dbconfig/20220728-124317-marostegui.json [12:43:24] (03PS1) 10Marostegui: instances.yaml: Add db2174 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818115 (https://phabricator.wikimedia.org/T311493) [12:43:59] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond) [12:44:58] RECOVERY - PyBal connections to etcd on lvs6003 is OK: OK: 16 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [12:45:08] (03PS1) 10AikoChou: ml-services: add outlink-topic-model isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818116 (https://phabricator.wikimedia.org/T313888) [12:45:10] (03CR) 10CI reject: [V: 04-1] reposync: don't ask for confirmation in dry run mode [software/spicerack] - 
10https://gerrit.wikimedia.org/r/818114 (owner: 10Jbond) [12:45:56] (03PS2) 10Jbond: reposync: don't ask for confirmation in dry run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114 [12:46:09] (03PS3) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 [12:47:13] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2174 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818115 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [12:48:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32042 and previous config saved to /var/cache/conftool/dbconfig/20220728-124809-root.json [12:49:26] (03PS3) 10Phuedx: testwiki: Add mediawiki.web_ui.interactions stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268) [12:49:40] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2174 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P32043 and previous config saved to /var/cache/conftool/dbconfig/20220728-125253-marostegui.json [12:52:59] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [12:54:05] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36465/console" [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [12:58:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P32044 and previous config saved to /var/cache/conftool/dbconfig/20220728-125823-marostegui.json 
[12:59:51] (03CR) 10Elukey: [C: 03+2] ml-services: add outlink-topic-model isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818116 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou) [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1300). [13:00:05] Lucas_WMDE and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:11] o/ [13:00:12] I can deploy! [13:01:13] Thanks! I have a new laptop and I haven't yet generated new keys [13:01:23] (03PS2) 10Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806931 (https://phabricator.wikimedia.org/T307869) [13:01:39] ok [13:02:01] did you see my question in here yesterday, about the rate? 
[13:02:08] (03PS4) 10Ssingh: trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) [13:02:12] (I haven’t looked at the diffConfig of the latest patch set yet) [13:03:01] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36466/console" [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:03:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32045 and previous config saved to /var/cache/conftool/dbconfig/20220728-130314-root.json [13:03:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Configure wbsearchentities profile parameter on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806931 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [13:03:53] Lucas_WMDE: I did see the question. 
I think it's because the stream is now default in InitialiseSettings.php and I couldn't override it for all beta wikis in InitialiseSettings-labs.php (+default doesn't work IIRC) [13:04:04] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:04:08] So I'd expect some beta wikis to have the stream defined but with a sampling rate of 0 [13:04:13] ok [13:04:15] Which is acceptable [13:04:21] that was going to be my next question :) [13:04:25] “do we care” [13:04:26] (03Merged) 10jenkins-bot: Configure wbsearchentities profile parameter on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806931 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE)) [13:04:29] (03PS1) 10Volans: raid: fix compression in get-raid-status-megacli [puppet] - 10https://gerrit.wikimedia.org/r/818120 (https://phabricator.wikimedia.org/T313952) [13:04:32] RECOVERY - SSH on wtp1041.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:04:53] hm, mediawiki-staging is already ahead of upstream before I’m running git fetch [13:04:57] let’s see if that resolves itself in a moment… [13:05:03] it does, yay [13:05:32] testing on mwdebug1001 [13:05:53] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:06:42] seems to work fine, I’ll sync [13:07:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:08:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[13:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:09:25] phuedx: interestingly, all the rate 1 lines in the diff have a comma at the end, and all the rate 0 lines don’t [13:09:43] I guess that’s also due to them coming from IS.php vs IS-labs.php, or something like that [13:09:49] :o [13:09:57] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:10:27] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806931|Configure wbsearchentities profile parameter on Wikidata (T307869)]] (duration: 03m 25s) [13:10:31] T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869 [13:10:45] seems to be either rate 1, unit pageView; or unit pageView, rate 0 [13:11:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:12:02] why is Gerrit claiming the change is up to date and not letting me rebase it? 
[13:12:08] I just merged another change into master didn’t I [13:12:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:12:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:12:48] (03PS4) 10Lucas Werkmeister (WMDE): testwiki: Add mediawiki.web_ui.interactions stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268) (owner: 10Phuedx) [13:13:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:13:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P32047 and previous config saved to /var/cache/conftool/dbconfig/20220728-131329-marostegui.json [13:13:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:13:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] testwiki: Add mediawiki.web_ui.interactions stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268) (owner: 10Phuedx) [13:13:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:14:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:14:38] (03Merged) 10jenkins-bot: testwiki: Add mediawiki.web_ui.interactions stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268) (owner: 10Phuedx) [13:15:13] phuedx: the change is on mwdebug1001, can you test it? 
[13:15:46] On it [13:17:27] (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/debmonitor] - 10https://gerrit.wikimedia.org/r/817722 (owner: 10Volans) [13:17:36] (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/cumin] - 10https://gerrit.wikimedia.org/r/817726 (owner: 10Volans) [13:18:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:18:14] Lucas_WMDE: I've confirmed that the stream config is only sent to the client on testwiki (and, for example, not on enwiki) and that it has a sampling rate of 1 on testwiki [13:18:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32048 and previous config saved to /var/cache/conftool/dbconfig/20220728-131818-root.json [13:19:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:12] phuedx: thanks! [13:19:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:20:05] Lucas_WMDE: Confirmed that the appropriate events are being sent only on testwiki [13:20:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:20:08] Lucas_WMDE: LGTM [13:20:20] oh, sorry, I thought the first message was already a LGTM ^^ [13:20:22] syncing now anyways [13:20:23] thanks! 
[13:21:18] (03Merged) 10jenkins-bot: Add configuration for the release script [software/debmonitor] - 10https://gerrit.wikimedia.org/r/817722 (owner: 10Volans) [13:22:56] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817225|testwiki: Add mediawiki.web_ui.interactions stream (T311268)]] (1/2) (duration: 03m 24s) [13:23:01] T311268: *WebUIActionsTracking migration to Metrics Platform - https://phabricator.wikimedia.org/T311268 [13:23:24] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [13:23:48] (03CR) 10Jbond: [C: 03+1] raid: fix compression in get-raid-status-megacli [puppet] - 10https://gerrit.wikimedia.org/r/818120 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans) [13:24:23] (03Merged) 10jenkins-bot: Add configuration for the release script [software/cumin] - 10https://gerrit.wikimedia.org/r/817726 (owner: 10Volans) [13:24:34] (03CR) 10Volans: [C: 03+2] raid: fix compression in get-raid-status-megacli [puppet] - 10https://gerrit.wikimedia.org/r/818120 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans) [13:25:31] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:26:30] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:817225|testwiki: Add mediawiki.web_ui.interactions stream (T311268)]] (2/2) (duration: 03m 19s) [13:27:00] 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314039 (10Volans) 05Open→03Resolved a:03Volans Resolving this to test that the raid handler can create a new one correctly. 
[13:27:10] !log UTC afternoon backport+config window done [13:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T312990)', diff saved to https://phabricator.wikimedia.org/P32049 and previous config saved to /var/cache/conftool/dbconfig/20220728-132835-marostegui.json [13:28:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1133.eqiad.wmnet with reason: Maintenance [13:28:40] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [13:28:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1133.eqiad.wmnet with reason: Maintenance [13:29:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1105.eqiad.wmnet with reason: Maintenance [13:29:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1105.eqiad.wmnet with reason: Maintenance [13:29:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32050 and previous config saved to /var/cache/conftool/dbconfig/20220728-132929-marostegui.json [13:31:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32051 and previous config saved to /var/cache/conftool/dbconfig/20220728-133157-marostegui.json [13:33:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32052 and previous config saved to /var/cache/conftool/dbconfig/20220728-133323-root.json [13:33:50] (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 
10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:34:18] RECOVERY - MegaRAID on ms-be2067 is OK: testing get_raid_status_megacli https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:34:18] this is me ^^^ [13:35:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10cmooney) @Andrew I'm reluctant to allocate more space for WMCS in Codfw, when there is a /29 already allocated and not being used. So I've routed 185.... [13:38:04] (03PS4) 10Ssingh: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) [13:38:11] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36467/console" [puppet] - 10https://gerrit.wikimedia.org/r/818077 (owner: 10Giuseppe Lavagetto) [13:38:49] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36468/console" [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:39:34] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:42:32] (03CR) 10Andrew Bogott: [C: 03+2] P:prometheus::openstack_exporter: fix executable permissions [puppet] - 10https://gerrit.wikimedia.org/r/818075 (https://phabricator.wikimedia.org/T314016) (owner: 10Majavah) [13:43:06] (03CR) 10Vgutierrez: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - 
10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:43:29] (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:46:08] (03PS5) 10Ssingh: trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) [13:46:14] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We need to fix noc.wikimedia.org at the very least before this can be done, but more importantly, we need to verify nothing on the appserv" [puppet] - 10https://gerrit.wikimedia.org/r/818079 (owner: 10Giuseppe Lavagetto) [13:46:51] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36469/console" [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:47:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P32053 and previous config saved to /var/cache/conftool/dbconfig/20220728-134703-marostegui.json [13:47:16] (03CR) 10Ssingh: [V: 03+1] "rebased on production, no code change" [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:47:47] (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32054 and previous config saved to /var/cache/conftool/dbconfig/20220728-134828-root.json [13:51:15] (03PS6) 10Ssingh: trafficserver: 9.x 
upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) [13:52:11] (03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818127 (https://phabricator.wikimedia.org/T313896) (owner: 10Michael Große) [13:52:23] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36471/console" [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [13:53:32] PROBLEM - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:53:33] ACKNOWLEDGEMENT - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T314049 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:53:38] 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10ops-monitoring-bot) [13:58:46] (03CR) 10Klausman: [C: 03+1] Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [13:59:32] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Volans) [14:00:02] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10Python3-Porting: Migrate get-raid-status-megacli to Python3 - https://phabricator.wikimedia.org/T313952 (10Volans) 05Open→03Resolved a:03Volans The script has been converted to Python 3 and it's now working again. For the speci... 
[14:00:37] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@137a4ff]: (no justification provided) [14:00:51] (03PS7) 10Ssingh: trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) [14:01:23] 10SRE, 10Traffic, 10Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10CDanis) [14:01:27] 10SRE, 10Traffic, 10Patch-For-Review: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 (10CDanis) [14:01:44] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:02:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P32055 and previous config saved to /var/cache/conftool/dbconfig/20220728-140209-marostegui.json [14:02:24] 10SRE, 10Traffic, 10Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10CDanis) Awaiting {T309651} to continue testing [14:02:41] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@137a4ff]: (no justification provided) (duration: 02m 03s) [14:03:04] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:05:00] RECOVERY - PyBal connections to etcd on lvs3005 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [14:05:20] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:09:42] (03PS8) 10Ssingh: trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 
10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) [14:10:31] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36475/console" [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:11:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:11:32] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:25] 10SRE, 10Data-Engineering, 10Data Pipelines (Sprint 00), 10Patch-For-Review: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10JArguello-WMF) [14:14:06] RECOVERY - PyBal connections to etcd on lvs3006 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [14:14:28] (03PS2) 10Ssingh: trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) [14:15:29] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36476/console" [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:16:06] (03CR) 10Ssingh: [V: 03+1] trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:16:14] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 36 connections established with conf1007.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [14:17:15] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1105:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32056 and previous config saved to /var/cache/conftool/dbconfig/20220728-141715-marostegui.json [14:17:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1134.eqiad.wmnet with reason: Maintenance [14:17:21] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [14:17:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1134.eqiad.wmnet with reason: Maintenance [14:17:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T312990)', diff saved to https://phabricator.wikimedia.org/P32057 and previous config saved to /var/cache/conftool/dbconfig/20220728-141736-marostegui.json [14:18:27] akosiaris: FYI there were also some BGP alerts above that I think are related to the pybal restarts [14:19:04] I checked a couple of them and looked already re-established [14:19:13] let's see if icinga agrees [14:19:48] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:01] All seem up in eqiad/codfw anyway. Most recent 8 mins back [14:20:04] thanks volans [14:20:15] ^ this is happening because of a misbehaving certificate transparency log, https://sabre.ct.comodo.com/ [14:20:25] if it persists, I will remove it from the list. nothing much we can do about it [14:20:27] thanks sukhe, I was about to have a quick look [14:21:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:21:20] volans: yup, it should. 
[14:22:00] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:22:14] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:26] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 71 connections established with conf1007.eqiad.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal [14:22:52] RECOVERY - PyBal connections to etcd on lvs3007 is OK: OK: 16 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [14:24:22] yup, here it is [14:24:28] probably done soon. [14:25:11] (03CR) 10Ebernhardson: elastic: Restart masters one at a time after all others (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [14:26:28] (03CR) 10Volans: [C: 03+1] "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [14:31:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Beta: add configuration for redirect badges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818127 (https://phabricator.wikimedia.org/T313896) (owner: 10Michael Große) [14:32:30] (03PS13) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [14:32:33] (03PS1) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 [14:37:18] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 119 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [14:41:52] PROBLEM - etcd service on conf1006 is CRITICAL: CRITICAL - Expecting active but unit etcd is inactive 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:42:00] PROBLEM - Etcd cluster health on conf1006 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [14:42:01] (03CR) 10MVernon: [C: 03+2] hieradata: move all of sessionstore to 3.11.13 [puppet] - 10https://gerrit.wikimedia.org/r/817798 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon) [14:42:36] PROBLEM - Etcd cluster health on conf1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [14:43:10] PROBLEM - Etcd cluster health on conf1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [14:43:14] PROBLEM - etcd service on conf1004 is CRITICAL: CRITICAL - Expecting active but unit etcd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:43:22] PROBLEM - etcd service on conf1005 is CRITICAL: CRITICAL - Expecting active but unit etcd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:43:45] (JobUnavailable) firing: Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:05] ^ known? [14:44:07] why are these even alerting. 
I 've disabled alerting on all conf100456 hosts [14:44:40] sukhe: yeah, I 've remove the role shortly and it should pick up again [14:44:41] g [14:44:43] grr [14:44:51] np :) [14:46:05] !log mvernon@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: upgrade to 3.11.13 T309896 - mvernon@cumin2002 [14:46:11] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [14:47:12] (03PS2) 10Alexandros Kosiaris: conf100[456]: Remove them from server SRV RRs [dns] - 10https://gerrit.wikimedia.org/r/817261 (https://phabricator.wikimedia.org/T311408) [14:47:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] conf100[456]: Remove them from server SRV RRs [dns] - 10https://gerrit.wikimedia.org/r/817261 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [14:48:49] (03PS18) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) [14:56:13] (03PS1) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [14:57:25] (03PS2) 10Alexandros Kosiaris: Remove mentions of conf100[456] [puppet] - 10https://gerrit.wikimedia.org/r/817266 (https://phabricator.wikimedia.org/T311408) [14:57:30] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove mentions of conf100[456] [puppet] - 10https://gerrit.wikimedia.org/r/817266 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [14:57:32] (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott) [14:58:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312990)', diff saved to https://phabricator.wikimedia.org/P32061 and previous config saved to /var/cache/conftool/dbconfig/20220728-145805-marostegui.json [14:58:11] 
T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [14:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:59:32] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10RobH) [15:02:33] (03PS2) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [15:02:49] (03PS1) 10Jdrewniak: Revert "styles: Unify on standard external link icon" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818147 (https://phabricator.wikimedia.org/T261391) [15:02:57] (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott) [15:03:12] (03PS9) 10Ssingh: trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) [15:03:58] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36480/console" [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:05:07] (03PS3) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [15:05:46] (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott) [15:08:59] (03PS3) 10Ssingh: trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) [15:09:04] (03PS4) 10Andrew Bogott: cloudrabbit hosts: 
add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [15:09:44] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36481/console" [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:10:03] (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott) [15:10:46] (03PS1) 10Phuedx: Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) [15:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P32062 and previous config saved to /var/cache/conftool/dbconfig/20220728-151311-marostegui.json [15:13:49] (03CR) 10Phuedx: [C: 04-2] "DNM until after I2fb990ee086 has been deployed (Thursday, 4th August 2022 at ~20:00 UTC)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx) [15:14:14] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:14:41] (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:14:52] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:15:37] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:16:33] (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:17:51] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission for hosts conf[1004-1006].eqiad.wmnet [15:20:32] (03CR) 10Ebernhardson: elastic: Restart masters one at a time after all others (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [15:20:52] (03PS1) 10Alexandros Kosiaris: mwdebug: Switch conf100[456] to conf100[789] [deployment-charts] - 10https://gerrit.wikimedia.org/r/818140 (https://phabricator.wikimedia.org/T311408) [15:22:07] (03PS5) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [15:22:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: 
upgrade to 3.11.13 T309896 - mvernon@cumin2002 [15:22:18] T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896 [15:23:01] (03PS1) 10Alexandros Kosiaris: datahub: Switchover conf1004 to conf1007 [deployment-charts] - 10https://gerrit.wikimedia.org/r/818141 (https://phabricator.wikimedia.org/T311408) [15:23:45] EventGate broken? [15:23:49] https://grafana.wikimedia.org/d/000000326/navigation-timing-alerts?orgId=1&refresh=5m [15:24:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [15:24:45] [10min ago][#wikimedia-perf-bots] (Low amount of navigation timing data for group 2) firing: Low amount of navigation timing data for group 2 - https://alerts.wikimedia.org/?q=alertname%3DLow+amount+of+navigation+timing+data+for+group+2 [15:25:28] Krinkle: doesn't look like it? https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All&from=now-12h&to=now [15:25:31] at least not eventgate [15:26:35] I have failed over etcd and zookeeper hosts to the newer machines but I haven't witnessed anything yet. kafka's have been restarted, other conf clients have been restarted too [15:26:37] (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott) [15:27:36] akosiaris: webperf navtiming.py perhaps? 
[15:28:07] (03CR) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [15:28:14] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All&from=now-15m&to=now [15:28:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P32063 and previous config saved to /var/cache/conftool/dbconfig/20220728-152817-marostegui.json [15:28:20] (03PS6) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) [15:28:41] This is the EventGate instance for client events, navtiming still coming in there [15:28:52] So it's lost between intake and graphite [15:29:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [15:29:55] (03PS6) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [15:30:06] (03PS1) 10Ssingh: hiera: enable ATS9 on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/818144 (https://phabricator.wikimedia.org/T309651) [15:30:22] Krinkle: navtiming is spewing an exception indeed [15:30:44] pasting it to phab [15:30:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36486/console" [puppet] - 10https://gerrit.wikimedia.org/r/818144 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:31:21] Krinkle: 
https://phabricator.wikimedia.org/P32064 [15:31:46] (03PS7) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [15:35:04] (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/818144 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:35:06] (03CR) 10Ori: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori) [15:36:03] 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10odimitrijevic) Approved! [15:36:12] (03PS8) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [15:36:33] 10SRE, 10SRE-Access-Requests: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10odimitrijevic) Approved [15:36:47] 10SRE, 10SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10odimitrijevic) Approved [15:36:59] (03PS2) 10Ori: Randomize thumbnail TTL to prevent stampedes [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) [15:37:47] !depool ats-be on cp4026 for ATS9 testing [15:37:47] for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done [15:37:51] ha [15:37:56] !log depool ats-be on cp4026 for ATS9 testing [15:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:57] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4026.ulsfo.wmnet,service=ats-be [15:43:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] mwdebug: Switch conf100[456] to conf100[789] [deployment-charts] - 
10https://gerrit.wikimedia.org/r/818140 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [15:43:22] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/818144 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:43:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312990)', diff saved to https://phabricator.wikimedia.org/P32066 and previous config saved to /var/cache/conftool/dbconfig/20220728-154323-marostegui.json [15:43:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1128.eqiad.wmnet with reason: Maintenance [15:43:28] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [15:43:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1128.eqiad.wmnet with reason: Maintenance [15:43:41] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [15:43:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T312990)', diff saved to https://phabricator.wikimedia.org/P32067 and previous config saved to /var/cache/conftool/dbconfig/20220728-154344-marostegui.json [15:45:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] datahub: Switchover conf1004 to conf1007 [deployment-charts] - 10https://gerrit.wikimedia.org/r/818141 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [15:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312990)', diff saved to https://phabricator.wikimedia.org/P32068 and previous config saved to /var/cache/conftool/dbconfig/20220728-154607-marostegui.json [15:46:57] (03Merged) 10jenkins-bot: mwdebug: Switch conf100[456] to conf100[789] [deployment-charts] - 
10https://gerrit.wikimedia.org/r/818140 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [15:49:30] (03Merged) 10jenkins-bot: datahub: Switchover conf1004 to conf1007 [deployment-charts] - 10https://gerrit.wikimedia.org/r/818141 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris) [15:52:18] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for FNegri - https://phabricator.wikimedia.org/T314066 (10fnegri) [15:52:49] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: sync [15:52:53] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync [15:53:18] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for FNegri - https://phabricator.wikimedia.org/T314066 (10fnegri) Please note I was already added to the "wmf" LDAP group by @Andrew because we both didn't realize the correct procedure was to go through this ticket! I still need to be added to the wmf-nda Pha... [15:54:19] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:54:43] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:55:16] (03PS3) 10Jbond: P:gerrit: Export sshkey for gerrit shared services [puppet] - 10https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857) [15:57:09] PROBLEM - Ensure traffic_server is running for instance backend on cp4026 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:57:22] ^^ that's sukhe & me [15:57:45] thanks [15:59:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36489/console" [puppet] - 10https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [16:00:05] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P32069 and previous config saved to /var/cache/conftool/dbconfig/20220728-160113-marostegui.json [16:08:43] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4026 is OK: HTTP OK: HTTP/1.1 200 Ok - 35278 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:09:07] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4026 is OK: HTTP OK: HTTP/1.0 200 OK - 24940 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:09:15] RECOVERY - Ensure traffic_server is running for instance backend on cp4026 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:11:47] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [16:11:57] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: sync [16:12:06] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync [16:12:33] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:15:11] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic1056.eqiad.wmnet [16:15:12] (03PS1) 10Vgutierrez: trafficserver: Avoid loading plugins from /run [puppet] - 10https://gerrit.wikimedia.org/r/818172 (https://phabricator.wikimedia.org/T309651) [16:16:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P32070 and previous config saved to /var/cache/conftool/dbconfig/20220728-161621-marostegui.json [16:16:51] (03CR) 10Ssingh: [C: 03+1] trafficserver: Avoid loading plugins from /run [puppet] - 10https://gerrit.wikimedia.org/r/818172 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [16:16:57] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36491/console" [puppet] - 10https://gerrit.wikimedia.org/r/818172 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [16:17:17] (03CR) 10Jbond: "also need to add querysort vmod" [puppet] - 10https://gerrit.wikimedia.org/r/818134 (owner: 10Jbond) [16:17:26] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Avoid loading plugins from /run [puppet] - 10https://gerrit.wikimedia.org/r/818172 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [16:21:30] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/datahub: sync on main [16:21:35] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:21:41] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: sync on main [16:21:49] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [16:21:59] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main [16:22:03] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: 
sync on main [16:23:31] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:23:45] (JobUnavailable) resolved: Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:23:50] ^ expected [16:23:55] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:23:57] vgutierrez and I are fixing [16:23:59] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:24:10] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:24:12] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts conf[1004-1006].eqiad.wmnet [16:24:41] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1056.eqiad.wmnet [16:26:58] (03PS1) 10Ssingh: trafficserver: add top-level tag for logging [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) [16:28:06] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36492/console" [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [16:29:16] (03PS2) 10Ssingh: trafficserver: add top-level tag for logging [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) [16:30:09] PROBLEM - Ensure traffic_server is running for instance 
backend on cp4026 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:30:11] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36493/console" [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [16:31:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312990)', diff saved to https://phabricator.wikimedia.org/P32071 and previous config saved to /var/cache/conftool/dbconfig/20220728-163127-marostegui.json [16:31:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:31:34] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [16:31:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:31:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T312990)', diff saved to https://phabricator.wikimedia.org/P32072 and previous config saved to /var/cache/conftool/dbconfig/20220728-163149-marostegui.json [16:33:58] 10SRE, 10ops-eqiad, 10DC-Ops: Please verify location of an-worker1111.eqiad.wmnet - https://phabricator.wikimedia.org/T298785 (10Cmjohnson) 05Open→03Resolved [16:34:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T312990)', diff saved to https://phabricator.wikimedia.org/P32073 and previous config saved to /var/cache/conftool/dbconfig/20220728-163412-marostegui.json [16:34:25] 10SRE, 10ops-eqiad, 10DC-Ops: Please verify location of an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T298621 (10Cmjohnson) 05Open→03Resolved [16:35:19] RECOVERY - Ensure traffic_exporter for the backend 
instance binds on port 9122 and responds to HTTP requests on cp4026 is OK: HTTP OK: HTTP/1.0 200 OK - 24969 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:35:25] RECOVERY - Ensure traffic_server is running for instance backend on cp4026 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:37:10] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: add top-level tag for logging [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [16:37:19] (03PS5) 10Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) [16:38:08] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:38:51] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4026 is OK: HTTP OK: HTTP/1.1 200 Ok - 35353 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:40:59] (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: add top-level tag for logging [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [16:41:12] (03CR) 10Dzahn: [V: 03+1] "now it works. 
creates firewall rules/rsync server/monitoring on gerrit2002, while on gerrit1001/2001 it does nothing except add the new ho" [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [16:41:59] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10Cmjohnson) a:03Jclark-ctr [16:42:46] !log disabling puppet on gerrit servers for a change in gerrit puppet code [16:42:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:11] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36495/gerrit2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [16:44:17] (03PS6) 10Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) [16:45:12] !log pooling ats-be@cp4026 running ATS 9.1.2 - T309651 [16:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:17] T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 [16:49:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P32074 and previous config saved to /var/cache/conftool/dbconfig/20220728-164918-marostegui.json [16:55:26] (03CR) 10Filippo Giunchedi: "Idea LGTM! 
See inline" [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori) [16:55:43] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:57:28] (03CR) 10Dzahn: "gerrit2002 getting rsyncd / firewall rules.. and noop confirmed on prod hosts gerrit1001/gerrit2001 -" [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [16:58:35] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:05] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1700). [17:02:42] (03PS1) 10Dzahn: admin: add gerrit access groups to gerrit migration role [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) [17:04:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P32075 and previous config saved to /var/cache/conftool/dbconfig/20220728-170424-marostegui.json [17:04:54] (03CR) 10Dzahn: "@Jbond This situation does not make it an access request, right? No changes to groups, new hardware replacing old hardware..." [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [17:07:23] (03CR) 10Dzahn: "also sets contact groups for Icinga monitoring which just got added. 
so in theory you get notified and have privs in Icinga because you ar" [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [17:10:05] (03CR) 10Dzahn: "@Jbond same here, it's only about giving access to new hardware that is replacing old hardware with the twist that it would be nice to hav" [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [17:16:27] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:18:11] (03PS2) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) [17:18:18] !log [Elastic] `sudo disable-puppet "production issue"` && `sudo systemctl stop mjolnir-kafka-bulk-daemon.service` on `ryankemper@search-loader1001` [17:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T312990)', diff saved to https://phabricator.wikimedia.org/P32076 and previous config saved to /var/cache/conftool/dbconfig/20220728-171930-marostegui.json [17:19:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1106.eqiad.wmnet with reason: Maintenance [17:19:37] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [17:19:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1106.eqiad.wmnet with reason: Maintenance [17:19:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:19:50] !log [Elastic] `ryankemper@search-loader2001:~$ sudo disable-puppet 
"production issue" && sudo systemctl stop mjolnir-kafka-bulk-daemon.service` just to be safe (we prob only needed to halt eqiad) [17:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:20:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T312990)', diff saved to https://phabricator.wikimedia.org/P32077 and previous config saved to /var/cache/conftool/dbconfig/20220728-172008-marostegui.json [17:22:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312990)', diff saved to https://phabricator.wikimedia.org/P32078 and previous config saved to /var/cache/conftool/dbconfig/20220728-172235-marostegui.json [17:23:14] !log [Elastic] Restarting `elastic1072` after halting mjolnir bulk daemons: `ryankemper@elastic1072:~$ sudo depool && sleep 30 && sudo systemctl restart elasticsearch_6* && sleep 30 && sudo pool` [17:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:41] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (10Papaul) @fgiunchedi f1-f4 PDU's are not setup yet [17:33:33] (03PS14) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [17:34:30] (03PS15) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [17:34:59] (03PS9) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 [17:35:01] (03PS3) 10Andrew Bogott: hieradata: switch traffic to cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah) [17:36:52] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (10wiki_willy) a:03Cmjohnson 
[17:37:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P32079 and previous config saved to /var/cache/conftool/dbconfig/20220728-173742-marostegui.json [17:38:58] (03PS16) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [17:39:50] (03CR) 10Jbond: C:varnish: Rate limit hotlinking (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [17:40:15] (03PS3) 10Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq [dns] - 10https://gerrit.wikimedia.org/r/817877 [17:41:04] !log [Elastic] Re-running `delete`s and `update`s from `2022-07-28T15:00:00Z` until `2022-07-28T17:30:00Z` on `ryankemper@mwmaint1002` tmux `mlr_outage` [17:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:25] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:52:35] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) on gerrit2002 we now have, created by the migration class: - a group "gerrit2" - a user "gerrit2" - a directory /srv/gerrit - package rsync installed, /etc/def... 
[17:52:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P32080 and previous config saved to /var/cache/conftool/dbconfig/20220728-175248-marostegui.json [17:53:05] (03PS3) 10Ori: Randomize thumbnail TTL to prevent stampedes [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) [17:53:53] (03CR) 10Ori: Randomize thumbnail TTL to prevent stampedes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori) [17:54:04] 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) [17:54:09] (03CR) 10Dzahn: [C: 04-2] "this is blocked on https://phabricator.wikimedia.org/T313972" [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [17:55:11] (03CR) 10Dzahn: "blocked by https://phabricator.wikimedia.org/T313972" [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [17:55:32] (03CR) 10Dzahn: [C: 04-2] "blocked by https://phabricator.wikimedia.org/T313972" [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [17:55:58] (03PS2) 10Dzahn: add gerrit-replica-new.wikimedia.org, point to 208.80.153.109 [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) [17:56:51] (03PS3) 10Dzahn: gerrit: add hiera settings for replica to gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) [18:00:04] brennen and jeena: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1800). 
[18:00:51] o/ [18:01:25] o/ - currently blocked on T314058 [18:01:25] T314058: TypeError: Argument 1 passed to Flow\Hooks::onSpecialCheckUserGetLinksFromRow() must be SpecialPage, CheckUserGetEditsPager given - https://phabricator.wikimedia.org/T314058 [18:01:51] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:03:01] there's a patch in the works for that blocker, shouldn't be too long i think. [18:03:34] (03PS2) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134 [18:03:36] (03PS17) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [18:06:47] !log [Elastic] Finished re-running `delete`s and `update`s from `2022-07-28T15:00:00Z` until `2022-07-28T17:30:00Z` [18:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312990)', diff saved to https://phabricator.wikimedia.org/P32081 and previous config saved to /var/cache/conftool/dbconfig/20220728-180754-marostegui.json [18:07:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1099.eqiad.wmnet with reason: Maintenance [18:07:59] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [18:08:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1099.eqiad.wmnet with reason: Maintenance [18:08:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32082 and previous config saved to /var/cache/conftool/dbconfig/20220728-180815-marostegui.json [18:10:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312990)', diff saved to 
https://phabricator.wikimedia.org/P32083 and previous config saved to /var/cache/conftool/dbconfig/20220728-181044-marostegui.json [18:15:56] (03PS18) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [18:25:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P32084 and previous config saved to /var/cache/conftool/dbconfig/20220728-182550-marostegui.json [18:28:19] !log gerrit: rsyncing /home from prod gerrit1001 to /srv/home-gerrit1001.wikimedia.org on gerrit2002 new replica T243027 T313250 [18:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:24] T243027: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 [18:28:25] T313250: Bring up Gerrit2002 - https://phabricator.wikimedia.org/T313250 [18:31:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrew what do you need one with these? The task was re-opened and I see some action but not sure what... [18:32:08] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10Cmjohnson) a:05Cmjohnson→03BTullis @BTullis Can we try and do this Monday, please? [18:33:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Cmjohnson) @nskaggs I would like to schedule this to be completed on Monday around 1600UTC. Does that work for you? 
[18:36:07] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:38:59] (03CR) 10AOkoth: [C: 03+1] gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [18:40:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P32085 and previous config saved to /var/cache/conftool/dbconfig/20220728-184056-marostegui.json [18:46:10] (03PS1) 10Zabe: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) [18:46:16] (03PS2) 10Brennen Bearnes: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe) [18:46:30] (03PS1) 10Zabe: Add CheckUser to phan analysis [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818155 [18:46:56] (03PS3) 10Zabe: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) [18:47:02] zabe: sorry to step on your toes there [18:47:18] no worries :) [18:47:48] (btw. 
I can test the fix once its merged) [18:48:33] cool [18:53:08] (03CR) 10Brennen Bearnes: [C: 03+2] Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe) [18:53:50] brennen, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/818155/1 needs to be merged first, in order to make CI pass for the actual fix [18:53:59] (but that patch does not need to be synced) [18:56:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32086 and previous config saved to /var/cache/conftool/dbconfig/20220728-185603-marostegui.json [18:56:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1119.eqiad.wmnet with reason: Maintenance [18:56:09] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [18:56:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1119.eqiad.wmnet with reason: Maintenance [18:56:23] (03CR) 10Brennen Bearnes: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe) [18:56:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T312990)', diff saved to https://phabricator.wikimedia.org/P32087 and previous config saved to /var/cache/conftool/dbconfig/20220728-185624-marostegui.json [18:56:33] (03CR) 10Brennen Bearnes: [C: 03+2] Add CheckUser to phan analysis [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818155 (owner: 10Zabe) [18:56:43] ah, right. i'm eternally getting myself confused by patch chains in gerrit. 
[18:58:05] !log gerrit: starting rsync of /srv/gerrit (>240GB) from prod gerrit1001 to /srv/gerrit on gerrit2002 new replica T243027 T313250 ..slowly ..with --bwlimit=1000 [18:58:05] T243027: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 [18:58:06] T313250: Bring up Gerrit2002 - https://phabricator.wikimedia.org/T313250 [18:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T312990)', diff saved to https://phabricator.wikimedia.org/P32088 and previous config saved to /var/cache/conftool/dbconfig/20220728-185847-marostegui.json [18:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [19:00:07] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@82e0383]: (no justification provided) [19:00:25] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@82e0383]: (no justification provided) (duration: 00m 17s) [19:03:00] (03PS10) 10Krinkle: multiversion: Add dblists-index.php for fast runtime lookups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816029 (https://phabricator.wikimedia.org/T169821) [19:03:15] (03PS9) 10Krinkle: multiversion: Switch getTagsForWiki() to fast dblists-index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816089 (https://phabricator.wikimedia.org/T169821) [19:03:18] (03PS10) 10Krinkle: multiversion: Switch getTagsForWiki() to fast dblists-index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816089 (https://phabricator.wikimedia.org/T169821) [19:06:20] (03CR) 10Brennen Bearnes: [C: 03+2] Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe) [19:12:23] (03Merged) 10jenkins-bot: Add CheckUser to phan analysis [extensions/Flow] (wmf/1.39.0-wmf.22) - 
10https://gerrit.wikimedia.org/r/818155 (owner: 10Zabe) [19:13:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P32089 and previous config saved to /var/cache/conftool/dbconfig/20220728-191353-marostegui.json [19:18:18] (03CR) 10Andrew Bogott: [C: 03+2] wikimediacloud.org: add cname records for rabbitmq [dns] - 10https://gerrit.wikimedia.org/r/817877 (owner: 10Andrew Bogott) [19:21:51] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:22:17] 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10Papaul) 05Open→03Resolved a:03Papaul [19:23:30] (03Merged) 10jenkins-bot: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe) [19:27:20] zabe: should be on mwdebug1002 [19:27:57] lemme see [19:28:59] brennen, looks good [19:28:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P32090 and previous config saved to /var/cache/conftool/dbconfig/20220728-192859-marostegui.json [19:29:59] zabe: cool, syncing [19:31:47] 10SRE, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10RhinosF1) [19:32:02] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10RhinosF1) [19:32:17] volans: ^ is marked as UBN! 
[19:34:33] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/Flow: Backport: [[gerrit:818154|Update CheckUser hook for pagination (T314058 T314069)]] (duration: 03m 16s) [19:34:39] T314058: TypeError: Argument 1 passed to Flow\Hooks::onSpecialCheckUserGetLinksFromRow() must be SpecialPage, CheckUserGetEditsPager given - https://phabricator.wikimedia.org/T314058 [19:34:39] T314069: Fatal exception of type "TypeError" when using checkuser "Get edits" on Wikidata - https://phabricator.wikimedia.org/T314069 [19:35:03] (03PS1) 10Andrew Bogott: Reorder the list of of profile::openstack::eqiad1::openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/818210 (https://phabricator.wikimedia.org/T313268) [19:35:55] !log 1.39.0-wmf.22 train (T308075): blocker resolved, rolling to all wikis [19:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:00] T308075: 1.39.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T308075 [19:36:45] (03PS1) 10TrainBranchBot: all wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818212 (https://phabricator.wikimedia.org/T308075) [19:36:47] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818212 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot) [19:37:33] (03CR) 10Dzahn: [C: 03+1] "code matches the explanation, makes sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/817759 (https://phabricator.wikimedia.org/T311746) (owner: 10Jelto) [19:39:13] (03CR) 10Andrew Bogott: [C: 03+2] Reorder the list of of profile::openstack::eqiad1::openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/818210 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott) [19:39:57] PROBLEM - Check systemd state on mw2389 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:00] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818212 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot) [19:44:00] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.22 refs T308075 [19:44:05] T308075: 1.39.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T308075 [19:44:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T312990)', diff saved to https://phabricator.wikimedia.org/P32091 and previous config saved to /var/cache/conftool/dbconfig/20220728-194405-marostegui.json [19:44:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1118.eqiad.wmnet with reason: Maintenance [19:44:12] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [19:44:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1118.eqiad.wmnet with reason: Maintenance [19:44:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T312990)', diff saved to https://phabricator.wikimedia.org/P32092 and previous config saved to /var/cache/conftool/dbconfig/20220728-194426-marostegui.json [19:45:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Andrew) >>! In T305414#8113020, @Cmjohnson wrote: > @andrew what do you need one with these? The task was re-opened and I see some action but... 
[19:45:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:46:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:46:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:46:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Andrew) @nskaggs is out for several weeks, so this should wait until late August unless someone else appears who wants to coordinate on this. [19:46:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312990)', diff saved to https://phabricator.wikimedia.org/P32093 and previous config saved to /var/cache/conftool/dbconfig/20220728-194654-marostegui.json [19:47:03] (03PS1) 10Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - 10https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) [19:47:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10taavi) 05Open→03Resolved That firewall issue should be sorted with my latest patch above. [19:47:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:49:43] (03CR) 10Dzahn: [C: 03+2] "I consider this just part of the setup task or prep for migration, not an access request." [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [19:50:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Andrew) Yep, it's closed! 
[19:50:28] (03PS2) 10Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - 10https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) [19:53:17] (03CR) 10Dzahn: [C: 03+2] "this was actually a noop, you all already had access. it was done in ./hosts/gerrit2002.yaml though, not as nice as by role" [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [19:54:12] (03CR) 10CI reject: [V: 04-1] elastic: alert on per-node indexing not occurring [alerts] - 10https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) (owner: 10Ryan Kemper) [19:54:25] (03PS1) 10Dzahn: gerrit/hieradata: delete ./hosts/gerrit2002.yaml [puppet] - 10https://gerrit.wikimedia.org/r/818216 [19:54:53] (03CR) 10Dzahn: [C: 03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/818216/" [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [19:55:10] (03CR) 10Dzahn: [C: 03+2] gerrit/hieradata: delete ./hosts/gerrit2002.yaml [puppet] - 10https://gerrit.wikimedia.org/r/818216 (owner: 10Dzahn) [19:56:07] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:56:13] (03PS3) 10Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - 10https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) [20:00:05] brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T2000). 
[20:00:05] jan_drewniak and stephanebisson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] o/ [20:00:19] Hi [20:01:21] o/ [20:01:22] I have to run in about 20 minutes. It would be great if we can start with my patch. If not, no big deal, I'll reschedule it for next week. [20:01:45] sure [20:01:55] the other looks like it'll take longer to merge [20:02:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P32094 and previous config saved to /var/cache/conftool/dbconfig/20220728-200200-marostegui.json [20:02:11] (03CR) 10Thcipriani: [C: 03+2] Revert "styles: Unify on standard external link icon" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818147 (https://phabricator.wikimedia.org/T261391) (owner: 10Jdrewniak) [20:03:14] stephanebisson: could you rebase your patch? Gerrit isn't able to rebase it automagically. [20:04:06] on it [20:05:35] (03PS3) 10Sbisson: Register Wikistories streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) [20:05:59] (03CR) 10Thcipriani: [C: 03+2] Register Wikistories streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) (owner: 10Sbisson) [20:07:05] (03Merged) 10jenkins-bot: Register Wikistories streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) (owner: 10Sbisson) [20:08:36] stephanebisson: you patch is on mwdebug1002, check please [20:08:46] thcipriani ok [20:12:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:13:27] thcipriani it looks like I cannot fully test it from a single test server but I think it's ok to sync. 
[20:13:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:13:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:45] stephanebisson: ok, going live [20:14:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:14:37] (03CR) 10Dzahn: [C: 03+1] P:gerrit: Export sshkey for gerrit shared services [puppet] - 10https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond) [20:16:30] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10Andrew) 05Open→03Resolved This works! ` +----------------------+--------------------------------------+ | Field | Value... [20:17:00] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:17:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P32095 and previous config saved to /var/cache/conftool/dbconfig/20220728-201706-marostegui.json [20:17:58] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:18:21] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817263|Register Wikistories streams (T313633)]] (duration: 03m 24s) [20:18:25] T313633: Register Wikistories streams in InitialiseSettings.php - https://phabricator.wikimedia.org/T313633 [20:18:34] ^ 
stephanebisson should be live now [20:18:46] thcipriani thanks! [20:19:12] I wonder if scap did something to make that high average get latency with the restart? also...codfw? [20:19:48] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:20:03] well [20:20:05] nevermind [20:23:29] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1001/36501/phab2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [20:26:09] (03CR) 10Dzahn: "note how production catalog on phab1001 has git-ssh stuff.. but this host does not..even though it gets other phabricator things." [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [20:26:38] (03Merged) 10jenkins-bot: Revert "styles: Unify on standard external link icon" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818147 (https://phabricator.wikimedia.org/T261391) (owner: 10Jdrewniak) [20:28:28] ok, merged [20:28:46] jan_drewniak: around? 
[20:28:47] (03CR) 10Andrew Bogott: [C: 03+2] Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) (owner: 10Nskaggs) [20:32:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312990)', diff saved to https://phabricator.wikimedia.org/P32096 and previous config saved to /var/cache/conftool/dbconfig/20220728-203212-marostegui.json [20:32:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1140.eqiad.wmnet with reason: Maintenance [20:32:18] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [20:32:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1140.eqiad.wmnet with reason: Maintenance [20:32:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2103.codfw.wmnet with reason: Maintenance [20:33:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2103.codfw.wmnet with reason: Maintenance [20:33:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 16 hosts with reason: Maintenance [20:33:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 16 hosts with reason: Maintenance [20:33:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:34:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:34:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1135.eqiad.wmnet with reason: Maintenance [20:34:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1135.eqiad.wmnet with reason: Maintenance 
[20:34:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T312990)', diff saved to https://phabricator.wikimedia.org/P32097 and previous config saved to /var/cache/conftool/dbconfig/20220728-203446-marostegui.json [20:37:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312990)', diff saved to https://phabricator.wikimedia.org/P32098 and previous config saved to /var/cache/conftool/dbconfig/20220728-203709-marostegui.json [20:38:46] (03PS1) 10Thcipriani: Revert "Revert "styles: Unify on standard external link icon"" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818157 [20:38:57] (03CR) 10Thcipriani: [C: 03+2] Revert "Revert "styles: Unify on standard external link icon"" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818157 (owner: 10Thcipriani) [20:43:22] (03CR) 10Dzahn: "a lot of the changes are all about exim because once upon a time "mail to phab tasK" was a thing. we need to go through the whole https://" [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [20:45:48] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) ` papaul@lsw1-e2-eqiad> show interfaces descriptions | match db1191 ge-0/0/40 db1191 {#2013339101930} ` ` papaul@lsw1-e2-... [20:52:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P32099 and previous config saved to /var/cache/conftool/dbconfig/20220728-205215-marostegui.json [20:53:24] (03CR) 10Brennen Bearnes: [C: 03+1] "Just looking this over with Jeena - approach seems totally reasonable." 
[phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall) [20:58:30] (03Merged) 10jenkins-bot: Revert "Revert "styles: Unify on standard external link icon"" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818157 (owner: 10Thcipriani) [21:03:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10Andrew) 05Resolved→03Open These IPs are reachable from within codfw1dev but not from the greated Internet. @cmooney is that what you'd expect? It's... [21:03:56] !log brennen@deploy1002 Started deploy [phabricator/deployment@a21dea9]: test deploy to phab2001 [21:04:22] !log brennen@deploy1002 Finished deploy [phabricator/deployment@a21dea9]: test deploy to phab2001 (duration: 00m 27s) [21:06:57] !log brennen@deploy1002 Started deploy [phabricator/deployment@a0f0699]: test deploy to phab2001 (take 2) [21:07:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P32100 and previous config saved to /var/cache/conftool/dbconfig/20220728-210721-marostegui.json [21:07:24] !log brennen@deploy1002 Finished deploy [phabricator/deployment@a0f0699]: test deploy to phab2001 (take 2) (duration: 00m 27s) [21:08:34] (03PS1) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950) [21:09:10] (03PS3) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) [21:17:08] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10odimitrijevic) a:03RKemper [21:18:06] !log mforns@deploy1002 Started 
deploy [airflow-dags/analytics@5ec2435]: (no justification provided) [21:18:15] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@5ec2435]: (no justification provided) (duration: 00m 09s) [21:22:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312990)', diff saved to https://phabricator.wikimedia.org/P32102 and previous config saved to /var/cache/conftool/dbconfig/20220728-212227-marostegui.json [21:22:33] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [21:26:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:48] (03PS1) 10Brennen Bearnes: scap: stub out a checks.yaml [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953) [21:38:37] oh no! I missed the backport window 🤦‍♂️ thcipriani: is it too late to do my backport now? 
[21:40:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:41:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:41:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:42:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:46:08] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:51:12] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@e8d4704]: (no justification provided) [21:51:22] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@e8d4704]: (no justification provided) (duration: 00m 09s) [22:00:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:46] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-c1505.scope,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:09:00] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:15:43] (03CR) 10Jforrester: [C: 03+1] "This looks fine to me. Should we deploy this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: 10MarcoAurelio) [22:18:08] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:21:53] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@9ea9cd1]: (no justification provided) [22:22:03] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@9ea9cd1]: (no justification provided) (duration: 00m 09s) [22:27:19] (03CR) 10Brennen Bearnes: [C: 03+1] site: add phabricator role to phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:32:48] (03PS1) 10BCornwall: acme-chief: use /usr/bin/env as python interpreter [puppet] - 10https://gerrit.wikimedia.org/r/818234 [22:54:42] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:58:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [23:06:13] (03PS1) 10Ebernhardson: Release updated version of search-extra for 6.8.23 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818241 [23:06:35] (03CR) 10Ori: "Friendly ping" [deployment-charts] - 10https://gerrit.wikimedia.org/r/816203 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [23:08:31] (03CR) 10Bking: [C: 03+2] "Plugins and changelog...nice!" 
[software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818241 (owner: 10Ebernhardson) [23:09:48] (03CR) 10Bking: [V: 03+2 C: 03+2] Release updated version of search-extra for 6.8.23 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818241 (owner: 10Ebernhardson) [23:21:36] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:22:07] (03PS2) 10Tim Starling: Multi-DC routing special cases for OAuth [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) [23:22:23] (03CR) 10Tim Starling: Multi-DC routing special cases for OAuth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [23:29:50] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10RKemper) p:05Unbreak!→03High [23:30:17] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10RKemper) >>! In T311176#8113236, @RhinosF1 wrote: > @EChetty: Can you please clarify raising this as UBN? Is this something work should be dropped immediately to do or ca... 
[23:39:17] (03PS1) 10Ryan Kemper: analytics-admins: add xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/818266 (https://phabricator.wikimedia.org/T311176) [23:40:26] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:50:46] (03CR) 10Tim Starling: [C: 03+2] Multi-DC routing special cases for OAuth [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling) [23:53:46] (03PS1) 10Ryan Kemper: kerberos: mraish has kerberos principal now [puppet] - 10https://gerrit.wikimedia.org/r/818267 (https://phabricator.wikimedia.org/T313316) [23:58:03] (03CR) 10Ryan Kemper: "Merging because the corresponding kerberos principal (user) has been added for mraish" [puppet] - 10https://gerrit.wikimedia.org/r/818267 (https://phabricator.wikimedia.org/T313316) (owner: 10Ryan Kemper) [23:58:05] (03CR) 10Ryan Kemper: [C: 03+2] kerberos: mraish has kerberos principal now [puppet] - 10https://gerrit.wikimedia.org/r/818267 (https://phabricator.wikimedia.org/T313316) (owner: 10Ryan Kemper)