[00:00:40] <wikibugs>	 10ops-codfw, 10serviceops, 10Patch-For-Review: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10RLazarus) a:05RLazarus→03Papaul
[00:00:49] <wikibugs>	 10ops-codfw, 10serviceops, 10Patch-For-Review: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (10RLazarus) @Papaul All yours!
[00:18:30] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:35:30] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[00:41:18] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:42:34] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:43] <wikibugs>	 (03PS1) 10Tim Starling: Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/817860 (https://phabricator.wikimedia.org/T313578)
[00:49:10] <wikibugs>	 (03PS1) 10Tim Starling: Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817861 (https://phabricator.wikimedia.org/T313578)
[00:50:09] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/817860 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[00:50:12] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817861 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[00:56:12] <wikibugs>	 (03Merged) 10jenkins-bot: Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/817860 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[00:58:46] <wikibugs>	 (03Merged) 10jenkins-bot: Configure the nonce cache separately from the session cache [extensions/OAuth] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/817861 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[01:11:17] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/OAuth: New config var for T313578, not yet used (duration: 03m 39s)
[01:11:23] <stashbot>	 T313578: Make OAuth work in Multi-DC active/active mode - https://phabricator.wikimedia.org/T313578
[01:17:32] <wikibugs>	 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Danielgblack) Are you sure db1111 wasn't affected? While it didn't lose grafana plots there are some correlations (and some anti-c...
[01:18:32] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/OAuth: New config var for T313578, not yet used (duration: 03m 23s)
[01:18:37] <stashbot>	 T313578: Make OAuth work in Multi-DC active/active mode - https://phabricator.wikimedia.org/T313578
[01:20:42] <wikibugs>	 (03PS3) 10Tim Starling: Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578)
[01:21:49] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[01:22:57] <wikibugs>	 (03Merged) 10jenkins-bot: Set cache types for OAuth multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816884 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[01:25:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:26:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:26:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:27:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[01:28:04] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: move OAuth token storage T313578 (duration: 03m 04s)
[01:28:08] <stashbot>	 T313578: Make OAuth work in Multi-DC active/active mode - https://phabricator.wikimedia.org/T313578
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:40] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:17:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:06] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:30:07] <wikibugs>	 (03PS1) 10Krinkle: docs: Remove outdated github/travis badges [debs/pybal] - 10https://gerrit.wikimedia.org/r/817918
[02:37:28] <wikibugs>	 (03CR) 10Krinkle: monitoring: Fix broken grafana URLs that include unencoded space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812945 (owner: 10Krinkle)
[02:37:31] <wikibugs>	 (03Abandoned) 10Krinkle: monitoring: Fix broken grafana URLs that include unencoded space [puppet] - 10https://gerrit.wikimedia.org/r/812945 (owner: 10Krinkle)
[02:58:24] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[03:55:02] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:23:30] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 3 (netmon1003, ...), Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:43:38] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) @fgiunchedi and I spoke about this today. Some notes:  #### Work queue When Swift receives an object with an expiration, the...
[05:06:58] <wikibugs>	 (03PS1) 10Stang: ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013)
[05:19:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811
[05:19:45] <stashbot>	 T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811
[05:19:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811
[05:24:50] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 121 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:37:31] <wikibugs>	 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) db1111 wasn't affected by this issue. It was only affecting s1 (english wikipedia) and db1111 lives in s8 (wikidata). W...
[05:42:50] <wikibugs>	 (03PS2) 10Marostegui: site.pp: Promote db2144 to x2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811)
[05:45:50] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Promote db2144 to x2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/817736 (https://phabricator.wikimedia.org/T313811) (owner: 10Marostegui)
[05:50:07] <wikibugs>	 (03PS2) 10KartikMistry: Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817758 (https://phabricator.wikimedia.org/T313300)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T0600)
[06:00:14] <marostegui>	 !log Starting x2 codfw failover from db2142 to db2144 - T313811
[06:00:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:19] <stashbot>	 T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811
[06:00:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2144 to x2 primary T313811', diff saved to https://phabricator.wikimedia.org/P32025 and previous config saved to /var/cache/conftool/dbconfig/20220728-060057-marostegui.json
[06:01:52] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2022-07-27-220330-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817931 (https://phabricator.wikimedia.org/T308248)
[06:07:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2142 T313811', diff saved to https://phabricator.wikimedia.org/P32026 and previous config saved to /var/cache/conftool/dbconfig/20220728-060757-root.json
[06:08:03] <stashbot>	 T313811: Switchover x2 master db2142 -> db2144 - https://phabricator.wikimedia.org/T313811
[06:12:23] <wikibugs>	 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Marostegui)
[06:55:12] <kart_>	 marostegui: I want to update cxserver. Is it OK to go if the switchover is done?
[06:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[06:59:35] <kart_>	 OK, will update cxserver after backport window.
[07:00:04] <jouncebot>	 Amir1, apergos, jnuche, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T0700).
[07:00:04] <jouncebot>	 koi and kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:21] <apergos>	 good morning!
[07:00:38] <apergos>	 there are two patches in the window today. I'll check on trainees momentarily.
[07:01:43] <koi>	 hi!
[07:02:11] <apergos>	 we do have one trainee!
[07:02:14] <marostegui>	 kart_: yes!
[07:02:46] <apergos>	 they are not yet here so we should wait for them.
[07:03:12] <apergos>	 koi and kart_: would you both ordinarily be self-deploy or no?
[07:03:36] <kart_>	 apergos: I'll self-deploy.
[07:03:50] <kart_>	 Need to go to Lunch after deployment :~
[07:04:02] <koi>	 apergos: I have no access to shell so..
[07:04:04] <apergos>	 so wait a little, if our trainee doesn't show up in about 10 minutes, I'll ask you to go ahead
[07:04:14] <apergos>	 kart_: 
[07:04:14] <kart_>	 (Although: Do not leave town rule applies, so I'm in the town :D)
[07:04:49] <kart_>	 Should I go ahead, apergos?
[07:05:14] <apergos>	 the reason is that I will lean on kart to screenshare while typing, for the benefit of our trainee, as I prefer not to try to train and deploy at the same time
[07:05:41] <apergos>	 kart_: go ahead and merge but then let's wait 10 min and see if our trainee arrives.
[07:05:49] <kart_>	 OK!
[07:06:06] <apergos>	 koi: I'll handle your deploy after kart_ is settled, with or without a trainee.
[07:06:31] <kart_>	 apergos: Sorry for that. I'll have other deployments next week, so we can plan for screenshare/training.
[07:06:50] <jnuche>	 morning 👋
[07:06:59] <apergos>	 oh hey jnuche, awesome
[07:07:03] <jnuche>	 apergos: do you want me to deploy so you don't need to do both things?
[07:07:20] <apergos>	 oh! that would be dandy, but note that our trainee is not yet here
[07:07:41] <apergos>	 and yes, I'll task you with deploying today if you don't mind, more experience is better :-)
[07:08:02] <jnuche>	 okie dokes
[07:09:00] <apergos>	 kart_:  you should have +2 your config and be waiting for merge by now :-P
[07:10:12] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817758 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry)
[07:10:23] <apergos>	 good good
[07:10:35] <kart_>	 apergos: :D
[07:11:13] <wikibugs>	 (03Merged) 10jenkins-bot: Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817758 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry)
[07:13:02] <apergos>	 we'll give it 5 more minutes and then deployment by someone, heh
[07:17:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:17:35] <apergos>	 kart_: the requisite time has passed. please go ahead and self-deploy.
[07:18:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:18:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:19:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:19:11] <kart_>	 apergos: Deploying. Was confused about messages from mwdebug canaries.
[07:19:24] <apergos>	 ah!
[07:19:29] <kart_>	 2022-07-28 07:12:50,306 [INFO] The server is depooled from all services. Restarting the service directly
[07:19:33] <kart_>	 This ^
[07:19:57] <kart_>	 I think it is fixed already.
[07:19:58] <apergos>	 there have been changes over the last two days to the scap restart of php-fpm iirc
[07:20:29] <jnuche>	 yeah, that's correct, we had a bug where we weren't correctly restarting php-fpm on canaries
[07:21:44] <logmsgbot>	 !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817758|Enable SectionTranslation on 10 more WPs where ContentTranslation is available by default (T313300)]] (duration: 03m 16s)
[07:21:50] <stashbot>	 T313300: Enable Section Translation on 10 more Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T313300
[07:21:59] <kart_>	 OK. I'm done apergos.
[07:22:08] <apergos>	 great!
[07:22:42] <jnuche>	 koi: you couldn't self-deploy right?
[07:22:45] <apergos>	 jnuche, you're up for +2 and deployment of koi 's patch, after doing all the usual due diligence
[07:22:57] <jnuche>	 alright
[07:23:49] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+2] "Backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang)
[07:26:20] <koi>	 ..maybe need a rebase?
[07:26:48] <apergos>	 the rebase is part of the deployment process
[07:28:26] <jnuche>	 yeah, will be done on the deployment server
[07:30:12] <wikibugs>	 (03PS2) 10Jaime Nuche: ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang)
[07:30:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36460/console" [puppet] - 10https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto)
[07:33:21] <jnuche>	 apergos: the patch is not merging on its own, should I submit manually?
[07:34:48] <apergos>	 I would remove the +2, and redo it
[07:35:06] <wikibugs>	 (03CR) 10Jaime Nuche: ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang)
[07:35:35] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+2] ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang)
[07:36:27] <wikibugs>	 (03Merged) 10jenkins-bot: ja(wiki|wikivoyage): Add "Module talk" as alias of NS829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang)
[07:36:37] <apergos>	 there we go
[07:38:31] <jnuche>	 koi: the change is now on mwdebug1001, please check
[07:38:38] <koi>	 looking
[07:39:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:40:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:40:12] <apergos>	 so a thing I didn't think to mention that I often do is to reload the patch just before I am about to +2 it, and if I see that there is a  merge conflict I will ask the patch owner or the person asking for deployment to resolve that first.
[07:40:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:40:32] <apergos>	 usually that's a matter of a simple rebase
[07:41:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:41:14] <jnuche>	 gotcha
[07:44:53] <vgutierrez>	 !log update HAProxy to version 2.4.18 on apt.wm.o thirdparty/haproxy24
[07:44:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:25] <koi>	 I thought it should be ok but I'm not entirely sure
[07:47:39] <koi>	 anyway, please sync it
[07:48:06] <jnuche>	 do you need more time to check?
[07:48:11] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: Add ko, sr & uk wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818055 (https://phabricator.wikimedia.org/T313307)
[07:48:24] <PleaseStand>	 I tested using WikimediaDebug on mwdebug1001, and koi's patch doesn't seem to work
[07:49:00] <PleaseStand>	 If what the author of the Phab task is saying is correct, there should be no need for special site configuration, and the bug lies elsewhere
[07:49:27] <apergos>	 do you want to roll this back?
[07:49:57] <koi>	 thanks for pointing it out, I didn't know this was a kind of bug
[07:50:31] <koi>	 previously I thought I might need to run namespaceDupes.php for it to take effect, but yeah
[07:50:59] <koi>	 apergos: please revert it, I will look into it later
[07:51:07] <apergos>	  ok!
[07:51:15] <apergos>	 jnuche: you got this? 
[07:51:24] <jnuche>	 yep, rolling back
[07:51:33] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:51:50] <apergos>	 👍 
[07:54:02] <wikibugs>	 (03PS1) 10Jaime Nuche: Revert "ja(wiki|wikivoyage): Add "Module talk" as alias of NS829" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818056
[07:54:55] <wikibugs>	 (03CR) 10Stang: "Reverted in I3a5460b2627a600d53e13ca39c26819e45429d44" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817929 (https://phabricator.wikimedia.org/T314013) (owner: 10Stang)
[07:55:08] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] ml-services: Add ko, sr & uk wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818055 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira)
[07:55:53] <wikibugs>	 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) So more info: At the time of the event, nothing logged apart from: ` Jul 27 20:38:32 db1132 mysqld[3344701]: 2022-07-27...
[07:56:19] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+2] Revert "ja(wiki|wikivoyage): Add "Module talk" as alias of NS829" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818056 (owner: 10Jaime Nuche)
[07:57:33] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ja(wiki|wikivoyage): Add "Module talk" as alias of NS829" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818056 (owner: 10Jaime Nuche)
[07:58:20] <jnuche>	 koi, apergos: commit reverted and mwdebug1001 rolled back
[07:58:39] <apergos>	 awesome
[07:58:51] <koi>	 got it, thanks
[07:59:16] <jnuche>	 sure thing, sorry the fix didn't work!
[07:59:58] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: Add ko, sr & uk wiki articletopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818055 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira)
[08:01:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:01:50] <jnuche>	 !log UTC morning backport and config training
[08:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:13] <jnuche>	 oops
[08:02:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:02:21] <jnuche>	 !log UTC morning backport and config training done
[08:02:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:03:48] <apergos>	 thanks for doing all the things,  jnuche!
[08:04:09] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:04:24] <apergos>	 and that concludes today's somewhat more eventful session, see everyone here next time!  I'll leave a note on the trainee's task and see what happened there.
[08:04:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/817883 (owner: 10Andrea Denisse)
[08:05:52] <wikibugs>	 (03PS1) 10Marostegui: db2172: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818058
[08:07:01] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2172: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818058 (owner: 10Marostegui)
[08:09:40] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Add db2172 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818060 (https://phabricator.wikimedia.org/T311493)
[08:11:32] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2172 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818060 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[08:12:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2172 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P32028 and previous config saved to /var/cache/conftool/dbconfig/20220728-081252-marostegui.json
[08:12:57] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[08:14:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:17:15] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::webserver: fallback php version for webrequests [puppet] - 10https://gerrit.wikimedia.org/r/817700 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto)
[08:19:20] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] Multi-DC routing special cases for OAuth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[08:19:38] <wikibugs>	 (03CR) 10Gergő Tisza: Multi-DC routing special cases for OAuth [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[08:24:33] <wikibugs>	 (03PS1) 10Volans: admin: add sre-admins to the check for ops [puppet] - 10https://gerrit.wikimedia.org/r/818061
[08:25:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: add sre-admins to the check for ops [puppet] - 10https://gerrit.wikimedia.org/r/818061 (owner: 10Volans)
[08:27:18] <wikibugs>	 (03PS2) 10Volans: admin: add sre-admins to the check for ops [puppet] - 10https://gerrit.wikimedia.org/r/818061
[08:28:11] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[08:32:58] <wikibugs>	 (03PS1) 10Marostegui: db2174: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818062
[08:34:45] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2174: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/818062 (owner: 10Marostegui)
[08:36:10] <wikibugs>	 (03PS11) 10Filippo Giunchedi: sre: Port swift alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma)
[08:36:35] <vgutierrez>	 !log update HAProxy to version 2.4.18 in cp4021 and cp4027
[08:36:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I know Mark is busy with other commitments so I BOLDed and went ahead to fix the issues in previous patchset. I've also temporarily downgr" [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma)
[08:38:12] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] "Applying now" [puppet] - 10https://gerrit.wikimedia.org/r/817701 (https://phabricator.wikimedia.org/T312638) (owner: 10Giuseppe Lavagetto)
[08:38:14] * kart_ updating cxserver..
[08:38:28] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-07-27-220330-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817931 (https://phabricator.wikimedia.org/T308248) (owner: 10KartikMistry)
[08:42:42] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2022-07-27-220330-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/817931 (https://phabricator.wikimedia.org/T308248) (owner: 10KartikMistry)
[08:43:50] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[08:44:16] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[08:48:16] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[08:48:35] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:48:58] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[08:49:15] <icinga-wm>	 PROBLEM - puppet last run on lvs6001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:49:19] <icinga-wm>	 PROBLEM - puppet last run on lvs2008 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:49:33] <icinga-wm>	 PROBLEM - puppet last run on lvs4006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:49:33] <icinga-wm>	 PROBLEM - puppet last run on lvs3006 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:49:51] <icinga-wm>	 PROBLEM - puppet last run on lvs5001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:50:35] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "Looks good from a traffic point of view" [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[08:51:06] <wikibugs>	 (03PS3) 10Vgutierrez: varnish: enable query-sorting in production via X-Wikimedia-Debug [puppet] - 10https://gerrit.wikimedia.org/r/816206 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori)
[08:51:22] <_joe_>	 sigh
[08:51:26] <_joe_>	 no escape heh
[08:51:45] <vgutierrez>	 uh?
[08:51:52] <_joe_>	 the puppet alerts above
[08:51:53] <vgutierrez>	 akosiaris: ^^ is that related to your confd change?
[08:51:57] <_joe_>	 I tried to avoid them
[08:52:07] <_joe_>	 vgutierrez: no it's related to our puppet alerting :P
[08:52:10] <vgutierrez>	 LOL
[08:52:14] <_joe_>	 it alerts once you actually run puppet
[08:52:18] <vgutierrez>	 oh lovely
[08:52:33] * vgutierrez keeps the axe away
[08:53:36] <vgutierrez>	 !log disable puppet on cp hosts to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/816206
[08:53:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::webserver: fix non-default routing [puppet] - 10https://gerrit.wikimedia.org/r/818063
[08:55:43] <icinga-wm>	 RECOVERY - puppet last run on lvs6001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:55:47] <icinga-wm>	 RECOVERY - puppet last run on lvs2008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:55:59] <icinga-wm>	 RECOVERY - puppet last run on lvs4006 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:55:59] <icinga-wm>	 RECOVERY - puppet last run on lvs3006 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:56:12] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[08:56:17] <icinga-wm>	 RECOVERY - puppet last run on lvs5001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:56:35] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: enable query-sorting in production via X-Wikimedia-Debug [puppet] - 10https://gerrit.wikimedia.org/r/816206 (https://phabricator.wikimedia.org/T138093) (owner: 10Ori)
[08:56:46] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36461/console" [puppet] - 10https://gerrit.wikimedia.org/r/818063 (owner: 10Giuseppe Lavagetto)
[08:56:54] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[08:57:31] <kart_>	 !log Updated cxserver to 2022-07-27-220330-production (T308248)
[08:57:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:35] <stashbot>	 T308248: Newly supported languages in Google Translate - https://phabricator.wikimedia.org/T308248
[08:57:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2142', diff saved to https://phabricator.wikimedia.org/P32030 and previous config saved to /var/cache/conftool/dbconfig/20220728-085737-marostegui.json
[09:00:13] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::webserver: fix non-default routing [puppet] - 10https://gerrit.wikimedia.org/r/818063
[09:00:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] admin: add sstefanova user and to WMCS groups [puppet] - 10https://gerrit.wikimedia.org/r/817845 (https://phabricator.wikimedia.org/T313934) (owner: 10Volans)
[09:01:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36462/console" [puppet] - 10https://gerrit.wikimedia.org/r/818063 (owner: 10Giuseppe Lavagetto)
[09:02:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki::webserver: fix non-default routing [puppet] - 10https://gerrit.wikimedia.org/r/818063 (owner: 10Giuseppe Lavagetto)
[09:06:31] <wikibugs>	 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10Vgutierrez) @ori I've just deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/816206 and naive tested it against cp4027: ` vg...
[09:07:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (10fgiunchedi)
[09:17:25] <Emperor>	 !log set thanos ring replicas to 3.95 T311690
[09:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:32] <stashbot>	 T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[09:19:38] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] swift: stop flinging thumbnails at other DC in rewrite.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816726 (https://phabricator.wikimedia.org/T313102) (owner: 10MVernon)
[09:24:55] <Emperor>	 !log rolling restart of swift proxies to apply wmf/rewrite update T313102
[09:24:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:00] <stashbot>	 T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102
[09:28:52] <wikibugs>	 10SRE-swift-storage, 10Patch-For-Review: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I've deployed this fix now, so closing this issue.
[09:31:22] <wikibugs>	 (03Abandoned) 10Jbond: do not merge! [puppet] - 10https://gerrit.wikimedia.org/r/817788 (owner: 10Jbond)
[09:33:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[09:33:47] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:36:22] <wikibugs>	 (03PS1) 10Volans: raid: convert get-raid-status-megacli to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952)
[09:36:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:38:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] raid: convert get-raid-status-megacli to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans)
[09:40:32] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:40:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:41:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:48:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:50:06] <wikibugs>	 (03PS1) 10Jbond: C:raid convert to python3 [puppet] - 10https://gerrit.wikimedia.org/r/818072 (https://phabricator.wikimedia.org/T313952)
[09:50:50] <wikibugs>	 (03Abandoned) 10Jbond: C:raid convert to python3 [puppet] - 10https://gerrit.wikimedia.org/r/818072 (https://phabricator.wikimedia.org/T313952) (owner: 10Jbond)
[09:55:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:58:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:00:04] <jouncebot>	 mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1000).
[10:00:10] <wikibugs>	 (03PS2) 10Volans: raid: convert get-raid-status-megacli to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952)
[10:03:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:05:18] <jelto>	 !log update gitlab1004 to 15.0.4-ce.0
[10:05:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:27] <wikibugs>	 (03CR) 10Zfilipin: "I guess this commit can be abandoned since the related task is resolved." [puppet] - 10https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794) (owner: 10Pwangai)
[10:08:34] <wikibugs>	 (03Abandoned) 10Pwangai: admin: Add pwangai to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/817223 (https://phabricator.wikimedia.org/T313794) (owner: 10Pwangai)
[10:12:44] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin2002"
[10:13:16] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin2002"
[10:13:50] <wikibugs>	 (03PS2) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575
[10:13:56] <wikibugs>	 (03PS3) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575
[10:17:32] <wikibugs>	 (03PS1) 10Majavah: P:prometheus::openstack_exporter: fix executable permissions [puppet] - 10https://gerrit.wikimedia.org/r/818075 (https://phabricator.wikimedia.org/T314016)
[10:17:40] <wikibugs>	 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) Some updates. Along with  @Danielgblack @VicentiuCiorbaru (**MariaDB Foundation**) we have spent quite some fun time de...
[10:19:07] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin2002"
[10:19:29] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin2002"
[10:20:49] <wikibugs>	 (03PS1) 10AikoChou: ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888)
[10:20:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32032 and previous config saved to /var/cache/conftool/dbconfig/20220728-102051-root.json
[10:20:55] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Install NVMe SSDs into  moss-be100[1|2] & thanos-be100? - https://phabricator.wikimedia.org/T310922 (10LSobanski) @Jclark-ctr it will be another few weeks. The process to get the host ready has been started but it is a lengthy one.
[10:21:02] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::php: standardize php pool names (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/818077
[10:21:04] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::php: switch apache configurations to use the new pool [puppet] - 10https://gerrit.wikimedia.org/r/818078
[10:21:06] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki::php: cleanup the legacy pool name [puppet] - 10https://gerrit.wikimedia.org/r/818079
[10:22:27] <wikibugs>	 (03PS2) 10AikoChou: ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888)
[10:24:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki::php: standardize php pool names (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/818077 (owner: 10Giuseppe Lavagetto)
[10:24:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki::php: switch apache configurations to use the new pool [puppet] - 10https://gerrit.wikimedia.org/r/818078 (owner: 10Giuseppe Lavagetto)
[10:27:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 (owner: 10Jbond)
[10:28:58] <wikibugs>	 (03PS4) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575
[10:35:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32034 and previous config saved to /var/cache/conftool/dbconfig/20220728-103555-root.json
[10:37:02] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou)
[10:39:53] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou)
[10:40:02] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10User-zeljkofilipin: +2 for pwangai - https://phabricator.wikimedia.org/T313794 (10Peachey88) >>! In T313794#8109751, @pwangai wrote: > I am closing this request to follow instructions stipulated at https://www.mediawiki.org/wiki/Gerrit/Privilege_policy/en#Requesting_Gerrit_pr...
[10:43:14] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: add env variables for outlink-topic-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/818076 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou)
[10:46:30] <wikibugs>	 (03PS1) 10Jbond: fix exception handleing [cookbooks] - 10https://gerrit.wikimedia.org/r/818085
[10:46:49] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/818078
[10:46:51] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/818079
[10:48:18] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:49:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki::php: standardize pool names (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/818078 (owner: 10Giuseppe Lavagetto)
[10:50:56] <wikibugs>	 (03PS2) 10Jbond: cookbook sre.puppet.sync-netbox-hiera: Fix exception handling [cookbooks] - 10https://gerrit.wikimedia.org/r/818085
[10:51:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32035 and previous config saved to /var/cache/conftool/dbconfig/20220728-105100-root.json
[10:53:00] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[10:54:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: Add a missing f on an f-string [software/conftool] - 10https://gerrit.wikimedia.org/r/817910 (owner: 10RLazarus)
[10:56:12] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/818085 (owner: 10Jbond)
[10:57:04] <wikibugs>	 (03Merged) 10jenkins-bot: requestctl: Add a missing f on an f-string [software/conftool] - 10https://gerrit.wikimedia.org/r/817910 (owner: 10RLazarus)
[10:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[11:00:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans)
[11:00:51] <wikibugs>	 (03PS3) 10Jbond: cookbook sre.puppet.sync-netbox-hiera: Fix exception handling [cookbooks] - 10https://gerrit.wikimedia.org/r/818085
[11:00:54] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::php: standardize php pool names (1/3) [puppet] - 10https://gerrit.wikimedia.org/r/818077
[11:00:56] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/818078
[11:00:58] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::php: standardize pool names (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/818079
[11:01:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "fixed thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/818085 (owner: 10Jbond)
[11:04:52] <wikibugs>	 (03CR) 10Jbond: "As said on irc im not sure about this.  We have spoken about giving sre-admins different ldap permissions e.g. read-only orchestrator acce" [puppet] - 10https://gerrit.wikimedia.org/r/818061 (owner: 10Volans)
[11:05:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/debmonitor] - 10https://gerrit.wikimedia.org/r/817722 (owner: 10Volans)
[11:05:35] <wikibugs>	 (03CR) 10Volans: [C: 03+2] raid: convert get-raid-status-megacli to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/818070 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans)
[11:05:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add configuration for the release script [software/cumin] - 10https://gerrit.wikimedia.org/r/817726 (owner: 10Volans)
[11:05:46] <wikibugs>	 (03PS3) 10Jcrespo: Initial commit [software/pampinus] - 10https://gerrit.wikimedia.org/r/817294 (https://phabricator.wikimedia.org/T283017)
[11:06:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32036 and previous config saved to /var/cache/conftool/dbconfig/20220728-110604-root.json
[11:06:22] <wikibugs>	 (03Merged) 10jenkins-bot: cookbook sre.puppet.sync-netbox-hiera: Fix exception handling [cookbooks] - 10https://gerrit.wikimedia.org/r/818085 (owner: 10Jbond)
[11:08:06] <wikibugs>	 (03PS1) 10Jcrespo: Adapt mysql prometheus script to new zarcillo schema [puppet] - 10https://gerrit.wikimedia.org/r/818088 (https://phabricator.wikimedia.org/T283017)
[11:08:38] <icinga-wm>	 RECOVERY - MegaRAID on ms-be2067 is OK: manual re-trigger of the critical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:09:07] <volans>	 this is me
[11:09:27] <volans>	 to test the re-trigger of the raid handler, which calls get-raid-status-megacli; that should work now
[11:16:40] <wikibugs>	 (03PS2) 10Jcrespo: Adapt mysql prometheus script to new zarcillo schema [puppet] - 10https://gerrit.wikimedia.org/r/818088 (https://phabricator.wikimedia.org/T283017)
[11:17:29] <wikibugs>	 (03PS12) 10Jbond: P:ssh::client: use more modern functions for collecting sskey [puppet] - 10https://gerrit.wikimedia.org/r/816775
[11:21:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32037 and previous config saved to /var/cache/conftool/dbconfig/20220728-112109-root.json
[11:22:17] <wikibugs>	 (03PS13) 10Jbond: P:ssh::client: use more modern functions for collecting sskey [puppet] - 10https://gerrit.wikimedia.org/r/816775
[11:23:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36463/console" [puppet] - 10https://gerrit.wikimedia.org/r/816775 (owner: 10Jbond)
[11:24:28] <wikibugs>	 (03Abandoned) 10Jbond: sshkey: move the sort to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/816847 (owner: 10Jbond)
[11:24:36] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:52] <wikibugs>	 (03Abandoned) 10Jbond: never merge, test doing the reduce in ruby [puppet] - 10https://gerrit.wikimedia.org/r/816852 (owner: 10Jbond)
[11:27:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:ssh::client: use more modern functions for collecting sskey [puppet] - 10https://gerrit.wikimedia.org/r/816775 (owner: 10Jbond)
[11:30:00] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[11:32:18] <icinga-wm>	 PROBLEM - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:32:19] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T314039 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:32:22] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314039 (10ops-monitoring-bot)
[11:34:06] <wikibugs>	 (03CR) 10Vgutierrez: "please let's ensure that this CR is also backwards compatible or it's going to break the current deployment-prep environment" [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[11:35:27] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] trafficserver: 9.x upgrade: remove wmf-tls log format [puppet] - 10https://gerrit.wikimedia.org/r/803301 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[11:35:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[11:35:46] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Switch etcd clients to use conf100[789] [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408)
[11:35:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2] Switch etcd clients to use conf100[789] [puppet] - 10https://gerrit.wikimedia.org/r/817264 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[11:36:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32038 and previous config saved to /var/cache/conftool/dbconfig/20220728-113615-root.json
[11:41:22] <akosiaris>	 !log slow (10minutes interval) rolling restart of all pybals to pick up new conf hosts config. T311407
[11:41:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:28] <stashbot>	 T311407: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407
[11:43:34] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[11:43:40] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal
[11:45:24] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:45:36] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[11:45:56] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314039 (10Volans) Actual output, I'll check why it didn't work ` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components in optimal state) name: Adapter #0   Virtual Drive: 2 (Ta...
[11:46:20] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[11:46:34] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs3006 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[11:46:46] <volans>	 akosiaris: is this you?
[11:47:14] <jelto>	 he logged "slow (10minutes interval) rolling restart of all pybals to pick up new conf hosts config. T311407" 5 minutes ago
[11:47:15] <stashbot>	 T311407: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407
[11:47:16] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs6003 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[11:47:18] <jynus>	 I saw him commenting about PyBal work, and those match the new hosts
[11:48:08] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal
[11:48:08] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[11:48:22] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal
[11:48:32] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs3007 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[11:48:56] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:13] <vgutierrez>	 yep.. that's akosiaris' work
[11:49:44] <volans>	 ok, ignoring :)
[11:50:01] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "test 818085 - jbond@cumin2002"
[11:50:20] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "test 818085 - jbond@cumin2002"
[11:53:03] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Puppetize spark3 installation and configs using conda-analytics env (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata)
[11:56:28] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: port Kafka alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847)
[11:56:59] <wikibugs>	 (03PS2) 10Filippo Giunchedi: sre: port Kafka alerts from Icinga [alerts] - 10https://gerrit.wikimedia.org/r/818108 (https://phabricator.wikimedia.org/T305847)
[12:03:16] <icinga-wm>	 PROBLEM - SSH on wtp1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:07:10] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:22] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:09:42] <akosiaris>	 yeah, alerts will slowly recover as I am restarting pybals
[12:09:53] <akosiaris>	 monitoring has diverged from the actual state of things right now
[12:10:07] <akosiaris>	 I don't dare to converge any faster than that though ;-)
[12:14:01] <volans>	 ack, thx
[12:16:56] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:36] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:21:39] <wikibugs>	 (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111
[12:22:31] <wikibugs>	 (03PS11) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata)
[12:24:12] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs6001 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[12:24:19] <wikibugs>	 (03CR) 10Aqu: "Typo fixed." [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata)
[12:25:20] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:25:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond)
[12:27:48] <wikibugs>	 (03PS2) 10Sbisson: Register Wikistories streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633)
[12:27:58] <wikibugs>	 (03CR) 10Sbisson: Register Wikistories streams (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) (owner: 10Sbisson)
[12:31:14] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[12:31:18] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "I don't think that this approach will work at all. You have to implement part of the logic in spicerack IMHO to allow for a check only, do" [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond)
[12:33:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32039 and previous config saved to /var/cache/conftool/dbconfig/20220728-123304-root.json
[12:34:24] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] "There are two typos (gitlab and not gerrit), left comments in line." [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[12:37:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[12:37:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[12:38:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:38:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:38:36] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[12:38:37] <wikibugs>	 (03PS1) 10Jbond: reposync: don't ask for confirmation in dry run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114
[12:38:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[12:38:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T312990)', diff saved to https://phabricator.wikimedia.org/P32040 and previous config saved to /var/cache/conftool/dbconfig/20220728-123854-marostegui.json
[12:38:59] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[12:39:07] <wikibugs>	 (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond)
[12:39:48] <wikibugs>	 (03PS3) 10Ssingh: trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651)
[12:39:53] <wikibugs>	 (03PS2) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111
[12:40:56] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36464/console" [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[12:42:28] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:43:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T312990)', diff saved to https://phabricator.wikimedia.org/P32041 and previous config saved to /var/cache/conftool/dbconfig/20220728-124317-marostegui.json
[12:43:24] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Add db2174 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818115 (https://phabricator.wikimedia.org/T311493)
[12:43:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond)
[12:44:58] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs6003 is OK: OK: 16 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[12:45:08] <wikibugs>	 (03PS1) 10AikoChou: ml-services: add outlink-topic-model isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818116 (https://phabricator.wikimedia.org/T313888)
[12:45:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] reposync: don't ask for confirmation in dry run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114 (owner: 10Jbond)
[12:45:56] <wikibugs>	 (03PS2) 10Jbond: reposync: don't ask for confirmation in dry run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/818114
[12:46:09] <wikibugs>	 (03PS3) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111
[12:47:13] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2174 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/818115 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui)
[12:48:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32042 and previous config saved to /var/cache/conftool/dbconfig/20220728-124809-root.json
[12:49:26] <wikibugs>	 (03PS3) 10Phuedx: testwiki: Add mediawiki.web_ui.interactions stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268)
[12:49:40] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:52:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2174 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P32043 and previous config saved to /var/cache/conftool/dbconfig/20220728-125253-marostegui.json
[12:52:59] <stashbot>	 T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[12:54:05] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36465/console" [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[12:58:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P32044 and previous config saved to /var/cache/conftool/dbconfig/20220728-125823-marostegui.json
[12:59:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add outlink-topic-model isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/818116 (https://phabricator.wikimedia.org/T313888) (owner: 10AikoChou)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1300).
[13:00:05] <jouncebot>	 Lucas_WMDE and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <Lucas_WMDE>	 o/
[13:00:11] <phuedx>	 o/
[13:00:12] <Lucas_WMDE>	 I can deploy!
[13:01:13] <phuedx>	 Thanks! I have a new laptop and I haven't yet generated new keys
[13:01:23] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806931 (https://phabricator.wikimedia.org/T307869)
[13:01:39] <Lucas_WMDE>	 ok
[13:02:01] <Lucas_WMDE>	 did you see my question in here yesterday, about the rate?
[13:02:08] <wikibugs>	 (03PS4) 10Ssingh: trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651)
[13:02:12] <Lucas_WMDE>	 (I haven’t looked at the diffConfig of the latest patch set yet)
[13:03:01] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36466/console" [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:03:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32045 and previous config saved to /var/cache/conftool/dbconfig/20220728-130314-root.json
[13:03:31] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Configure wbsearchentities profile parameter on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806931 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE))
[13:03:53] <phuedx>	 Lucas_WMDE: I did see the question. I think it's because the stream is now default in InitialiseSettings.php and I couldn't override it for all beta wikis in InitialiseSettings-labs.php (+default doesn't work IIRC)
[13:04:04] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:04:08] <phuedx>	 So I'd expect some beta wikis to have the stream defined but with a sampling rate of 0
[13:04:13] <Lucas_WMDE>	 ok
[13:04:15] <phuedx>	 Which is acceptable
[13:04:21] <Lucas_WMDE>	 that was going to be my next question :)
[13:04:25] <Lucas_WMDE>	 “do we care”
[13:04:26] <wikibugs>	 (03Merged) 10jenkins-bot: Configure wbsearchentities profile parameter on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806931 (https://phabricator.wikimedia.org/T307869) (owner: 10Lucas Werkmeister (WMDE))
[13:04:29] <wikibugs>	 (03PS1) 10Volans: raid: fix compression in get-raid-status-megacli [puppet] - 10https://gerrit.wikimedia.org/r/818120 (https://phabricator.wikimedia.org/T313952)
[13:04:32] <icinga-wm>	 RECOVERY - SSH on wtp1041.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:04:53] <Lucas_WMDE>	 hm, mediawiki-staging is already ahead of upstream before I’m running git fetch
[13:04:57] <Lucas_WMDE>	 let’s see if that resolves itself in a moment…
[13:05:03] <Lucas_WMDE>	 it does, yay
[13:05:32] <Lucas_WMDE>	 testing on mwdebug1001
[13:05:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:06:42] <Lucas_WMDE>	 seems to work fine, I’ll sync
[13:07:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[13:08:29] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[13:08:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:09:25] <Lucas_WMDE>	 phuedx: interestingly, all the rate 1 lines in the diff have a comma at the end, and all the rate 0 lines don’t
[13:09:43] <Lucas_WMDE>	 I guess that’s also due to them coming from IS.php vs IS-labs.php, or something like that
[13:09:49] <phuedx>	 :o
[13:09:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[13:10:27] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806931|Configure wbsearchentities profile parameter on Wikidata (T307869)]] (duration: 03m 25s)
[13:10:31] <stashbot>	 T307869: Request for new search profile for Wikidata that boosts Items for languages - https://phabricator.wikimedia.org/T307869
[13:10:45] <Lucas_WMDE>	 seems to be either rate 1, unit pageView; or unit pageView, rate 0
[13:11:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:12:02] <Lucas_WMDE>	 why is Gerrit claiming the change is up to date and not letting me rebase it?
[13:12:08] <Lucas_WMDE>	 I just merged another change into master didn’t I
[13:12:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:12:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:12:48] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): testwiki: Add mediawiki.web_ui.interactions stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268) (owner: 10Phuedx)
[13:13:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:13:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P32047 and previous config saved to /var/cache/conftool/dbconfig/20220728-131329-marostegui.json
[13:13:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:13:51] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] testwiki: Add mediawiki.web_ui.interactions stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268) (owner: 10Phuedx)
[13:13:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:14:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:14:38] <wikibugs>	 (03Merged) 10jenkins-bot: testwiki: Add mediawiki.web_ui.interactions stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817225 (https://phabricator.wikimedia.org/T311268) (owner: 10Phuedx)
[13:15:13] <Lucas_WMDE>	 phuedx: the change is on mwdebug1001, can you test it?
[13:15:46] <phuedx>	 On it
[13:17:27] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/debmonitor] - 10https://gerrit.wikimedia.org/r/817722 (owner: 10Volans)
[13:17:36] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Add configuration for the release script [software/cumin] - 10https://gerrit.wikimedia.org/r/817726 (owner: 10Volans)
[13:18:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:18:14] <phuedx>	 Lucas_WMDE: I've confirmed that the stream config is only sent to the client on testwiki (and, for example, not on enwiki) and that it has a sampling rate of 1 on testwiki
[13:18:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32048 and previous config saved to /var/cache/conftool/dbconfig/20220728-131818-root.json
[13:19:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:19:12] <Lucas_WMDE>	 phuedx: thanks!
[13:19:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:20:05] <phuedx>	 Lucas_WMDE: Confirmed that the appropriate events are being sent only on testwiki
[13:20:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:20:08] <phuedx>	 Lucas_WMDE: LGTM
[13:20:20] <Lucas_WMDE>	 oh, sorry, I thought the first message was already a LGTM ^^
[13:20:22] <Lucas_WMDE>	 syncing now anyways
[13:20:23] <Lucas_WMDE>	 thanks!
[13:21:18] <wikibugs>	 (03Merged) 10jenkins-bot: Add configuration for the release script [software/debmonitor] - 10https://gerrit.wikimedia.org/r/817722 (owner: 10Volans)
[13:22:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817225|testwiki: Add mediawiki.web_ui.interactions stream (T311268)]] (1/2) (duration: 03m 24s)
[13:23:01] <stashbot>	 T311268: *WebUIActionsTracking migration to Metrics Platform - https://phabricator.wikimedia.org/T311268
[13:23:24] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson)
[13:23:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] raid: fix compression in get-raid-status-megacli [puppet] - 10https://gerrit.wikimedia.org/r/818120 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans)
[13:24:23] <wikibugs>	 (03Merged) 10jenkins-bot: Add configuration for the release script [software/cumin] - 10https://gerrit.wikimedia.org/r/817726 (owner: 10Volans)
[13:24:34] <wikibugs>	 (03CR) 10Volans: [C: 03+2] raid: fix compression in get-raid-status-megacli [puppet] - 10https://gerrit.wikimedia.org/r/818120 (https://phabricator.wikimedia.org/T313952) (owner: 10Volans)
[13:25:31] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:26:30] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:817225|testwiki: Add mediawiki.web_ui.interactions stream (T311268)]] (2/2) (duration: 03m 19s)
[13:27:00] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314039 (10Volans) 05Open→03Resolved a:03Volans Resolving this to test that the raid handler can create a new one correctly.
[13:27:10] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T312990)', diff saved to https://phabricator.wikimedia.org/P32049 and previous config saved to /var/cache/conftool/dbconfig/20220728-132835-marostegui.json
[13:28:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[13:28:40] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[13:28:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[13:29:11] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[13:29:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[13:29:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32050 and previous config saved to /var/cache/conftool/dbconfig/20220728-132929-marostegui.json
[13:31:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32051 and previous config saved to /var/cache/conftool/dbconfig/20220728-133157-marostegui.json
[13:33:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32052 and previous config saved to /var/cache/conftool/dbconfig/20220728-133323-root.json
[13:33:50] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: rename max_connections_active_in [puppet] - 10https://gerrit.wikimedia.org/r/803286 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:34:18] <icinga-wm>	 RECOVERY - MegaRAID on ms-be2067 is OK: testing get_raid_status_megacli https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:34:18] <volans>	 this is me ^^^
[13:35:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10cmooney) @Andrew I'm reluctant to allocate more space for WMCS in Codfw, when there is a /29 already allocated and not being used.  So I've routed 185....
[13:38:04] <wikibugs>	 (03PS4) 10Ssingh: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651)
[13:38:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36467/console" [puppet] - 10https://gerrit.wikimedia.org/r/818077 (owner: 10Giuseppe Lavagetto)
[13:38:49] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36468/console" [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:39:34] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:42:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P:prometheus::openstack_exporter: fix executable permissions [puppet] - 10https://gerrit.wikimedia.org/r/818075 (https://phabricator.wikimedia.org/T314016) (owner: 10Majavah)
[13:43:06] <wikibugs>	 (03CR) 10Vgutierrez: trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:43:29] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: switch ip_allow.config to YAML format [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:46:08] <wikibugs>	 (03PS5) 10Ssingh: trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651)
[13:46:14] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We need to fix noc.wikimedia.org at the very least before this can be done, but more importantly, we need to verify nothing on the appserv" [puppet] - 10https://gerrit.wikimedia.org/r/818079 (owner: 10Giuseppe Lavagetto)
[13:46:51] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36469/console" [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:47:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P32053 and previous config saved to /var/cache/conftool/dbconfig/20220728-134703-marostegui.json
[13:47:16] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "rebased on production, no code change" [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:47:47] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:48:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32054 and previous config saved to /var/cache/conftool/dbconfig/20220728-134828-root.json
[13:51:15] <wikibugs>	 (03PS6) 10Ssingh: trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651)
[13:52:11] <wikibugs>	 (03CR) 10Michael Große: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818127 (https://phabricator.wikimedia.org/T313896) (owner: 10Michael Große)
[13:52:23] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36471/console" [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[13:53:32] <icinga-wm>	 PROBLEM - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:53:33] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on ms-be2067 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T314049 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:53:38] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10ops-monitoring-bot)
[13:58:46] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Puppetize spark3 installation and configs using conda-analytics env [puppet] - 10https://gerrit.wikimedia.org/r/813278 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata)
[13:59:32] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Volans)
[14:00:02] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10Python3-Porting: Migrate get-raid-status-megacli to Python3 - https://phabricator.wikimedia.org/T313952 (10Volans) 05Open→03Resolved a:03Volans The script has been converted to Python 3 and it's now working again. For the speci...
[14:00:37] <logmsgbot>	 !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@137a4ff]: (no justification provided)
[14:00:51] <wikibugs>	 (03PS7) 10Ssingh: trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651)
[14:01:23] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10CDanis)
[14:01:27] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651 (10CDanis)
[14:01:44] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:02:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P32055 and previous config saved to /var/cache/conftool/dbconfig/20220728-140209-marostegui.json
[14:02:24] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10CDanis) Awaiting {T309651} to continue testing
[14:02:41] <logmsgbot>	 !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@137a4ff]: (no justification provided) (duration: 02m 03s)
[14:03:04] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[14:05:00] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs3005 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[14:05:20] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:09:42] <wikibugs>	 (03PS8) 10Ssingh: trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651)
[14:10:31] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36475/console" [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[14:11:26] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:11:32] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:13:25] <wikibugs>	 10SRE, 10Data-Engineering, 10Data Pipelines (Sprint 00), 10Patch-For-Review: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10JArguello-WMF)
[14:14:06] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs3006 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[14:14:28] <wikibugs>	 (03PS2) 10Ssingh: trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651)
[14:15:29] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36476/console" [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[14:16:06] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[14:16:14] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 36 connections established with conf1007.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal
[14:17:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32056 and previous config saved to /var/cache/conftool/dbconfig/20220728-141715-marostegui.json
[14:17:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[14:17:21] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[14:17:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[14:17:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T312990)', diff saved to https://phabricator.wikimedia.org/P32057 and previous config saved to /var/cache/conftool/dbconfig/20220728-141736-marostegui.json
[14:18:27] <volans>	 akosiaris: FYI there were also some BGP alerts above that I think are related to the pybal restarts
[14:19:04] <volans>	 I checked a couple of them and they looked already re-established
[14:19:13] <volans>	 let's see if icinga agrees
[14:19:48] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:20:01] <topranks>	 All seem up in eqiad/codfw anyway. Most recent 8 mins back
[14:20:04] <topranks>	 thanks volans
[14:20:15] <sukhe>	 ^ this is happening because of a misbehaving certificate transparency log, https://sabre.ct.comodo.com/
[14:20:25] <sukhe>	 if it persists, I will remove it from the list. nothing much we can do about it
[14:20:27] <volans>	 thanks sukhe, I was about to have a quick look
[14:21:12] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:21:20] <akosiaris>	 volans: yup, it should.
[14:22:00] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:22:14] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:26] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 71 connections established with conf1007.eqiad.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal
[14:22:52] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs3007 is OK: OK: 16 connections established with conf1009.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[14:24:22] <akosiaris>	 yup, here it is
[14:24:28] <akosiaris>	 probably done soon. 
[14:25:11] <wikibugs>	 (03CR) 10Ebernhardson: elastic: Restart masters one at a time after all others (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson)
[14:26:28] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson)
[14:31:13] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Beta: add configuration for redirect badges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818127 (https://phabricator.wikimedia.org/T313896) (owner: 10Michael Große)
[14:32:30] <wikibugs>	 (03PS13) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723
[14:32:33] <wikibugs>	 (03PS1) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134
[14:37:18] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 119 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal
[14:41:52] <icinga-wm>	 PROBLEM - etcd service on conf1006 is CRITICAL: CRITICAL - Expecting active but unit etcd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:42:00] <icinga-wm>	 PROBLEM - Etcd cluster health on conf1006 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[14:42:01] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] hieradata: move all of sessionstore to 3.11.13 [puppet] - 10https://gerrit.wikimedia.org/r/817798 (https://phabricator.wikimedia.org/T309896) (owner: 10MVernon)
[14:42:36] <icinga-wm>	 PROBLEM - Etcd cluster health on conf1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[14:43:10] <icinga-wm>	 PROBLEM - Etcd cluster health on conf1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[14:43:14] <icinga-wm>	 PROBLEM - etcd service on conf1004 is CRITICAL: CRITICAL - Expecting active but unit etcd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:43:22] <icinga-wm>	 PROBLEM - etcd service on conf1005 is CRITICAL: CRITICAL - Expecting active but unit etcd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:43:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:44:05] <sukhe>	 ^ known?
[14:44:07] <akosiaris>	 why are these even alerting? I've disabled alerting on all conf100[456] hosts
[14:44:40] <akosiaris>	 sukhe: yeah, I've just removed the role and it should pick up again
[14:44:41] <akosiaris>	 g
[14:44:43] <akosiaris>	 grr
[14:44:51] <sukhe>	 np :) 
[14:46:05] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: upgrade to 3.11.13 T309896 - mvernon@cumin2002
[14:46:11] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[14:47:12] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: conf100[456]: Remove them from server SRV RRs [dns] - 10https://gerrit.wikimedia.org/r/817261 (https://phabricator.wikimedia.org/T311408)
[14:47:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] conf100[456]: Remove them from server SRV RRs [dns] - 10https://gerrit.wikimedia.org/r/817261 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[14:48:49] <wikibugs>	 (03PS18) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389)
[14:56:13] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[14:57:25] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Remove mentions of conf100[456] [puppet] - 10https://gerrit.wikimedia.org/r/817266 (https://phabricator.wikimedia.org/T311408)
[14:57:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove mentions of conf100[456] [puppet] - 10https://gerrit.wikimedia.org/r/817266 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[14:57:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott)
[14:58:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312990)', diff saved to https://phabricator.wikimedia.org/P32061 and previous config saved to /var/cache/conftool/dbconfig/20220728-145805-marostegui.json
[14:58:11] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[14:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[14:59:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10RobH)
[15:02:33] <wikibugs>	 (03PS2) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[15:02:49] <wikibugs>	 (03PS1) 10Jdrewniak: Revert "styles: Unify on standard external link icon" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818147 (https://phabricator.wikimedia.org/T261391)
[15:02:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott)
[15:03:12] <wikibugs>	 (03PS9) 10Ssingh: trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651)
[15:03:58] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36480/console" [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:05:07] <wikibugs>	 (03PS3) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[15:05:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott)
[15:08:59] <wikibugs>	 (03PS3) 10Ssingh: trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651)
[15:09:04] <wikibugs>	 (03PS4) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[15:09:44] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36481/console" [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:10:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott)
[15:10:46] <wikibugs>	 (03PS1) 10Phuedx: Remove WikibaseTermboxInteraction $wgEventLoggingSchemas entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303)
[15:13:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P32062 and previous config saved to /var/cache/conftool/dbconfig/20220728-151311-marostegui.json
[15:13:49] <wikibugs>	 (03CR) 10Phuedx: [C: 04-2] "DNM until after I2fb990ee086 has been deployed (Thursday, 4th August 2022 at ~20:00 UTC)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818137 (https://phabricator.wikimedia.org/T290303) (owner: 10Phuedx)
[15:14:14] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:14:41] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: install ATS 9.x from component [puppet] - 10https://gerrit.wikimedia.org/r/816806 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:14:52] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:15:37] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:16:33] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: 9.x upgrade: remove deprecated parent_proxy_routing_enable [puppet] - 10https://gerrit.wikimedia.org/r/803288 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:17:51] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission for hosts conf[1004-1006].eqiad.wmnet
[15:20:32] <wikibugs>	 (03CR) 10Ebernhardson: elastic: Restart masters one at a time after all others (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson)
[15:20:52] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mwdebug: Switch conf100[456] to conf100[789] [deployment-charts] - 10https://gerrit.wikimedia.org/r/818140 (https://phabricator.wikimedia.org/T311408)
[15:22:07] <wikibugs>	 (03PS5) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[15:22:13] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: upgrade to 3.11.13 T309896 - mvernon@cumin2002
[15:22:18] <stashbot>	 T309896: Upgrade Cassandra to latest 3.x (3.11.13) - https://phabricator.wikimedia.org/T309896
[15:23:01] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: datahub: Switchover conf1004 to conf1007 [deployment-charts] - 10https://gerrit.wikimedia.org/r/818141 (https://phabricator.wikimedia.org/T311408)
[15:23:45] <Krinkle>	 EventGate broken?
[15:23:49] <Krinkle>	 https://grafana.wikimedia.org/d/000000326/navigation-timing-alerts?orgId=1&refresh=5m
[15:24:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates  - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[15:24:45] <Krinkle>	 [10min ago][#wikimedia-perf-bots]  <jinxer-wm> (Low amount of navigation timing data for group 2) firing: Low amount of navigation timing data for group 2   - https://alerts.wikimedia.org/?q=alertname%3DLow+amount+of+navigation+timing+data+for+group+2
[15:25:28] <akosiaris>	 Krinkle: doesn't look like it? https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All&from=now-12h&to=now
[15:25:31] <akosiaris>	 at least not eventgate 
[15:26:35] <akosiaris>	 I have failed over etcd and zookeeper hosts to the newer machines but I haven't witnessed anything yet. The kafkas have been restarted, other conf clients have been restarted too
[15:26:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136 (owner: 10Andrew Bogott)
[15:27:36] <Krinkle>	 akosiaris: webperf navtiming.py perhaps?
[15:28:07] <wikibugs>	 (03CR) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[15:28:14] <Krinkle>	 https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=All&from=now-15m&to=now
[15:28:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P32063 and previous config saved to /var/cache/conftool/dbconfig/20220728-152817-marostegui.json
[15:28:20] <wikibugs>	 (03PS6) 10Dzahn: gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142)
[15:28:41] <Krinkle>	 This is the EventGate instance for client events, navtiming still coming in there
[15:28:52] <Krinkle>	 So it's lost between intake and graphite
[15:29:15] <jinxer-wm>	 (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates  - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold
[15:29:55] <wikibugs>	 (03PS6) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[15:30:06] <wikibugs>	 (03PS1) 10Ssingh: hiera: enable ATS9 on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/818144 (https://phabricator.wikimedia.org/T309651)
[15:30:22] <akosiaris>	 Krinkle: navtiming is spewing an exception indeed
[15:30:44] <akosiaris>	 pasting it to phab
[15:30:54] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36486/console" [puppet] - 10https://gerrit.wikimedia.org/r/818144 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:31:21] <akosiaris>	 Krinkle: https://phabricator.wikimedia.org/P32064
[15:31:46] <wikibugs>	 (03PS7) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[15:35:04] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hiera: enable ATS9 on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/818144 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:35:06] <wikibugs>	 (03CR) 10Ori: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori)
[15:36:03] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10odimitrijevic) Approved!
[15:36:12] <wikibugs>	 (03PS8) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[15:36:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10odimitrijevic) Approved
[15:36:47] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (10odimitrijevic) Approved
[15:36:59] <wikibugs>	 (03PS2) 10Ori: Randomize thumbnail TTL to prevent stampedes [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661)
[15:37:47] <sukhe>	 !depool ats-be on cp4026 for ATS9 testing
[15:37:47] <wm-bot>	 for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done
[15:37:51] <sukhe>	 ha
[15:37:56] <sukhe>	 !log depool ats-be on cp4026 for ATS9 testing
[15:37:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:57] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4026.ulsfo.wmnet,service=ats-be
[15:43:19] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] mwdebug: Switch conf100[456] to conf100[789] [deployment-charts] - 10https://gerrit.wikimedia.org/r/818140 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[15:43:22] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: enable ATS9 on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/818144 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:43:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312990)', diff saved to https://phabricator.wikimedia.org/P32066 and previous config saved to /var/cache/conftool/dbconfig/20220728-154323-marostegui.json
[15:43:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[15:43:28] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[15:43:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[15:43:41] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson)
[15:43:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T312990)', diff saved to https://phabricator.wikimedia.org/P32067 and previous config saved to /var/cache/conftool/dbconfig/20220728-154344-marostegui.json
[15:45:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] datahub: Switchover conf1004 to conf1007 [deployment-charts] - 10https://gerrit.wikimedia.org/r/818141 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[15:46:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312990)', diff saved to https://phabricator.wikimedia.org/P32068 and previous config saved to /var/cache/conftool/dbconfig/20220728-154607-marostegui.json
[15:46:57] <wikibugs>	 (03Merged) 10jenkins-bot: mwdebug: Switch conf100[456] to conf100[789] [deployment-charts] - 10https://gerrit.wikimedia.org/r/818140 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[15:49:30] <wikibugs>	 (03Merged) 10jenkins-bot: datahub: Switchover conf1004 to conf1007 [deployment-charts] - 10https://gerrit.wikimedia.org/r/818141 (https://phabricator.wikimedia.org/T311408) (owner: 10Alexandros Kosiaris)
[15:52:18] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for FNegri - https://phabricator.wikimedia.org/T314066 (10fnegri)
[15:52:49] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: sync
[15:52:53] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync
[15:53:18] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for FNegri - https://phabricator.wikimedia.org/T314066 (10fnegri) Please note I was already added to the "wmf" LDAP group by @Andrew because we both didn't realize the correct procedure was to go through this ticket! I still need to be added to the wmf-nda Pha...
[15:54:19] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:54:43] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:55:16] <wikibugs>	 (03PS3) 10Jbond: P:gerrit: Export sshkey for gerrit shared services [puppet] - 10https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857)
[15:57:09] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp4026 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:57:22] <vgutierrez>	 ^^ that's sukhe & me
[15:57:45] <mutante>	 thanks
[15:59:14] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36489/console" [puppet] - 10https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond)
[16:00:05] <jouncebot>	 jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:01:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P32069 and previous config saved to /var/cache/conftool/dbconfig/20220728-160113-marostegui.json
[16:08:43] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4026 is OK: HTTP OK: HTTP/1.1 200 Ok - 35278 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:09:07] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4026 is OK: HTTP OK: HTTP/1.0 200 OK - 24940 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:09:15] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance backend on cp4026 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:11:47] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox
[16:11:57] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: sync
[16:12:06] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync
[16:12:33] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:15:11] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic1056.eqiad.wmnet
[16:15:12] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Avoid loading plugins from /run [puppet] - 10https://gerrit.wikimedia.org/r/818172 (https://phabricator.wikimedia.org/T309651)
[16:16:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P32070 and previous config saved to /var/cache/conftool/dbconfig/20220728-161621-marostegui.json
[16:16:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] trafficserver: Avoid loading plugins from /run [puppet] - 10https://gerrit.wikimedia.org/r/818172 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez)
[16:16:57] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36491/console" [puppet] - 10https://gerrit.wikimedia.org/r/818172 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez)
[16:17:17] <wikibugs>	 (03CR) 10Jbond: "also need to add querysort vmod" [puppet] - 10https://gerrit.wikimedia.org/r/818134 (owner: 10Jbond)
[16:17:26] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Avoid loading plugins from /run [puppet] - 10https://gerrit.wikimedia.org/r/818172 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez)
[16:21:30] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/datahub: sync on main
[16:21:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[16:21:41] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: sync on main
[16:21:49] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[16:21:59] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main
[16:22:03] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[16:23:31] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:23:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job etcd in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:23:50] <sukhe>	 ^ expected
[16:23:55] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:23:57] <sukhe>	 vgutierrez and I are fixing
[16:23:59] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:24:10] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:24:12] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts conf[1004-1006].eqiad.wmnet
[16:24:41] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1056.eqiad.wmnet
[16:26:58] <wikibugs>	 (03PS1) 10Ssingh: trafficserver: add top-level tag for logging [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651)
[16:28:06] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36492/console" [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[16:29:16] <wikibugs>	 (03PS2) 10Ssingh: trafficserver: add top-level tag for logging [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651)
[16:30:09] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp4026 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:30:11] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36493/console" [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[16:31:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312990)', diff saved to https://phabricator.wikimedia.org/P32071 and previous config saved to /var/cache/conftool/dbconfig/20220728-163127-marostegui.json
[16:31:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[16:31:34] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[16:31:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[16:31:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T312990)', diff saved to https://phabricator.wikimedia.org/P32072 and previous config saved to /var/cache/conftool/dbconfig/20220728-163149-marostegui.json
[16:33:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Please verify location of an-worker1111.eqiad.wmnet - https://phabricator.wikimedia.org/T298785 (10Cmjohnson) 05Open→03Resolved
[16:34:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T312990)', diff saved to https://phabricator.wikimedia.org/P32073 and previous config saved to /var/cache/conftool/dbconfig/20220728-163412-marostegui.json
[16:34:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Please verify location of an-master1001.eqiad.wmnet - https://phabricator.wikimedia.org/T298621 (10Cmjohnson) 05Open→03Resolved
[16:35:19] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4026 is OK: HTTP OK: HTTP/1.0 200 OK - 24969 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:35:25] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance backend on cp4026 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:37:10] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] trafficserver: add top-level tag for logging [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[16:37:19] <wikibugs>	 (03PS5) 10Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250)
[16:38:08] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:38:51] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4026 is OK: HTTP OK: HTTP/1.1 200 Ok - 35353 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[16:40:59] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: add top-level tag for logging [puppet] - 10https://gerrit.wikimedia.org/r/818174 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[16:41:12] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "now it works. creates firewall rules/rsync server/monitoring on gerrit2002, while on gerrit1001/2001 it does nothing except add the new ho" [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[16:41:59] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10Cmjohnson) a:03Jclark-ctr
[16:42:46] <mutante>	 !log disabling puppet on gerrit servers for a change in gerrit puppet code
[16:42:49] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:11] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36495/gerrit2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[16:44:17] <wikibugs>	 (03PS6) 10Dzahn: gerrit: turn gerrit2002 into a gerrit migration dest host [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250)
[16:45:12] <vgutierrez>	 !log pooling ats-be@cp4026 running ATS 9.1.2 - T309651
[16:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:17] <stashbot>	 T309651: Package and deploy ATS 9.1.2 - https://phabricator.wikimedia.org/T309651
[16:49:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P32074 and previous config saved to /var/cache/conftool/dbconfig/20220728-164918-marostegui.json
[16:55:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Idea LGTM! See inline" [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori)
[16:55:43] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:57:28] <wikibugs>	 (03CR) 10Dzahn: "gerrit2002 getting rsyncd / firewall rules.. and noop confirmed on prod hosts gerrit1001/gerrit2001 -" [puppet] - 10https://gerrit.wikimedia.org/r/817841 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[16:58:35] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:05] <jouncebot>	 bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1700).
[17:02:42] <wikibugs>	 (03PS1) 10Dzahn: admin: add gerrit access groups to gerrit migration role [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597)
[17:04:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P32075 and previous config saved to /var/cache/conftool/dbconfig/20220728-170424-marostegui.json
[17:04:54] <wikibugs>	 (03CR) 10Dzahn: "@Jbond This situation does not make it an access request, right? No changes to groups, new hardware replacing old hardware..." [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[17:07:23] <wikibugs>	 (03CR) 10Dzahn: "also sets contact groups for Icinga monitoring which just got added. so in theory you get notified and have privs in Icinga because you ar" [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[17:10:05] <wikibugs>	 (03CR) 10Dzahn: "@Jbond same here, it's only about giving access to new hardware that is replacing old hardware with the twist that it would be nice to hav" [puppet] - 10https://gerrit.wikimedia.org/r/817811 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[17:16:27] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:18:11] <wikibugs>	 (03PS2) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950)
[17:18:18] <ryankemper>	 !log [Elastic] `sudo disable-puppet "production issue"` && `sudo systemctl stop mjolnir-kafka-bulk-daemon.service` on `ryankemper@search-loader1001`
[17:18:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T312990)', diff saved to https://phabricator.wikimedia.org/P32076 and previous config saved to /var/cache/conftool/dbconfig/20220728-171930-marostegui.json
[17:19:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[17:19:37] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[17:19:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[17:19:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:19:50] <ryankemper>	 !log [Elastic] `ryankemper@search-loader2001:~$ sudo disable-puppet "production issue" && sudo systemctl stop mjolnir-kafka-bulk-daemon.service` just to be safe (we prob only needed to halt eqiad)
[17:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:20:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T312990)', diff saved to https://phabricator.wikimedia.org/P32077 and previous config saved to /var/cache/conftool/dbconfig/20220728-172008-marostegui.json
[17:22:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312990)', diff saved to https://phabricator.wikimedia.org/P32078 and previous config saved to /var/cache/conftool/dbconfig/20220728-172235-marostegui.json
[17:23:14] <ryankemper>	 !log [Elastic] Restarting `elastic1072` after halting mjolnir bulk daemons: `ryankemper@elastic1072:~$ sudo depool && sleep 30 && sudo systemctl restart elasticsearch_6* && sleep 30 && sudo pool`
[17:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (10Papaul) @fgiunchedi f1-f4 PDU's are not setup yet
[17:33:33] <wikibugs>	 (03PS14) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723
[17:34:30] <wikibugs>	 (03PS15) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723
[17:34:59] <wikibugs>	 (03PS9) 10Andrew Bogott: cloudrabbit hosts: add service names to rabbit certs [puppet] - 10https://gerrit.wikimedia.org/r/818136
[17:35:01] <wikibugs>	 (03PS3) 10Andrew Bogott: hieradata: switch traffic to cloudrabbit1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/816818 (owner: 10Majavah)
[17:36:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (10wiki_willy) a:03Cmjohnson
[17:37:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P32079 and previous config saved to /var/cache/conftool/dbconfig/20220728-173742-marostegui.json
[17:38:58] <wikibugs>	 (03PS16) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723
[17:39:50] <wikibugs>	 (03CR) 10Jbond: C:varnish: Rate limit hotlinking (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond)
[17:40:15] <wikibugs>	 (03PS3) 10Andrew Bogott: wikimediacloud.org: add cname records for rabbitmq [dns] - 10https://gerrit.wikimedia.org/r/817877
[17:41:04] <ryankemper>	 !log [Elastic] Re-running `delete`s and `update`s from `2022-07-28T15:00:00Z` until `2022-07-28T17:30:00Z` on `ryankemper@mwmaint1002` tmux `mlr_outage`
[17:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:25] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:52:35] <wikibugs>	 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) on gerrit2002 we now have, created by the migration class:  - a group "gerrit2" - a user "gerrit2" - a directory /srv/gerrit - package rsync installed, /etc/def...
[17:52:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P32080 and previous config saved to /var/cache/conftool/dbconfig/20220728-175248-marostegui.json
[17:53:05] <wikibugs>	 (03PS3) 10Ori: Randomize thumbnail TTL to prevent stampedes [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661)
[17:53:53] <wikibugs>	 (03CR) 10Ori: Randomize thumbnail TTL to prevent stampedes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori)
[17:54:04] <wikibugs>	 10SRE, 10Gerrit, 10serviceops, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn)
[17:54:09] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "this is blocked on https://phabricator.wikimedia.org/T313972" [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[17:55:11] <wikibugs>	 (03CR) 10Dzahn: "blocked by https://phabricator.wikimedia.org/T313972" [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[17:55:32] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "blocked by https://phabricator.wikimedia.org/T313972" [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn)
[17:55:58] <wikibugs>	 (03PS2) 10Dzahn: add gerrit-replica-new.wikimedia.org, point to 208.80.153.109 [dns] - 10https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250)
[17:56:51] <wikibugs>	 (03PS3) 10Dzahn: gerrit: add hiera settings for replica to gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250)
[18:00:04] <jouncebot>	 brennen and jeena: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T1800).
[18:00:51] <jeena>	 o/
[18:01:25] <brennen>	 o/ - currently blocked on T314058
[18:01:25] <stashbot>	 T314058: TypeError: Argument 1 passed to Flow\Hooks::onSpecialCheckUserGetLinksFromRow() must be SpecialPage, CheckUserGetEditsPager given - https://phabricator.wikimedia.org/T314058
[18:01:51] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:03:01] <brennen>	 there's a patch in the works for that blocker, shouldn't be too long i think.
[18:03:34] <wikibugs>	 (03PS2) 10Jbond: C:varnish: fix varnish confd test data [puppet] - 10https://gerrit.wikimedia.org/r/818134
[18:03:36] <wikibugs>	 (03PS17) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723
[18:06:47] <ryankemper>	 !log [Elastic] Finished re-running `delete`s and `update`s from `2022-07-28T15:00:00Z` until `2022-07-28T17:30:00Z`
[18:06:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312990)', diff saved to https://phabricator.wikimedia.org/P32081 and previous config saved to /var/cache/conftool/dbconfig/20220728-180754-marostegui.json
[18:07:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[18:07:59] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[18:08:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[18:08:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32082 and previous config saved to /var/cache/conftool/dbconfig/20220728-180815-marostegui.json
[18:10:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32083 and previous config saved to /var/cache/conftool/dbconfig/20220728-181044-marostegui.json
[18:15:56] <wikibugs>	 (03PS18) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723
[18:25:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P32084 and previous config saved to /var/cache/conftool/dbconfig/20220728-182550-marostegui.json
[18:28:19] <mutante>	 !log gerrit: rsyncing /home from prod gerrit1001 to /srv/home-gerrit1001.wikimedia.org on gerrit2002 new replica T243027 T313250
[18:28:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:24] <stashbot>	 T243027: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027
[18:28:25] <stashbot>	 T313250: Bring up Gerrit2002 - https://phabricator.wikimedia.org/T313250
[18:31:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrew what do you need one with these?  The task was re-opened and I see some action but not sure what...
[18:32:08] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1082 - https://phabricator.wikimedia.org/T312626 (10Cmjohnson) a:05Cmjohnson→03BTullis @BTullis Can we try and do this Monday, please?
[18:33:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Cmjohnson) @nskaggs I would like to schedule this to be completed on Monday around 1600UTC. Does that work for you?
[18:36:07] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:38:59] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] gitlab: add reserved service IP 208.80.153.8, point to replica-new [dns] - 10https://gerrit.wikimedia.org/r/816835 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[18:40:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P32085 and previous config saved to /var/cache/conftool/dbconfig/20220728-184056-marostegui.json
[18:46:10] <wikibugs>	 (03PS1) 10Zabe: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058)
[18:46:16] <wikibugs>	 (03PS2) 10Brennen Bearnes: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe)
[18:46:30] <wikibugs>	 (03PS1) 10Zabe: Add CheckUser to phan analysis [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818155
[18:46:56] <wikibugs>	 (03PS3) 10Zabe: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058)
[18:47:02] <brennen>	 zabe: sorry to step on your toes there
[18:47:18] <zabe>	 no worries :)
[18:47:48] <zabe>	 (btw. I can test the fix once it's merged)
[18:48:33] <brennen>	 cool
[18:53:08] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe)
[18:53:50] <zabe>	 brennen, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/818155/1 needs to be merged first, in order to make CI pass for the actual fix
[18:53:59] <zabe>	 (but that patch does not need to be synced)
[18:56:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T312990)', diff saved to https://phabricator.wikimedia.org/P32086 and previous config saved to /var/cache/conftool/dbconfig/20220728-185603-marostegui.json
[18:56:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[18:56:09] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[18:56:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[18:56:23] <wikibugs>	 (03CR) 10Brennen Bearnes: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe)
[18:56:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T312990)', diff saved to https://phabricator.wikimedia.org/P32087 and previous config saved to /var/cache/conftool/dbconfig/20220728-185624-marostegui.json
[18:56:33] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Add CheckUser to phan analysis [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818155 (owner: 10Zabe)
[18:56:43] <brennen>	 ah, right.  i'm eternally getting myself confused by patch chains in gerrit.
[18:58:05] <mutante>	  !log gerrit: starting rsync of /srv/gerrit (>240GB) from prod gerrit1001 to /srv/gerrit on gerrit2002 new replica T243027 T313250 ..slowly ..with --bwlimit=1000
[18:58:05] <stashbot>	 T243027: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027
[18:58:06] <stashbot>	 T313250: Bring up Gerrit2002 - https://phabricator.wikimedia.org/T313250
[18:58:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T312990)', diff saved to https://phabricator.wikimedia.org/P32088 and previous config saved to /var/cache/conftool/dbconfig/20220728-185847-marostegui.json
[18:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[19:00:07] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@82e0383]: (no justification provided)
[19:00:25] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@82e0383]: (no justification provided) (duration: 00m 17s)
[19:03:00] <wikibugs>	 (03PS10) 10Krinkle: multiversion: Add dblists-index.php for fast runtime lookups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816029 (https://phabricator.wikimedia.org/T169821)
[19:03:15] <wikibugs>	 (03PS9) 10Krinkle: multiversion: Switch getTagsForWiki() to fast dblists-index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816089 (https://phabricator.wikimedia.org/T169821)
[19:03:18] <wikibugs>	 (03PS10) 10Krinkle: multiversion: Switch getTagsForWiki() to fast dblists-index.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816089 (https://phabricator.wikimedia.org/T169821)
[19:06:20] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe)
[19:12:23] <wikibugs>	 (03Merged) 10jenkins-bot: Add CheckUser to phan analysis [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818155 (owner: 10Zabe)
[19:13:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P32089 and previous config saved to /var/cache/conftool/dbconfig/20220728-191353-marostegui.json
[19:18:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wikimediacloud.org: add cname records for rabbitmq [dns] - 10https://gerrit.wikimedia.org/r/817877 (owner: 10Andrew Bogott)
[19:21:51] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:22:17] <wikibugs>	 10SRE, 10ops-codfw, 10Machine-Learning-Team: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10Papaul) 05Open→03Resolved a:03Papaul
[19:23:30] <wikibugs>	 (03Merged) 10jenkins-bot: Update CheckUser hook for pagination [extensions/Flow] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818154 (https://phabricator.wikimedia.org/T314058) (owner: 10Zabe)
[19:27:20] <brennen>	 zabe: should be on mwdebug1002
[19:27:57] <zabe>	 lemme see
[19:28:59] <zabe>	 brennen, looks good
[19:28:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P32090 and previous config saved to /var/cache/conftool/dbconfig/20220728-192859-marostegui.json
[19:29:59] <brennen>	 zabe: cool, syncing
[19:31:47] <wikibugs>	 10SRE, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10RhinosF1)
[19:32:02] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10RhinosF1)
[19:32:17] <RhinosF1>	 volans: ^ is marked as UBN!
[19:34:33] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.39.0-wmf.22/extensions/Flow: Backport: [[gerrit:818154|Update CheckUser hook for pagination (T314058 T314069)]] (duration: 03m 16s)
[19:34:39] <stashbot>	 T314058: TypeError: Argument 1 passed to Flow\Hooks::onSpecialCheckUserGetLinksFromRow() must be SpecialPage, CheckUserGetEditsPager given - https://phabricator.wikimedia.org/T314058
[19:34:39] <stashbot>	 T314069: Fatal exception of type "TypeError" when using checkuser "Get edits" on Wikidata  - https://phabricator.wikimedia.org/T314069
[19:35:03] <wikibugs>	 (03PS1) 10Andrew Bogott: Reorder the list of of profile::openstack::eqiad1::openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/818210 (https://phabricator.wikimedia.org/T313268)
[19:35:55] <brennen>	 !log 1.39.0-wmf.22 train (T308075): blocker resolved, rolling to all wikis
[19:35:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:00] <stashbot>	 T308075: 1.39.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T308075
[19:36:45] <wikibugs>	 (03PS1) 10TrainBranchBot: all wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818212 (https://phabricator.wikimedia.org/T308075)
[19:36:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818212 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot)
[19:37:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "code matches the explanation, makes sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/817759 (https://phabricator.wikimedia.org/T311746) (owner: 10Jelto)
[19:39:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Reorder the list of of profile::openstack::eqiad1::openstack_controllers [puppet] - 10https://gerrit.wikimedia.org/r/818210 (https://phabricator.wikimedia.org/T313268) (owner: 10Andrew Bogott)
[19:39:57] <icinga-wm>	 PROBLEM - Check systemd state on mw2389 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:40:00] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/818212 (https://phabricator.wikimedia.org/T308075) (owner: 10TrainBranchBot)
[19:44:00] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.22  refs T308075
[19:44:05] <stashbot>	 T308075: 1.39.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T308075
[19:44:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T312990)', diff saved to https://phabricator.wikimedia.org/P32091 and previous config saved to /var/cache/conftool/dbconfig/20220728-194405-marostegui.json
[19:44:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[19:44:12] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[19:44:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[19:44:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T312990)', diff saved to https://phabricator.wikimedia.org/P32092 and previous config saved to /var/cache/conftool/dbconfig/20220728-194426-marostegui.json
[19:45:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Andrew) >>! In T305414#8113020, @Cmjohnson wrote: > @andrew what do you need one with these?  The task was re-opened and I see some action but...
[19:45:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:46:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[19:46:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[19:46:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10Andrew) @nskaggs is out for several weeks, so this should wait until late August unless someone else appears who wants to coordinate on this.
[19:46:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312990)', diff saved to https://phabricator.wikimedia.org/P32093 and previous config saved to /var/cache/conftool/dbconfig/20220728-194654-marostegui.json
[19:47:03] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - 10https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078)
[19:47:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10taavi) 05Open→03Resolved That firewall issue should be sorted with my latest patch above.
[19:47:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[19:49:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "I consider this just part of the setup task or prep for migration, not an access request." [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[19:50:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Andrew) Yep, it's closed!
[19:50:28] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - 10https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078)
[19:53:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "this was actually a noop, you all already had access. it was done in ./hosts/gerrit2002.yaml though, not as nice as by role" [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[19:54:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] elastic: alert on per-node indexing not occurring [alerts] - 10https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) (owner: 10Ryan Kemper)
[19:54:25] <wikibugs>	 (03PS1) 10Dzahn: gerrit/hieradata: delete ./hosts/gerrit2002.yaml [puppet] - 10https://gerrit.wikimedia.org/r/818216
[19:54:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/818216/" [puppet] - 10https://gerrit.wikimedia.org/r/818183 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[19:55:10] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit/hieradata: delete ./hosts/gerrit2002.yaml [puppet] - 10https://gerrit.wikimedia.org/r/818216 (owner: 10Dzahn)
[19:56:07] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:56:13] <wikibugs>	 (03PS3) 10Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - 10https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078)
[20:00:05] <jouncebot>	 brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220728T2000).
[20:00:05] <jouncebot>	 jan_drewniak and stephanebisson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] <thcipriani>	 o/
[20:00:19] <stephanebisson>	 Hi
[20:01:21] <brennen>	 o/
[20:01:22] <stephanebisson>	 I have to run in about 20 minutes. It would be great if we can start with my patch. If not, no big deal, I'll reschedule it for next week.
[20:01:45] <thcipriani>	 sure
[20:01:55] <thcipriani>	 the other looks like it'll take longer to merge
[20:02:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P32094 and previous config saved to /var/cache/conftool/dbconfig/20220728-200200-marostegui.json
[20:02:11] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Revert "styles: Unify on standard external link icon" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818147 (https://phabricator.wikimedia.org/T261391) (owner: 10Jdrewniak)
[20:03:14] <thcipriani>	 stephanebisson: could you rebase your patch? Gerrit isn't able to rebase it automagically.
[20:04:06] <stephanebisson>	 on it
[20:05:35] <wikibugs>	 (03PS3) 10Sbisson: Register Wikistories streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633)
[20:05:59] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Register Wikistories streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) (owner: 10Sbisson)
[20:07:05] <wikibugs>	 (03Merged) 10jenkins-bot: Register Wikistories streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/817263 (https://phabricator.wikimedia.org/T313633) (owner: 10Sbisson)
[20:08:36] <thcipriani>	 stephanebisson: your patch is on mwdebug1002, check please
[20:08:46] <stephanebisson>	 thcipriani ok
[20:12:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:13:27] <stephanebisson>	 thcipriani it looks like I cannot fully test it from a single test server but I think it's ok to sync.
[20:13:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:13:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:13:45] <thcipriani>	 stephanebisson: ok, going live
[20:14:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:14:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] P:gerrit: Export sshkey for gerrit shared services [puppet] - 10https://gerrit.wikimedia.org/r/816715 (https://phabricator.wikimedia.org/T303857) (owner: 10Jbond)
[20:16:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10Andrew) 05Open→03Resolved This works!   ` +----------------------+--------------------------------------+ | Field                | Value...
[20:17:00] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:17:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P32095 and previous config saved to /var/cache/conftool/dbconfig/20220728-201706-marostegui.json
[20:17:58] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[20:18:21] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:817263|Register Wikistories streams (T313633)]] (duration: 03m 24s)
[20:18:25] <stashbot>	 T313633: Register Wikistories streams in InitialiseSettings.php - https://phabricator.wikimedia.org/T313633
[20:18:34] <thcipriani>	 ^ stephanebisson should be live now
[20:18:46] <stephanebisson>	 thcipriani thanks!
[20:19:12] <thcipriani>	 I wonder if scap did something to make that high average get latency with the restart? also...codfw?
[20:19:48] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[20:20:03] <thcipriani>	 well
[20:20:05] <thcipriani>	 nevermind
[20:23:29] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1001/36501/phab2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[20:26:09] <wikibugs>	 (03CR) 10Dzahn: "note how production catalog on phab1001 has git-ssh stuff.. but this host does not..even though it gets other phabricator things." [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[20:26:38] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "styles: Unify on standard external link icon" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818147 (https://phabricator.wikimedia.org/T261391) (owner: 10Jdrewniak)
[20:28:28] <thcipriani>	 ok, merged
[20:28:46] <thcipriani>	 jan_drewniak: around?
[20:28:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Expand retry logic for cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/817378 (https://phabricator.wikimedia.org/T310640) (owner: 10Nskaggs)
[20:32:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T312990)', diff saved to https://phabricator.wikimedia.org/P32096 and previous config saved to /var/cache/conftool/dbconfig/20220728-203212-marostegui.json
[20:32:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[20:32:18] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[20:32:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[20:32:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2103.codfw.wmnet with reason: Maintenance
[20:33:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2103.codfw.wmnet with reason: Maintenance
[20:33:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 16 hosts with reason: Maintenance
[20:33:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 16 hosts with reason: Maintenance
[20:33:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[20:34:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[20:34:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[20:34:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[20:34:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T312990)', diff saved to https://phabricator.wikimedia.org/P32097 and previous config saved to /var/cache/conftool/dbconfig/20220728-203446-marostegui.json
[20:37:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312990)', diff saved to https://phabricator.wikimedia.org/P32098 and previous config saved to /var/cache/conftool/dbconfig/20220728-203709-marostegui.json
[20:38:46] <wikibugs>	 (03PS1) 10Thcipriani: Revert "Revert "styles: Unify on standard external link icon"" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818157
[20:38:57] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Revert "Revert "styles: Unify on standard external link icon"" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818157 (owner: 10Thcipriani)
[20:43:22] <wikibugs>	 (03CR) 10Dzahn: "a lot of the changes are all about exim because once upon a time "mail to phab tasK" was a thing. we need to go through the whole https://" [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[20:45:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Papaul) ` papaul@lsw1-e2-eqiad> show interfaces descriptions | match db1191 ge-0/0/40                  db1191 {#2013339101930} ` ` papaul@lsw1-e2-...
[20:52:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P32099 and previous config saved to /var/cache/conftool/dbconfig/20220728-205215-marostegui.json
[20:53:24] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] "Just looking this over with Jeena - approach seems totally reasonable." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950) (owner: 10Dduvall)
[20:58:30] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "styles: Unify on standard external link icon"" [skins/Vector] (wmf/1.39.0-wmf.22) - 10https://gerrit.wikimedia.org/r/818157 (owner: 10Thcipriani)
[21:03:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10Andrew) 05Resolved→03Open These IPs are reachable from within codfw1dev but not from the greater Internet. @cmooney is that what you'd expect? It's...
[21:03:56] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@a21dea9]: test deploy to phab2001
[21:04:22] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@a21dea9]: test deploy to phab2001 (duration: 00m 27s)
[21:06:57] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@a0f0699]: test deploy to phab2001 (take 2)
[21:07:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P32100 and previous config saved to /var/cache/conftool/dbconfig/20220728-210721-marostegui.json
[21:07:24] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@a0f0699]: test deploy to phab2001 (take 2) (duration: 00m 27s)
[21:08:34] <wikibugs>	 (03PS1) 10Dduvall: phabricator: Support scap3 deployment of configuration [puppet] - 10https://gerrit.wikimedia.org/r/818227 (https://phabricator.wikimedia.org/T313950)
[21:09:10] <wikibugs>	 (03PS3) 10Dduvall: scap: Deploy configuration using scap3 templates [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/817915 (https://phabricator.wikimedia.org/T313950)
[21:17:08] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10odimitrijevic) a:03RKemper
[21:18:06] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@5ec2435]: (no justification provided)
[21:18:15] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@5ec2435]: (no justification provided) (duration: 00m 09s)
[21:22:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312990)', diff saved to https://phabricator.wikimedia.org/P32102 and previous config saved to /var/cache/conftool/dbconfig/20220728-212227-marostegui.json
[21:22:33] <stashbot>	 T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990
[21:26:08] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:30:48] <wikibugs>	 (03PS1) 10Brennen Bearnes: scap: stub out a checks.yaml [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/818231 (https://phabricator.wikimedia.org/T313953)
[21:38:37] <jan_drewniak>	 oh no! I missed the backport window 🤦‍♂️ thcipriani: is it too late to do my backport now?
[21:40:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:41:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:41:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:42:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:46:08] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:51:12] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@e8d4704]: (no justification provided)
[21:51:22] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@e8d4704]: (no justification provided) (duration: 00m 09s)
[22:00:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:46] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: session-c1505.scope,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:09:00] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:15:43] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "This looks fine to me. Should we deploy this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) (owner: 10MarcoAurelio)
[22:18:08] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:21:53] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@9ea9cd1]: (no justification provided)
[22:22:03] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@9ea9cd1]: (no justification provided) (duration: 00m 09s)
[22:27:19] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] site: add phabricator role to phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/685136 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[22:32:48] <wikibugs>	 (03PS1) 10BCornwall: acme-chief: use /usr/bin/env as python interpreter [puppet] - 10https://gerrit.wikimedia.org/r/818234
[22:54:42] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:58:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[23:06:13] <wikibugs>	 (03PS1) 10Ebernhardson: Release updated version of search-extra for 6.8.23 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818241
[23:06:35] <wikibugs>	 (03CR) 10Ori: "Friendly ping" [deployment-charts] - 10https://gerrit.wikimedia.org/r/816203 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori)
[23:08:31] <wikibugs>	 (03CR) 10Bking: [C: 03+2] "Plugins and changelog...nice!" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818241 (owner: 10Ebernhardson)
[23:09:48] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+2] Release updated version of search-extra for 6.8.23 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/818241 (owner: 10Ebernhardson)
[23:21:36] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:22:07] <wikibugs>	 (03PS2) 10Tim Starling: Multi-DC routing special cases for OAuth [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578)
[23:22:23] <wikibugs>	 (03CR) 10Tim Starling: Multi-DC routing special cases for OAuth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[23:29:50] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10RKemper) p:05Unbreak!→03High
[23:30:17] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning: Add xcollazo to analytics-admins - https://phabricator.wikimedia.org/T311176 (10RKemper) >>! In T311176#8113236, @RhinosF1 wrote: > @EChetty: Can you please clarify raising this as UBN? Is this something work should be dropped immediately to do or ca...
[23:39:17] <wikibugs>	 (03PS1) 10Ryan Kemper: analytics-admins: add xcollazo [puppet] - 10https://gerrit.wikimedia.org/r/818266 (https://phabricator.wikimedia.org/T311176)
[23:40:26] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:50:46] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] Multi-DC routing special cases for OAuth [puppet] - 10https://gerrit.wikimedia.org/r/817086 (https://phabricator.wikimedia.org/T313578) (owner: 10Tim Starling)
[23:53:46] <wikibugs>	 (03PS1) 10Ryan Kemper: kerberos: mraish has kerberos principal now [puppet] - 10https://gerrit.wikimedia.org/r/818267 (https://phabricator.wikimedia.org/T313316)
[23:58:03] <wikibugs>	 (03CR) 10Ryan Kemper: "Merging because the corresponding kerberos principal (user) has been added for mraish" [puppet] - 10https://gerrit.wikimedia.org/r/818267 (https://phabricator.wikimedia.org/T313316) (owner: 10Ryan Kemper)
[23:58:05] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] kerberos: mraish has kerberos principal now [puppet] - 10https://gerrit.wikimedia.org/r/818267 (https://phabricator.wikimedia.org/T313316) (owner: 10Ryan Kemper)