[08:35:01] <arnaudb>	 if someone around has 1min to give my patch a sanity check that would be amazing :)
[08:35:07] <arnaudb>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071883
[08:35:25] <arnaudb>	 (please and thank you <3)
[10:55:48] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s3 on db2205 is CRITICAL: 7.988e+04 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2205&var-port=9104
[10:55:56] <arnaudb>	 (normal)
[11:18:53] <jynus>	 arnaudb: done
[11:19:17] <arnaudb>	 thanks!
[11:19:57] <jynus>	 but I would give priority to T374425
[11:19:58] <stashbot>	 T374425: db2205 stuck replication/processlist - https://phabricator.wikimedia.org/T374425
[11:20:05] <arnaudb>	 i'm on it already!
[11:20:19] <jynus>	 nobody is going to give you a bad time if you don't finish regular maintence
[11:20:35] <jynus>	 specially if you have a good excuse like that ticket!
[11:20:50] <jynus>	 so please don't stress about rushing it
[12:17:48] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s3 on db2205 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2205&var-port=9104
[12:55:30] <volans>	 Amir1: https://gerrit.wikimedia.org/r/c/operations/software/+/1072168 seems a duplication of the work done in T371351 or am I missing something else? We have two cookbooks for that since end of july that were reviewed and tested and that will be used for the switchover. (cc jynus as you commented on the CR). 
[12:55:31] <stashbot>	 T371351: Automate the pre/post switchover tasks related to databases - https://phabricator.wikimedia.org/T371351
[12:59:46] <jynus>	 I thanked Amir as I asked him to publish it for public review/tracking, rather than being on a paste
[13:00:21] <Amir1>	 volans: That is intentional. For this dc switchover, I'm planning to use the most basic version I wrote because it I couldn't well understand the written cookbook, this is an extremely sensitive operation (if it messes up, it's going to cause data corruption and or split brain which is really hard to clean up) and specially since Manuel won't be around to help in case things go wrong. So I wrote something that basically 
[13:00:21] <Amir1>	 emulates bash
[13:00:37] <Amir1>	 for later switchovers, we can use the cookbook
[13:01:31] <volans>	 I totally and strongly disagree
[13:01:36] <volans>	 for so many reasons
[13:05:01] <Amir1>	 I understand but I am responsible for setting up circular replication in production before the switchover and I can't run something I'm not comfortable with in our production, specially for something this sensitive. 
[13:09:04] <volans>	 the cookbooks have been ready for review and testing since August 1st and my understanding was that they were tested by Arnaud and Manuel already
[13:09:38] <jynus>	 I think we should discuss this more in depth, I though there was more agreement on this topic than I thought
[13:10:36] <volans>	 no concern was voiced in the task or in any meeting I've been in, this is a total surprise for me
[13:10:43] <Amir1>	 it was never tested in actual production section (except test-s4) that worries me a lot. We have specific GTID stuff in production that are different than standard setups
[13:11:25] <Amir1>	 and test-s4 was tested the last day before Manuel's leave
[13:11:25] <jynus>	 I suggest let's discuss this next monday
[13:13:32] <Amir1>	 sounds good
[13:15:31] <jynus>	 going for lunch, clearly there has been some misscommunication here, please keep it cool (as it has been so far) 0:-)
[13:17:58] <jynus>	 arnaudb: one question (no rush on getting an answer)- did you see the semisync issue happening before the script upgrade? Did you used it successfully for some time (trying to discard that as a possible cause)
[13:19:50] <arnaudb>	 I'm not sure which script you're referring to jynus
[13:21:10] <claime>	 hey peeps, last run of mediawiki_job_purge_parsercache_pc5 failed 
[13:21:12] <claime>	 Sep 11 01:00:00 mwmaint1002 mediawiki_job_purge_parsercache_pc5[1649]: InvalidArgumentException from line 1388 of /srv/mediawiki/php-1.43.0-wmf.21/includes/objectcache/SqlBagOStuff.php: Unknown server tag: pc5
[13:21:14] <Amir1>	 the switchover?
[13:21:26] <jynus>	 yes
[13:21:29] <claime>	 Is it safe to restart, or should I let it be and it'll run next time?
[13:21:31] <Amir1>	 claime: that's partially me 
[13:21:49] <Amir1>	 I'm bringing it online today
[13:21:50] <jynus>	 claime: being setup
[13:21:53] <claime>	 aaah
[13:21:55] <claime>	 ok
[13:22:01] <Amir1>	 but claime it shouldn't have even tried to run it
[13:22:12] <Amir1>	 how it started running it before I got it online
[13:22:15] <jynus>	 arnaudb: maybe it started to happen after the last patch or something?
[13:22:36] <Amir1>	 I couldn't well understand the written cookbook, this is an extremely sensitive operation (if it messes up, it's going to cause data corruption and or split brain which is really hard to clean up) and specially since Manuel won't be around to help in case things go wrong
[13:22:43] <Amir1>	 wrong paste
[13:22:46] <jynus>	 I am trying to figure out reasons why it happened to avoid it in the future
[13:22:48] <Amir1>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/972382/2/modules/profile/manifests/mediawiki/maintenance/parsercachepurging.pp
[13:23:00] <Amir1>	 claime: We need to add them explicitly here, I haven't done that :D
[13:23:09] <Amir1>	 (that's pc4 for last year)
[13:23:45] <Amir1>	 jynus: regarding switchover, we have done a lot of switchovers since the upgrade, that'd be weird if it breaks only for that one
[13:24:00] <jynus>	 thanks, that is what I was asking
[13:24:02] <Amir1>	 still might be possible but I don't know
[13:24:09] <claime>	 Amir1: It got added (I suppose by mistake) in Id4bebbcfa86d4a2539ff58d8868d4a98be7b17c4
[13:24:25] <claime>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071750/6/modules/profile/manifests/mediawiki/maintenance/parsercachepurging.pp
[13:24:29] <Amir1>	 claime: ah, I missed it, sorry
[13:24:31] <jynus>	 Amir1: sure, but if it certainly is with the info you gave me less likely
[13:24:42] <Amir1>	 okay then, I'm putting it online today, it shouldn't cause issues
[13:24:57] <arnaudb>	 ah yep, mybad ↑
[13:25:01] <arnaudb>	 sorry!
[13:25:03] <claime>	 Amir1: cool, just wanted to make sure
[13:25:21] <jynus>	 re: claime I'd suggest acking the alert for now with expiration
[13:25:35] <claime>	 yep that's what I was gonna do
[13:25:42] <jynus>	 there are some round dependencies there that sometimes are hard to solve
[13:25:58] <jynus>	 you need things properly setup to work, and you need things to work to properly setup
[13:26:08] <jynus>	 *just stateful things*
[13:26:14] <Amir1>	 I need to create the tables first, which I have done last year https://phabricator.wikimedia.org/T350367#9309645 but I don't remember how
[13:26:18] * Amir1 kicks his old self
[13:26:29] <Amir1>	 (for not documenting better)
[13:26:40] <jynus>	 I think there is a maintenance mw script
[13:26:46] <jynus>	 which one? idk
[13:26:52] <claime>	 acknowledged for one day
[13:28:59] <jynus>	 arnaudb: not your fault, we just missed downtimeing it while it was being setup
[13:40:25] <Amir1>	 aha, I wrote a script for it
[13:57:20] <Amir1>	 pc5 is live in mwdebug1002 now
[14:06:13] <Amir1>	 it works fine
[14:12:20] <Amir1>	 now the fun part of deploying this everywhere
[14:25:57] <topranks>	 urandom: thanos-fe2004 is on our list of hosts to move switch later, are you able to depool it?
[14:25:59] <topranks>	 https://phabricator.wikimedia.org/T373101
[14:56:06] <urandom>	 topranks: yes, I'll get it
[14:56:38] <urandom>	 topranks: the window starts in ~1 hours though yes?
[14:57:19] <topranks>	 yep we're an hour away, and flexible if you need longer just let me know 
[14:58:02] <urandom>	 No no, it's fine.  Just wanted to make sure I didn't get the time wrong 😀
[14:58:39] <Amir1>	 about to pool pc5
[15:00:08] <Amir1>	 pooled
[15:01:14] <arnaudb>	 no explosion
[15:05:36] <Amir1>	 \o/
[15:05:43] <Amir1>	 I love no explosions
[15:06:01] <Amir1>	 appservers latencies are up but that's intentional: https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m
[15:06:09] <Amir1>	 *expected 
[15:06:10] <arnaudb>	 its a good indication for movie quality, also for reality
[15:06:19] <arnaudb>	 its due to cache warming up?
[15:06:31] <Amir1>	 yup, entries are being displaced 
[15:23:31] <jynus>	 keep disk space monitored, it shouldn't happen, but it wouldn't be impossible that there is an increase due to stale entries (it happened in the past)
[15:23:45] <jynus>	 *increase in disk usage
[15:29:50] <arnaudb>	 good idea jynus: T374551
[15:29:51] <stashbot>	 T374551: mariadb - monitoring - predict linear on disk/ram usage - https://phabricator.wikimedia.org/T374551
[15:30:26] <jynus>	 arnaudb: we triend in the past, but sadly that is very error prone
[15:30:53] <arnaudb>	 I would not rely only on this, but it's helped me 
[15:30:59] <jynus>	 for example, when loading data up to e.g. 60% in 1h, that will predict that you will run out of disk space in 2 hours
[15:31:08] <arnaudb>	 better have it and not need it than not have it and need it :p
[15:31:24] <jynus>	 well, we have stuff
[15:31:32] <arnaudb>	 we'll have even more :D
[15:31:38] <Amir1>	 thankfully, we did an expansion last year (pc4) which was more disruptive by nature (more keys being displaced) and that didn't cause issues 
[15:31:47] * arnaudb feels like a tool hoarder 
[15:31:54] <jynus>	 arnaudb: actually that is not necesarilly true
[15:32:15] <Amir1>	 you never know but given that last year it was fine, I'm hopeful 
[15:32:25] <jynus>	 more != better, I'd prefer to work on signal-to-noise ratio rathern than imprerfect signals
[15:32:46] <jynus>	 that doesn't mean there are no gaps, but we should be careful about them
[15:33:02] <jynus>	 for example, I would prefer to have that in a dashboard, not a pinging alert
[15:33:08] <jynus>	 and I belive we already have it
[15:33:51] <jynus>	 see: https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1
[15:34:35] <jynus>	 Amir1: yeah, that is why I said it was corrected when it was resharded years ago
[15:34:51] <jynus>	 but for some time it had a lot of issues, I belive Timo helped fixing many of those
[15:35:07] <jynus>	 and you too I want to remember?
[15:35:24] <Amir1>	 we did some changes and I will do a lot more next Q
[15:35:33] <Amir1>	 T373037
[15:35:34] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[15:35:51] <jynus>	 in the past I think purges happened only at night
[15:36:08] <jynus>	 and at some point they started happening continously
[15:36:16] <Amir1>	 for example, we removed 30% of entries by merging mobile and desktop PC entries
[15:36:48] <jynus>	 if I were me, I would rebuild it on top of something that is not mysql
[15:37:06] <jynus>	 but you know I am not going to complain about incremental upgrades :-)
[15:37:48] <Amir1>	 I have gone back and forth between mysql and non-mysql and so far has landed on incremental improvements, eventually it should move to another non-mysql solution but future-me problem
[15:38:15] <jynus>	 arnaudb: just to be clear, I wasn't saying no, just giving context that we tried it in the past and it wasn't easy, so will require some extra work
[15:38:45] <Amir1>	 biggest thing is that right now parsoid for read is being rolled out and I want to make sure they can have "both" entries for at least group of wikis 
[15:39:04] <jynus>	 yeah, and if possible something less hand crafted than memcache + mysql
[15:39:22] <jynus>	 something that would handle most of that complexity on a different codebase we don't maintain
[15:40:11] <arnaudb>	 oh 100% on the same page jynus that makes sense to me, I've been dealing with such rules in the past and the saved me a few times. I'll keep in mind to avoid piling on noise  (we have several angles already in the pipes to further improve our current situation)
[15:40:44] <jynus>	 Amir1: and something that values less C and more AP, rather than mysql regular model of CP over P, that core data requires
[15:40:55] <Amir1>	 yup
[15:41:40] <jynus>	 arnaudb: there is still a lot of need for more automated capacity planning
[15:42:09] <arnaudb>	 this is not really automated capacity planning that i'm aiming for at first iteration but more like a stick to bump on the wall before our collective nose does :D
[15:42:10] <jynus>	 I hate when I have to go graph by graph on every backup host to see how many servers we need to by in X years
[15:42:49] <jynus>	 I think that would be a more useful use case for prodictions (long term)
[15:42:53] <jynus>	 *predictions
[15:43:03] <arnaudb>	 yeah but I never managed to get that far in monitoring quality
[15:43:14] <arnaudb>	 this would require some datadog like intelligence behind metrics
[15:43:42] <jynus>	 the capacity planning or the disk full?
[15:44:00] <arnaudb>	 long term capacity planning I mean, outside of browsing graphs
[15:44:41] <jynus>	 not really, for what I need, we could do it a few graphana graphs + prometheus formulas
[15:44:56] <arnaudb>	 oh nice
[15:45:00] <jynus>	 what we need is better organization (and zarcillo doesn't count)
[15:45:21] <arnaudb>	 zarciwho? :p
[15:58:39] <Amir1>	 btw, did we investigate why db1166 failed today?
[15:59:00] <arnaudb>	 index thingy
[15:59:22] <arnaudb>	 i've updated the tracking gsheet
[16:01:08] <Amir1>	 ah, thanks
[16:27:39] <topranks>	 urandom: all done if you want to repool that host and check the be's 
[16:27:39] <topranks>	 thanks
[16:27:58] <topranks>	 arnaudb: same goes for the db hosts 
[16:28:07] <urandom>	 topranks: perfect; thanks!
[16:29:32] <arnaudb>	 neat, thanks topranks!