[00:43:57] <legoktm>	 the email to root@ was me, please ignore
[05:30:42] <Krinkle>	 marostegui: the script seemed to be making progress, about 3% every 4 hours or so. Maybe it got more stuck or maybe that is considered stuck/unusual. I've not checked for years how long is normal for that script. I guess 5 days is not normal then?
[05:31:22] <Krinkle>	 I know it used to take only a few hours total but that was in 2015
[05:32:04] <marostegui>	 Krinkle: it used to take a few hours before, anyways, it was running with the old expire time, so it wasn't useful anyways
[05:32:42] <Krinkle>	 Right. I guess  we'll find out if it goes at the same speed or not :)
[05:32:59] <marostegui>	 if it takes 5 days....then we have another problem :)
[09:20:02] <Amir1>	 _joe_: https://twitter.com/ItalianComments
[09:20:11] <Amir1>	 specially https://twitter.com/ItalianComments/header_photo
[09:26:25] <vgutierrez>	 lol
[09:33:03] <dcaro>	 hahaha xd
[09:33:33] <dcaro>	 i felt like that when they had only chorizo paella in a restaurant in London
[09:35:09] <_joe_>	 Amir1: I know the person behind that twitter account :)
[09:35:29] <_joe_>	 and yes, it's very good
[09:35:34] <_joe_>	 there is also a fb page IIRC
[09:35:54] <vgutierrez>	 dcaro: chorizo paella? and that's called a restaurant?
[09:36:08] <_joe_>	 vgutierrez: "a british restaurant"
[09:36:57] <vgutierrez>	 so in british "restaurant" means "place where you go to suffer"
[09:39:07] <dcaro>	 https://youtu.be/jThoJAwyTXQ?t=126 <- that's supposed to be one of the best cheffs in England
[09:39:33] <dcaro>	 "classic Spanish paella" xd
[09:44:25] <dcaro>	 I'm getting 'cannot load such file -- puppet/util/puppetdb' when trying to add some new tests (profile tests), I'm sure it's something silly, any ideas?
[09:46:22] <vgutierrez>	 dcaro: no tech questions now.. I'm still in shock and tilted from your youtube video
[09:46:35] <dcaro>	 xd
[10:07:46] <Amir1>	 :D
[10:07:55] <Amir1>	 Sorry for hijacking the SRE :D
[10:18:26] <jbond42>	 dcaro: re 'cannot load such file' to you have a patch
[10:18:36] * jbond42 can only appolagise for our cusine
[10:18:48] <dcaro>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/686395
[10:19:05] * jbond42 looking
[10:31:24] <jbond42>	 dcaro: the issues is that the prometheus classes make heave usage of query_resources both directly and via get_clusters.  as such to test that class you would need to mock query_resources.  i took a quick look and i dont think anything elses has needed to do that yet
[10:32:32] <jbond42>	 ~/git/puppet/modules/puppetdbquery/spec/functions/query_resources_spec.rb might be a good startng point
[10:33:06] <jbond42>	 yuo might be able to get away with mocing just prometheus::class_config
[10:34:34] <dcaro>	 ack, thanks! Somehow puppet errors always confuse me :S
[10:38:37] <jbond42>	 dcaro: this seems to be enough https://phabricator.wikimedia.org/P15852
[10:44:05] <dcaro>	 awesome!
[17:58:14] <Krinkle>	 marostegui:
[17:58:15] <Krinkle>	 krinkle at mwmaint1002.eqiad.wmnet in /var/log/mediawiki/mediawiki_job_parser_cache_purging $ tail -n2 syslog.log
[17:58:15] <Krinkle>	 May  7 14:58:45 mwmaint1002 mediawiki_job_parser_cache_purging 1.72%
[17:58:15] <Krinkle>	 May  7 16:59:58 mwmaint1002 mediawiki_job_parser_cache_purging 2.07%
[17:58:26] <Krinkle>	 I don't think it's going any faster than before.
[19:14:35] <marostegui>	 Krinkle: :(
[19:14:36] <marostegui>	 so slow
[19:14:49] <Krinkle>	 marostegui: I don't know how, but somehow the old logs have come back
[19:14:58] <Krinkle>	 I now see logs in that file for several months back
[19:15:04] <Krinkle>	 I thought we had lost them
[19:17:07] <Krinkle>	 marostegui: https://phabricator.wikimedia.org/P15861
[19:17:34] <Krinkle>	 you can see the gaps increase
[19:17:59] <Krinkle>	 I asusme the biggest gap is switchover codfw
[19:21:11] <Krinkle>	 marostegui: I'm wondering if maybe you have insights on https://phabricator.wikimedia.org/T150124 - specifically, why these (simple?) primary key delete operations woudl be causing lag. It confuses me because the deletes happen only from a single thread (single php process, never anything else afaik) and each delete is synchronous on the master, so in general I would think the replicas can apply it just as quickly as the master, so no
[19:21:11] <Krinkle>	 need for sleeping or waiting, right? simple primary key reads, simple primary key deletes.
[19:21:46] <Krinkle>	 removing the half second sleep afer every 100 rows would help a lot :) But last time in 2016 we increased it from 0.1s to 0.5s because jynus noticed it was still causing lag.
[19:22:51] <Krinkle>	 aaron and I never understood why. but maybe it's obvious to you and we should have asked :)
[19:26:46] <jynus>	 Krinkle, it is because it has an extra round trip of 40 seconds for the other datacenter
[19:27:15] <jynus>	 s/seconds/miliseconds/
[19:27:37] <jynus>	 setup those locally on each datacenter and you won't have to wait
[19:30:36] <Krinkle>	 jynus: waiting 40ms seems reasonable, but I don't yet understand why the wait would help avoid lag.
[19:32:22] <Krinkle>	 the natural latency between the DCs seems like that would be a constant, why would it cause more lag if we send more delets within that time period - assuming we only send one at a time and wait for it to be applied on local master (so if its a slow query for some reason, we naturally wait for that; these are without transaction, autocommit; and afaik none of these are slow anyway).
[19:33:15] <jynus>	 yes, but depending on configuration/workload, you don't get the exact same work on both dbs
[19:33:40] <jynus>	 the dbs have to behave as if the write happened punctually at commit time
[19:33:51] <jynus>	 but that is not true for the other writes
[19:34:23] <jynus>	 there is also some overhead in replication
[19:34:38] <jynus>	 so you there is only some much writes you can transmit succesfully
[19:34:41] <Krinkle>	 the delets are typically like: DELETE where key in (100 keys) AND exptime < $timestamp
[19:34:53] <jynus>	 yes, so probably to optimize more
[19:35:04] <jynus>	 more writes per transaction, with same wait
[19:35:16] <jynus>	 to get a higher "throughpu"
[19:35:31] <jynus>	 not sure I am am making myself understood
[19:35:52] <jynus>	 you are not wrong about what you say
[19:36:57] <jynus>	 but sadly real world <> theretical one, there is constraints that affec the throughput and speed to commit
[19:37:18] <Krinkle>	 I guess this is one more reason to stop replicating parsercache between dcs
[19:37:18] <jynus>	 so I would push for a no-wait local application without replicatoin
[19:37:21] <Krinkle>	 as already planned afaik
[19:37:29] <jynus>	 well
[19:37:32] <jynus>	 not sure about changes
[19:37:47] <jynus>	 but purges certainly it is "easier and safer" to do them locally
[19:38:03] <Krinkle>	 right, ou mean run the scfript in both DCs
[19:38:06] <jynus>	 we sometimes find because cache is written unsafely
[19:38:10] <jynus>	 Krinkle, yep
[19:38:21] <Krinkle>	 and then tell mariadb not to replicate these delete queries?
[19:38:26] <jynus>	 that some data is not removed because they didn't have the exact same data
[19:38:28] <Krinkle>	 (is that possible?)
[19:38:29] <jynus>	 Krinkle, correct
[19:38:47] <jynus>	 Krinkle, it is possible to learn that power- just not from a jedy
[19:38:52] <jynus>	 you need a SUPER user
[19:38:58] <jynus>	 :-)
[19:39:01] <jynus>	 but it is doable
[19:39:05] <jynus>	 sql_log_bin=0
[19:39:19] <Krinkle>	 for that its worth, perf team seems fine with not replicating data changes either. when we're active for reads in both DCs, that should be good enough, we do the same with memcached. the replicating righ tnow is mainly I think because of cold switchover
[19:39:38] <jynus>	 so my question to that is, how to warm it up?
[19:40:13] <jynus>	 e.g. now (where we are active-passive) or after a faiover
[19:40:33] <jynus>	 if there is a mechanism for that, sure, if not, replication was the whole reason that was setup
[19:40:43] <jynus>	 but this is not solving that yet
[19:41:02] <jynus>	 this is just doing it for purges, which probably will be easier?
[19:41:17] <jynus>	 talk to manuel, honestly, I am just giving an uninformed opinion
[19:41:23] <Krinkle>	 I mean the transition phase  could go either way. We're much less reliant on PC than we used to be. the accidental cold switch we did 1-2 years ago demonstrated that. But for the multi dc swithc we could either stop replication slightly before and keep what we have, or turn it off shortly afterward.
[19:41:44] <jynus>	 > "the accidental cold switch we did 1-2 years ago demonstrated that"
[19:41:52] <jynus>	 for me that was the worst outage we have in years
[19:41:59] <jynus>	 but that was my perspective
[19:42:14] <jynus>	 mw unresponsive for minutes, half an hour?
[19:42:17] <Krinkle>	 I'll have to look at the numbers again, but afaik it was barely noticable for end users
[19:42:36] <jynus>	 I don't think SRE were very happy with us :-)
[19:42:41] <jynus>	 but ask someone from service ops
[19:42:50] <jynus>	 I don't think we have to solve that now
[19:43:00] <jynus>	 I can give you 3 ideas first:
[19:43:05] <jynus>	 increase transaction size
[19:43:09] <Krinkle>	 which, for a scenario where a DC has to be completely depooled for several days, and then another emergendcy to switch all traffic to that stale DC, seeems not too bad.
[19:43:11] <jynus>	 if that doesn't work
[19:43:21] <jynus>	 make purges local-dc-only
[19:43:29] <jynus>	 and if that doesn't work, try something else
[19:43:43] <Krinkle>	 I assume we can also purge more different shards at the smae time
[19:43:51] <jynus>	 Krinkle, sure, but we don't want that on purpose
[19:43:56] <Krinkle>	 right now we do foreach server: foreach table: delete 100 rows + sleep.
[19:44:02] <Krinkle>	 so all other servers do nothing meanwhile
[19:44:06] <jynus>	 e.g. if you setup some way to fix that, no issue
[19:44:16] <jynus>	 such as a copy of pc contents before a failover
[19:44:26] <jynus>	 but we don't like cross-dc communication
[19:44:38] <jynus>	 in fact, we have to assume the other dc is unressponsive
[19:44:49] <jynus>	 so we need to have indendence between dcs
[19:45:05] <jynus>	 purging locally to dc works nicely with that model (we do it or not)
[19:45:28] <jynus>	 if you find a model that also works well, that is ok too, but I am pointing to the current system
[19:46:11] <jynus>	 Krinkle multiple ways to optimize, indeed
[19:46:20] <Krinkle>	 thx, I'll summarise this on-ticket. this is a new concern to me, I had assumped stopping replication was uncontroversial from sre/dba side and only blocked on perf team from MW-logic perspectice.
[19:46:36] <jynus>	 so, it is uncontroversial to me
[19:46:57] <jynus>	 if there is a method on the app to keep it warm, or we are always active active or whatever
[19:47:28] <Krinkle>	 my naive/stupid thinking is that when a DC goes offline, we slowly ramp it back up and thus naturally warm up (maybe even mirroring some amount of traffic to speed up warm up, this is not just about parser cache anyway). Assuming we don't have both down, that is, seems like there isn't a problem.
[19:47:29] <jynus>	 there is no need for replication
[19:47:49] <jynus>	 uf, that is controversial not for DBAS, but my suspcion is for SRE aims
[19:47:54] <jynus>	 "slowly ramp up"
[19:48:07] <jynus>	 I think the goal is to being able to failover as fast as possible
[19:48:33] <Krinkle>	 I'm talking about after multi-dc yes, so depooling quickly is fine always (intheory)
[19:48:39] <jynus>	 yes
[19:48:44] <jynus>	 depooling is no issue
[19:48:50] <jynus>	 I see
[19:49:04] <jynus>	 so you mean repool will never happen on an emergency
[19:49:23] <Krinkle>	 if we need to switch back in a second emergency that basically means both DCs were down almost at the same time
[19:49:35] <jynus>	 that's fair
[19:49:53] <jynus>	 then I understand, but that asummes active-active on a normal setup
[19:50:10] <jynus>	 which basically answers my question of "how to keep pc warm"
[19:50:29] <jynus>	 no issue with that
[19:50:44] <jynus>	 in that case you absolutely want also no replication of purgues
[19:51:04] <jynus>	 so you can start already testing that with the safest operation
[19:51:16] <jynus>	 and making thins run faster! :-)
[19:52:13] <jynus>	 I am brainstorming wais to speed up that, I don't have any definitive answer, talk to manuel
[19:52:59] <Krinkle>	 jynus: given an inactive codfw, that means codfw dbs are not likelty to be in a very different state, and even if both active, replication keeps them very similar (and we're talking by design about rows that have not changed in many days), so I'm still a bit unsure why a delete query by primary key would be "slower" to apply to a replicat than a master.
[19:53:11] <Krinkle>	 or, why 100ms was not enough in 2016
[19:53:36] <jynus>	 to be fair, 500ms is still not enough in 2021
[19:53:44] <jynus>	 we regularly have lag
[19:54:02] <Krinkle>	 I noticed a few spikes in pc* host lag yeah
[19:54:09] <Krinkle>	 but hard to tell the source/cause
[19:54:14] <Krinkle>	 maybe not the deletes?
[19:54:19] <jynus>	 "many writes"
[19:54:24] <jynus>	 it is not the delete's faults
[19:54:48] <jynus>	 it is just a way not pritize those over the regular replaces
[19:54:55] <jynus>	 "make them slower"
[19:55:12] <jynus>	 not like "it has to be 500ms or 600ms"
[19:55:27] <Krinkle>	 right
[19:55:27] <jynus>	 it is more like "make writes use as little resources as possible"
[19:55:44] <Krinkle>	 the regular inserts are unthrottled (naturally) and can already cause lag
[19:55:49] <Krinkle>	 so we want to be graceful toward that
[19:56:06] <jynus>	 and to be fair, there is less and less reasons to use mysql for this
[19:56:11] <jynus>	 if we won't replicate
[19:56:19] <jynus>	 sure, we can use a disk-backed store
[19:56:30] <jynus>	 but some that is way lighter
[19:56:35] <Krinkle>	 I hear redis is good at keeping memory store on disk
[19:56:58] <Krinkle>	 (don't kill me joe)
[19:56:58] <jynus>	 cannot say if serious or now
[19:57:00] <jynus>	 ok
[19:57:07] <jynus>	 *not
[19:57:25] <jynus>	 I was thinking something like memcached but persisted to disk
[19:57:42] <jynus>	 but very very light on writes
[19:57:54] <jynus>	 even if we lose half of the writes
[19:57:59] <jynus>	 on a crash
[19:58:58] <jynus>	 I really don't have anything that comes to me, because we are going to transition of how we use it
[19:59:09] <jynus>	 but
[19:59:43] <jynus>	 to not do any refactoring- if we remove binlog and replication overhead, you could get quite a lot of overhead back
[19:59:54] <jynus>	 until we find something better
[20:01:02] <Krinkle>	 I expect increasing batch size will initially possibley cut it in half, but on subsequent runs ideally more because right now we are also having to delete more each run because we don't run daily.
[20:01:15] <Krinkle>	 so it'll naturally get better after that
[20:01:45] <jynus>	 if with batch size you mean the rows per transaction
[20:01:56] <Krinkle>	 per delete query yeah ( we don't use transactions in this code)
[20:02:05] <Krinkle>	 we do a select limit 100 for primary keys and then delete
[20:02:18] <jynus>	 well, from mysql perspective autocommit just does a transaction per query
[20:02:18] <Krinkle>	 could bump to 150 or 200, will ask manuel
[20:02:34] <jynus>	 I think for regular dbs
[20:02:41] <jynus>	 the "rule" is 1000 rows
[20:03:08] <Krinkle>	 mediawiki is configured to do 100 rows per update, across all features. has been constant for a very long time.
[20:03:09] <jynus>	 although maybe pc rows are larger
[20:03:19] <jynus>	 ok
[20:03:26] <Krinkle>	 parsercache is not currently overriding that.
[20:03:36] <jynus>	 not sure, just talk to manuel and test several options
[20:03:53] <jynus>	 certainly the thrughput will increase like that
[20:04:13] <jynus>	 now what exactly you can increase without creating lag has to be tested :-)