[00:34:19] SMalyshev: did you ever sort out that chronologyprotector concern?
[00:41:03] AaronSchulz: yes and no
[00:41:13] by yes I mean I added a function to work around it
[00:41:23] by no I mean the original problem is still there
[00:41:29] and happens from time to time
[00:41:48] AaronSchulz: so if you have any ideas on how to solve it completely I'm interested
[00:46:23] is there a task?
[00:47:23] AaronSchulz: https://phabricator.wikimedia.org/T210901 is the immediate task
[00:48:59] and Special:EntityData has no memcached use or anything?
[00:49:11] not as far as I know
[00:49:21] did someone from wikidata confirm that?
[00:49:28] it's probably cached by varnish, but we use a different URL each time
[00:51:16] I don't see any reference to any caching in the EntityData code
[00:51:45] and it's used rarely enough that there would be no point in putting it into memcached
[00:52:50] also, if it were memcached, wouldn't all servers get the same stale data?
[00:54:27] presumably, unless there were missed purges... but DB lag seems the most likely.
[00:55:07] that's what I am thinking too. Repeating the request a short while later seems to fix the issue
[00:55:10] So, the WDQS updater is reacting to events (in kafka or something?) and then doing API requests? And I assume those events are enqueued at the same time the master is updated. That sort of thing doesn't really have a canned answer at the moment.
[00:55:35] AaronSchulz: yes, the Updater gets events from Kafka and uses the API to retrieve RDF
[00:56:03] and yes, the kafka events are the same events that the edit process generates
[00:56:15] I don't want to store master positions in kafka or anything like that
[00:56:45] usually it works fine, but on rare occasions, when the Updater comes to pick up the RDF, it turns out the revision it picks up is older than the one in the Kafka message
[00:57:11] CP does have the idea of a clientId cookie/header, but the code pushing events probably does not update the corresponding DB positions in redis until later in the request, making that useless here.
[00:57:27] luckily, this lets me know what happened, and I can re-schedule the update for later, but ideally I'd prefer to somehow wait until the update is done
[00:57:38] I guess this will only get worse with more services being added... unless dealt with, maybe in some semi-general way.
[00:58:23] well, it's kinda already worse because things updated in jobs don't work very well already... but at least I'd like things that are updated in the main edit to work smoothly
[00:59:10] AaronSchulz: see also the discussion here: https://phabricator.wikimedia.org/T149239
[01:00:19] ideally, if there was a way to know when all servers have processed the edits, that's when I'd want to go on processing the event (and I don't care if it happens a bit later, as I presume it's no worse than what Web clients get due to CP), but I have no idea how to ensure that
[01:00:51] with jobs of course it's even trickier, so there I am not sure yet how to handle it... it would probably require some tricks
[01:03:30] AaronSchulz: if we had an option that would basically allow waiting until the DB has caught up to revision X, that would solve some of the problems
[01:04:26] but then of course the question is what happens if it takes too long (fail with some 5xx code and retry later?), or how to know that efficiently...
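For reference, a minimal Python sketch of the stale-read handling described above (the real Updater is written in Java; the function name and the revision-parsing helper here are illustrative assumptions, not actual Updater code):

    import random
    import requests

    ENTITY_DATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.ttl"

    def fetch_rdf(entity_id, expected_revision):
        """Fetch RDF for an entity; return None if the replica served stale data."""
        resp = requests.get(
            ENTITY_DATA_URL.format(entity_id=entity_id),
            # random token to bypass the varnish cache, as mentioned above
            params={"nocache": random.randint(0, 10**9)},
        )
        resp.raise_for_status()
        served_revision = parse_revision_from_rdf(resp.text)  # hypothetical helper
        if served_revision < expected_revision:
            # replica lag: the revision from the Kafka event is not visible yet,
            # so the caller re-schedules this update for a later retry
            return None
        return resp.text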
[01:05:45] so, in preOutputCommit(), the main DB commit happens, deferred updates run, CP positions are saved, then post-send deferred updates.
I suppose if the code that enqueues to kafka put the ChronologyProtector::getClientId() value in the message, and made sure to enqueue post-send, then the updater could relay that client ID as a header for the RDF HTTP request.
[01:06:22] so what is that ID?
[01:06:26] that would make it wait for the DB to catch up to that point for regular edits (to say nothing of edit jobs, though those already seem broken now)
[01:07:26] and how does one use it to wait for replicas?
[01:08:32] if MW gets a request with a ChronologyClientId header, then the redis key with the DB master positions will be keyed on that, otherwise it will be keyed on agent/IP. Even in the latter case, getClientId() returns the hash used in the redis key, so other clients with different IPs/agents can keep using the same key.
[01:09:26] so the updater would want to grab values from kafka (themselves from MW) to use for the ChronologyClientId HTTP header to Special:EntityData
[01:09:55] aha, so if I add this header on the client then it'll automatically see the proper replica of the DB?
[01:10:04] would there be any performance effects to this?
[01:10:15] MW should automatically apply CP (unless there is a ChronologyProtection: false header)
[01:11:51] in the normal case of the DB being caught up, there should be very little perf effect.
[01:12:21] similar to when users cause master changes and CP applies to them.
[01:14:05] ok then, it looks doable on both sides, I'll create a task and try to implement it and see what happens
[01:14:58] do you normally send ChronologyProtection: false? That would make sense when you are not given a custom client ID to use. I know some services do that.
[01:15:10] no, I don't send any special headers
[01:15:56] does the updater trigger writes via MW APIs or is it just doing stuff in blazegraph and so on?
[01:16:03] I do have nocache={random} in the URL to avoid varnish caching, but otherwise nothing special
[01:16:28] AaronSchulz: wait, another thing - Special: is not the MW API, strictly speaking - is it still going to work ok?
[01:16:47] AaronSchulz: no, the Updater never writes to MediaWiki
[01:16:55] only reads
[01:17:07] if it never causes MW to do writes in a request (e.g. via an API), I guess using ChronologyProtection: false doesn't make anything faster, since there would never be positions to wait for.
[01:17:20] ah ok
[01:17:20] (well, excluding some integration tests, but I assume that's not what you were talking about)
[01:17:31] CP still applies the same either way (special page or API)
[01:17:39] ok, great
[01:18:44] AaronSchulz: I wonder though what happens if we consume the RecentChanges API...
[01:19:41] do you mean as opposed to kafka?
[01:20:27] yes
[01:20:45] would we also have to add something to RC records?
[01:22:35] though with RC it can actually just avoid consuming events from the last 15 secs or so... that should be enough for the db to catch up, right?
[01:22:58] with Kafka I have to read it as a stream, but with RC I can give bounds to the API I call, so it's easier
[01:25:53] wouldn't you be consuming from replica DBs anyway?
[01:26:13] do we have a recentchanges API that uses DB_MASTER?
[01:26:41] I suppose one replica could be way more lagged than a 'recentchanges' group one, but it's less of a problem I'd reckon
[01:27:54] anyway, job-based edits are trickier, and the job queue uses ChronologyProtection: false, so there would be no DB positions in redis to wait for.
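A sketch (in Python, for illustration) of the proposed consumer-side change, assuming the Kafka event were extended with a hypothetical chronology_id field carrying the ChronologyProtector::getClientId() value from the originating edit request:

    import random
    import requests

    def fetch_rdf_with_cp(entity_id, event):
        """Fetch RDF, relaying the CP client ID from the Kafka event if present."""
        headers = {}
        chronology_id = event.get("chronology_id")  # assumed new field, not in today's events
        if chronology_id:
            # relay the client ID so MW looks up the DB master positions stored
            # in redis under that key and waits for the replica to catch up
            headers["ChronologyClientId"] = chronology_id
        resp = requests.get(
            "https://www.wikidata.org/wiki/Special:EntityData/{}.ttl".format(entity_id),
            params={"nocache": random.randint(0, 10**9)},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.text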
[01:27:58] yeah, it consumes from a replica, but since it's two separate requests it may be a different replica
[01:28:47] what's the typical replica lag? is there a realistic upper bound?
[01:31:03] I rarely see over 1 sec for s8 (not counting labs replicas) on https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1
[01:31:38] AaronSchulz: aha, so a gap of 5 secs (which will be mostly unnoticeable) would do well
[01:32:04] AaronSchulz: I've created https://phabricator.wikimedia.org/T212550 for implementing your advice, will probably get to it in January
[01:32:27] please feel free to comment there if you have any thoughts/advice/comments
[01:32:30] right, making >5 sec lag almost never happen has been a goal of mine for a while
[01:33:33] e.g. the WAN cache has MAX_READ_LAG and friends that decide the thresholds for lowering TTLs based on lag
[01:34:24] that's higher than 5, though it also has to kludge for REPEATABLE_READ
[01:34:59] also related to HOLDOFF_TTL (how long keys have super low TTLs after a purge)
[01:35:09] anyway, 5 sounds sane
[01:35:16] great, thanks!
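And a sketch of the RecentChanges-based alternative discussed above: give the API call an upper time bound a few seconds in the past so replicas have had time to catch up (the 5-second window and the endpoint are assumptions for illustration):

    from datetime import datetime, timedelta, timezone
    import requests

    API_URL = "https://www.wikidata.org/w/api.php"
    LAG_MARGIN = timedelta(seconds=5)  # ">5 sec lag should almost never happen"

    def poll_recent_changes(last_seen):
        """Fetch changes from last_seen up to (now - LAG_MARGIN), oldest first."""
        upper_bound = datetime.now(timezone.utc) - LAG_MARGIN
        resp = requests.get(API_URL, params={
            "action": "query",
            "list": "recentchanges",
            "rcdir": "newer",  # oldest-to-newest, so rcstart is the older bound
            "rcstart": last_seen.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "rcend": upper_bound.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "rcprop": "title|ids|timestamp",
            "rclimit": "max",
            "format": "json",
        })
        resp.raise_for_status()
        return resp.json()["query"]["recentchanges"]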