[05:29:52] // temporary for testing -- legoktm 2015-07-02
[05:29:52] if ( $wgDBname === 'metawiki' ) {
[05:29:53] $wgCentralAuthEnableUserMerge = false;
[05:30:07] legoktm, is it still needed?
[05:30:23] ughhhh
[05:30:23] we need to delete all of that.
[05:30:29] ok
[05:33:37] oops, found a bug
[05:34:28] in CA?
[05:34:41] https://phabricator.wikimedia.org/T141599
[05:39:29] oops!
[05:39:31] invalid
[18:08:08] anybody here that knows the recentchanges API code well?
[18:08:34] SMalyshev: Maybe. What's your question about it?
[18:09:13] anomie: well, I have this strange thing happening with the WDQS updater: it misses some updates. I wonder if it's possible that rcstream updates may appear out of order?
[18:09:43] I mean, can it be that if I ask for rcstream from a certain point, some updates made after that point haven't arrived yet?
[18:09:45] SMalyshev: rcstream !== recentchanges API code
[18:09:57] I mean the recent changes API, sorry
[18:09:58] * anomie doesn't know anything about rcstream besides that it exists
[18:11:03] it looks like when my tool queried from a certain point it thought the next item was X, but now I am looking at it and I see it's X-1. So I wonder if it's possible that X-1 arrived late?
[18:11:11] and was kind of back-inserted?
[18:12:13] specifically, when the tool looked at https://www.wikidata.org/w/api.php?format=json&action=query&list=recentchanges&rcdir=newer&rcprop=title|ids|timestamp&rcnamespace=0|120&rclimit=100&rccontinue=20160720152523|372669870 at that moment, the first item wasn't there... but now it is
[18:12:28] so I wonder if that's possible, or if my tool is just buggy somehow?
[18:14:57] It's possible if e.g. the transaction writing that entry was slow enough.
[18:16:39] anomie: ok, then is there a) some way to work around that (after all, I need to get the actual stream) or b) at least a way to know how far back I have to look so that it doesn't happen?
[18:17:26] or maybe some better way to get updates?
because I need an update stream, and if it randomly skips items that is not very good for me...
[18:17:31] There's the actual rcstream thing, https://wikitech.wikimedia.org/wiki/RCStream
[18:17:45] anomie: rcstream is not seekable
[18:18:09] I need to be able to start a while back and load all the updates since then
[18:18:58] Load back-updates from the action API at a delay of however many seconds (5, I think) are necessary for transactions to commit, and use rcstream for real-time going forward?
[18:19:43] Or if you're only doing periodic back-updates anyway, just do them at the delay.
[18:19:48] anomie: real-time is not very useful for me - I need to ensure continuity, not real-time
[18:19:58] it's much better to be a bit behind than to skip an update
[18:20:15] So instead of "last update until now", do "last update until now minus 1 minute"
[18:20:20] anomie: ok, so how far back do I need to go, time-wise and rcid-wise?
[18:21:05] anomie: hmmm, so not using rccontinue but just using rcstart with a time 1 minute back? I'll try that
[18:21:31] If it's just slow transactions you're having to wait for, a minute's delay should be more than enough. If there are slow parses before the transactions or something, you might want a bit longer.
[18:21:53] it'll load a ton of repeated updates, which sucks, but it's not too bad since the batch is huge (1000) and it usually gets about 10 real ones anyway... so another 10 old ones won't be a big deal
[18:22:19] Either rcstart with dir=older, or dir=newer with rcend to stop at the T-1 minute mark.
[18:22:29] anomie: hmm... so it can happen that an rc entry is inserted with an rcid/timestamp from like 10 mins back?
[18:23:05] That seems unlikely to me, but I can't promise it doesn't happen if the page takes forever to parse for some reason. I don't know at what point it chooses the timestamp for the update.
[18:23:14] I have to do dir=newer I think...
[18:23:24] (i.e. timestamp, parse, save versus parse, timestamp, save)
[18:23:41] it'd be much better if it chose the IDs when it's done updating... but I understand there might be other considerations :)
[18:24:47] anyway, looks like I'll have to add accommodations for recentchanges back-inserts... that will complicate stuff :)
[18:24:59] The IDs are chosen when it writes, it's an auto-increment column. But other delays might cause the timestamp and ID orderings to not match, as in the example you linked: 372669871 has an earlier timestamp than 372669870.
[18:25:00] :( rather... but I can deal with it I guess
[18:25:48] yeah, so 372669871 is the one missing
[18:25:52] It's not really a "back-insert", it's just that you're having a race condition with whatever is inserting the records.
[18:27:26] hmm, so maybe I could go just by rcid? Is there a way to do that?
[18:28:18] I guess I could make a synthetic rccontinue with a timestamp way back in the past but the rcid from the last one
[18:28:31] anomie: you say rcids would always be in sequence?
[18:28:32] That probably wouldn't help much because transactions could still be slow. And there isn't a way to do that with the action API anyway.
[18:29:11] anomie: but if I've seen rcid X, can rcid Y
Both are in a sequence. The problem is that the server writes the row but doesn't commit the transaction immediately, then you query, then the transaction commits. And slave lag complicates it further.
[18:29:42] * SMalyshev not sure how mysql actually allocates auto-ids with replication & stuff
[18:30:31] https://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html#innodb-auto-increment-lock-modes
[18:30:53] anomie: right... what I am trying to understand is whether it's possible I'll see rcids out of sequence. If I don't see some of them in time, but see them later, that's fine as long as it's in sequence
[18:41:11] SMalyshev: "out of sequence" doesn't make much sense here, and it doesn't sound like you actually care. No matter what you do it's possible that you might send a query for rows "until now", then repeat that query later and have additional rows that you didn't see the first time thanks to races with the insertion transactions. The solution in your case is to identify the maximum time something on the server is allowed to run before getting killed
[18:41:11] and make your query be "until now-T" instead of "until now". For database transactions that's $wgMaxUserDBWriteDuration, which is currently 5 seconds on WMF wikis. But you might want to wait extra time in case parsing happens after timestamping but before the transaction starts, or else dig into the code and disprove that possibility.
[18:42:47] anomie: right... the thing is, currently "now" is defined for my code as the rccontinue string - i.e. time+rcid, as I understand?
[18:42:59] does rccontinue use revid or rcid, btw?
[18:43:05] "now" is defined by whatever value you pass as rcend.
[18:43:15] (when using dir=newer)
[18:43:22] don't you mean rcstart?
[18:43:29] I don't use rcend
[18:44:51] You're saying "Give me rows from $LASTUPDATE until $NOW". With rcdir=newer, that means rcstart=$LASTUPDATE and rcend=$NOW. If you don't specify rcend, it's de facto the exact "now" when the query is submitted.
[18:45:18] my query is actually "Give me up to 1000 rows from $LASTUPDATE"
[18:45:38] so yeah, maybe an implied rcend, but I don't care about it
[18:46:19] I'm telling you that you should care about it if you want to avoid racing with updates-in-progress at the moment you send the query.
[18:48:04] anomie: so that point is something I'm not sure I understand. Suppose I miss an update in progress. Why is it important for me, if I can pick it up at the next update?
[18:49:01] Your problem is that you don't know how to pick it up in the next update, because if races resulted in something just before $LASTUPDATE being missing last time, your next query will skip over it.
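anomie's "until now-T" advice amounts to supplying an explicit rcend rather than letting it default to the moment the request arrives. A minimal sketch of building such a query in Python: the endpoint and parameters come from the URL SMalyshev pasted earlier, the 5-second figure is the $wgMaxUserDBWriteDuration value quoted above, but the helper names and the size of the extra margin are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"

# $wgMaxUserDBWriteDuration on WMF wikis, per the discussion above.
MAX_WRITE_DURATION = timedelta(seconds=5)
# Extra slack for parses that may happen before the transaction starts
# (a guess -- the discussion leaves that possibility open).
EXTRA_MARGIN = timedelta(seconds=5)

def lagged_rcend(now=None):
    """Return "now minus T" in MediaWiki timestamp format (YYYYMMDDHHMMSS)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - MAX_WRITE_DURATION - EXTRA_MARGIN).strftime("%Y%m%d%H%M%S")

def build_rc_query(last_update, now=None):
    """Ask for rows from last_update until now-T, oldest first."""
    return {
        "format": "json",
        "action": "query",
        "list": "recentchanges",
        "rcdir": "newer",
        "rcprop": "title|ids|timestamp",
        "rcnamespace": "0|120",
        "rclimit": 1000,
        "rcstart": last_update,       # $LASTUPDATE
        "rcend": lagged_rcend(now),   # explicit cap at now-T
    }

url = API + "?" + urlencode(build_rc_query("20160720152523"))
```

With rcdir=newer the window is rcstart..rcend, so capping rcend keeps the query from ever racing with rows whose transactions may still be open.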
[18:49:48] There's no way for the API to know that you happened to miss that one row but got some later rows in the time sequence.
[18:50:03] anomie: right. So that's why I am asking about rcid... if rcids are in sequence then I could base off that
[18:50:18] but I'm not sure if the API even looks at rcid...
[18:50:37] looks like while rcid is part of rccontinue, the API kind of ignores it
[18:51:07] I already pointed out that ordering by IDs won't solve the problem; then you're just replacing time-sequence with id-sequence but still having the possibility of races. Maybe less possibility, since we know for sure rc_id is limited to the 5-second $wgMaxUserDBWriteDuration, but it's still there.
[18:51:42] so you're saying the only way is just to go X time back from the last timestamp and hope X is big enough
[18:51:43] The API module doesn't ignore it. The query is ORDER BY rc_timestamp, rc_id, and the continuation is similarly "$rc_timestamp|$rc_id".
[18:52:11] anomie: right, but when I supply rccontinue, what is the query's WHERE condition?
[18:52:37] You could do that, but then you'll always be getting X time worth of repeat rows that weren't raced. I'm telling you to use rcend to not even try to see rows newer than X time.
[18:53:01] the WHERE is rc_timestamp > $timestamp OR (rc_timestamp = $timestamp AND rc_id >= $id)
[18:53:14] ah dammit
[18:53:15] $this->addWhere(
[18:53:15] "rc_timestamp $op $timestamp OR " .
[18:53:15] "(rc_timestamp = $timestamp AND " .
[18:53:15] "rc_id $op= $id)"
[18:53:16] );
[18:53:37] that is bad for me. That means rc_id only applies if the timestamp is the same
[18:53:53] which means if I set the timestamp back, rcid is useless
[18:54:04] I might as well just use rcstart
[18:54:31] I was hoping to avoid constant reloading of the last 1000 updates....
[18:54:54] I already told you how to do that, about five times now: use rcend.
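The clause quoted at [18:53:01] resumes lexicographically on (rc_timestamp, rc_id). A small Python mirror of it (argument names are hypothetical) makes the complaint at [18:53:37] concrete: rc_id only breaks ties at an identical timestamp, so moving the timestamp back makes the id part irrelevant.

```python
def continues_after(row_ts, row_id, cont_ts, cont_id):
    """Mirror of the API's continuation WHERE clause for rcdir=newer:
    rc_timestamp > $timestamp OR (rc_timestamp = $timestamp AND rc_id >= $id).
    Timestamps are MediaWiki-format strings, so > compares chronologically."""
    return row_ts > cont_ts or (row_ts == cont_ts and row_id >= cont_id)

# A later timestamp is included no matter how small its rc_id is...
print(continues_after("20160720152524", 1, "20160720152523", 372669870))          # True
# ...while at the same timestamp, a smaller rc_id is excluded.
print(continues_after("20160720152523", 372669869, "20160720152523", 372669870))  # False
```

This is why the synthetic "old timestamp + last rcid" rccontinue idea from [18:28:18] would not filter anything by id.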
[18:54:55] maybe I can keep rcid and check manually after fetching the feed
[18:55:20] anomie: I'm sorry, I am not seeing how rcend helps me? What would I set rcend to?
[18:55:40] Now minus X time.
[18:56:05] anomie: wait, wouldn't that mean I never get updates that happened within X time of now?
[18:56:21] that sounds like the opposite of what I want...
[18:56:27] I must be missing something here
[18:56:41] Aren't you running this repeatedly from a cron job or something?
[18:57:09] it's not a cron job, it's a java process. yes, it runs repeatedly (or, more precisely, continuously)
[18:57:34] it fetches the rc stream, then parses it, fetches the entities and updates them
[18:57:51] and so on, repeatedly, all the time
[18:58:40] Say X time is 10 minutes. You run it at 18:58, it'll get events up until 18:48. Then 3 minutes later it runs again and gets events between 18:48 and 18:51.
[18:59:08] ah, I see what you mean. But that would mean a constant 10-minute lag
[18:59:21] I'm not sure that's good
[19:00:06] It doesn't have to be 10 minutes. It might be 10 seconds; balance the risk of missing a row against whatever the consequences of lag are.
[19:00:55] Or, if you really want real-time updates, do what I said earlier: use rcstream going forward and just use the API to backfill from the time you first connected to rcstream.
[19:01:02] ok, I see what you mean... But I'm not sure I'm happy with mandatory lag
[19:02:07] no, I don't want to rely on rcstream, since I am always backfilling, essentially. I don't want to have two mechanisms; switching over between them is too complex
[19:02:35] also, updates take time, so if we have 10000 updates in a second, I may not be able to process them in real time. I don't want to depend on real-time
[19:03:15] I want it to be marker-based, but keep close to real-time if possible, and lag temporarily if it can't deal with the load
[19:04:18] matching between rcstream and backfill would be very tricky (since I would have to first start rcstream to know what my backfill starting point is, and only then backfill, and that'd be messy)
[19:04:47] and the data is updating, so it'd be two updates going one over the other at the same time.... too complex.
[19:04:50] one system is better
[19:06:48] I can't think of any options besides "intentional lag to avoid races", "overlap query time ranges to catch races", or "use two systems", because "have the API do a full table lock" isn't scalable ;)
[19:07:05] I think I'll go with overlap
[19:07:50] it's not too bad to load rc's repeatedly since I have duplicate detection anyway. It complicates the code a little but is probably workable
[19:08:00] I'll try to implement it and see if it works
[19:08:06] thanks for the explanations!
[19:10:26] SMalyshev: BTW, note that the API considers rccontinue to be opaque data and reserves the right to change the format at any time without notice. A less risky way to do it would be to ignore rccontinue entirely and just use rcstart for each overlapping query.
[19:15:17] yeah, that's probably what I'd do
[23:35:49] anomie: Update on the CSP RFC? Are there already approved steps waiting for implementation (blocked on resources), or are the next steps still in draft and could perhaps be the subject of another RFC meeting to reach consensus? Or in draft to be perfected by authors and stakeholders first (maybe a wikitech reminder for people to leave feedback and further the RFC)?
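The overlap-plus-duplicate-detection approach SMalyshev settles on at [19:07:05], combined with anomie's [19:10:26] advice to use rcstart instead of the opaque rccontinue token, could be sketched like this. The class and field names are hypothetical and the one-minute overlap window is an assumption; a real updater would also prune seen_ids once entries age out of the overlap window.

```python
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=1)   # assumed overlap window
TS_FMT = "%Y%m%d%H%M%S"          # MediaWiki timestamp format

class RcPoller:
    """Repeatedly fetch recentchanges with overlapping time windows,
    dropping rows already seen by rc_id."""

    def __init__(self, start_ts):
        self.last_ts = start_ts   # newest timestamp processed so far
        self.seen_ids = set()     # rc_ids already handled (dedup)

    def window_start(self):
        """rcstart for the next query: OVERLAP before the newest row we
        processed, so rows committed late are still picked up."""
        t = datetime.strptime(self.last_ts, TS_FMT) - OVERLAP
        return t.strftime(TS_FMT)

    def process(self, rows):
        """rows: list of dicts with 'rcid' and 'timestamp' keys, as in the
        list=recentchanges response. Returns only the not-yet-seen rows."""
        fresh = [r for r in rows if r["rcid"] not in self.seen_ids]
        for r in fresh:
            self.seen_ids.add(r["rcid"])
            if r["timestamp"] > self.last_ts:
                self.last_ts = r["timestamp"]
        return fresh
```

Because every query re-reads the overlap window, a row whose transaction committed late still falls inside some later window, and the rc_id set keeps the repeats from being processed twice.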