[04:03:55] duesen: what is Lua core? https://gerrit.wikimedia.org/r/c/mediawiki/core/+/684967 (thought maybe typo, but it's in multiple places).
[08:46:52] o/ Could anyone enlighten me on the current way that, after a page has had page props added to it at ParserOutput generation time, those properties eventually make their way into the page_props table?
[08:46:59] I'm guessing links updates, jobs, deferred updates etc. Just trying to figure out why such an update takes so many minutes, and whether it is to be expected, or if I am hunting a real bug
[09:37:46] duesen: ^^ ;) poke :D
[09:37:53] you're in demand today
[11:28:46] Krinkle: Lua code. darn.
[11:29:45] addshore: iirc page_props should be written along with backlinks, in a LinksUpdate that is executed immediately after the response is sent. Not in a job.
[11:29:55] Some deferred updates get re-scheduled as jobs when they fail though
[11:51:10] duesen: so the case here would be dispatching from wikibase. What we see is 1) a change on wikibase 2) the timestamp for parser output generation on the client changes within a few seconds, and the page prop is in the parser output 3) some minutes later the prop appears in the page_props table
[11:51:54] in that case, that is not an edit with a response sent; this would all be triggered from a job in the first place, so I imagine all of the other things would also end up in jobs? but perhaps not the same initial job
[11:52:42] the confusing thing from our point of view right now is that we see the page prop appear in the parser output straight away, but not in the table until some minutes later
[11:53:05] is that an expected sort of time for whatever this job is to be waiting in the job queue? (if so, then we have no problem :))
[12:25:09] addshore: when you saw it "in the parser output", where/how were you looking? If you're looking based on a page view, then that is just a page-view reparse, not the result of any job having run. It might very well be that the job just hasn't run by then. The reason it reparses on view is that page_touched propagates ahead of the jobs when changes propagate from templates/wikibase.
[12:25:56] I was testing in shell.php with a snippet getting the PO directly, but I should try and eliminate that as a possibility
[12:25:57] (MediaWiki\MediaWikiServices::getInstance()->getParserCache())->get(Title::newFromText('PagePropsies')->toPageRecord(),ParserOptions::newCanonical())->getProperty('wikibase_item');
[12:26:17] That's cached but otherwise view-only, and not going to make DB writes during random page views
[12:27:52] I have another question while I'm writing this up: we have getSecondaryDataUpdates which creates some updates. As far as I can tell they get executed via either DeferredUpdates::attemptUpdate or DeferredUpdates::addUpdate in DerivedPageDataUpdater, based on $options['defer']. But what decides that option?
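For reference, a minimal shell.php sketch of the comparison being made here: reading the property from the cached ParserOutput versus from the page_props table directly. The page title is the test page from this log; wfGetDB, the ParserCache service, and toPageRecord() are 1.36-era APIs, so treat the exact calls as assumptions:

    // The parser output carries the property immediately after a reparse;
    // the page_props row only appears once the LinksUpdate has run.
    $services = MediaWiki\MediaWikiServices::getInstance();
    $title = Title::newFromText( 'PagePropsies' );

    $po = $services->getParserCache()->get( $title->toPageRecord(), ParserOptions::newCanonical() );
    $fromParserOutput = $po ? $po->getProperty( 'wikibase_item' ) : null;

    $fromTable = wfGetDB( DB_REPLICA )->selectField(
        'page_props',
        'pp_value',
        [ 'pp_page' => $title->getArticleID(), 'pp_propname' => 'wikibase_item' ],
        __METHOD__
    );

    // During the window described above, these two can disagree.
    var_dump( $fromParserOutput, $fromTable );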
[12:31:53] Ahh Krinkle, right, so that probably also plays into this, yes; I should verify that
[12:32:18] if the page is purged, but a user view causes the reparse, then the page_props table update would always land in a job
[12:34:12] but the only thing that should trigger the purge / reparse in that case is the dispatching process, as the page was 100% cached before this process gets kicked off
[12:40:39] Krinkle: you're spot on, I see a purge (somehow) happen within a few seconds, and then the parser cache remains empty
[12:41:25] so then a user view is what triggers the PO regeneration
[12:44:32] As part of dispatching we trigger a HTMLCacheUpdateJob and a RefreshLinksJob
[12:44:46] addshore: yeah, in general, links updates happen only post-send after a direct edit, or during refreshLinks jobs. The parser cache is populated more often (e.g. every 30 days after expiry), and sometimes a little bit ahead of the job if someone views the page between the cascading purge and the job
[12:45:28] The html cache job is tiny. It sends out the CDN purge and bumps page_touched, which makes logged-in users and CDN misses see their own changes right away
[12:45:50] right, so the html cache job essentially invalidates the parser cache
[12:46:13] and then some time later the RefreshLinksJob gets completed, actually filling the table?
[12:46:34] https://www.mediawiki.org/wiki/Michael_Jackson_effect
[12:46:41] We don't invalidate it
[12:46:52] But yes, we effectively ignore it in most cases
[12:47:28] This is why the parser cache is page/title based and not keyed on rev id or touched time
[12:47:56] * addshore looks up the current delays around RefreshLinksJob
[12:48:00] That's right. Reparses take longer than bumping an int field
[12:48:21] As well as big db writes
[12:48:58] And lower job queue concurrency assigned in the WMF prod change-prop config.
[12:49:03] Right, and https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=refreshLinks indicates a backlog of a few hours for the job?
[12:51:04] I don't trust/understand that metric currently. It is being refactored. But yes, it would seem that it's at least 15min behind over the past few weeks when zooming in
[12:51:57] addshore: regarding "defer", I think that's mostly a post-edit thing, not used otherwise. In jobs, deferred updates are almost always synchronous anyway.
[12:53:22] I think we set it based on whether doing it "now" during your edit would benefit you as editor. E.g. if you have default parser preferences, it helps speed up your page view after the edit redirect. If not, then we defer the canonical one and do your personal one instead
[12:58:36] okay, I think all of this helps, ty, I'm going to need to write up another comment
[12:58:38] :)
[13:33:34] addshore: the html cache job invalidates the CDN, not the parser cache.
[13:33:47] (maybe that too? but I don't think so)
[13:34:56] Is the html cache job what bumps page_touched? or is that the links update part? but krinkle said that page_touched would propagate ahead of the thing doing the work anyway
[13:35:14] addshore: also, generally "defer" means "later in this request", not "some time later in a background job".
[13:36:07] addshore: page_touched would get bumped before scheduling the jobs that update the parser cache and invalidate the cdn.
[13:36:24] hmmm, so in wikibase repo code? are you sure?
[13:36:26] basically, it marks the parser cache for the page as dirty.
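To make "marks the parser cache as dirty" concrete: a sketch of the staleness check, assuming ParserCache::getDirty(), WikiPage::factory(), and Title::getTouched() as in 1.36-era core. Nothing is deleted from the cache; an entry whose cache time is older than page_touched is simply ignored by ParserCache::get(), which is why bumping a single timestamp is enough to force a reparse on the next view:

    // getDirty() returns the entry even when stale; get() would return false.
    $services = MediaWiki\MediaWikiServices::getInstance();
    $title = Title::newFromText( 'PagePropsies' );
    $po = $services->getParserCache()->getDirty(
        WikiPage::factory( $title ),
        ParserOptions::newCanonical()
    );
    // TS_MW timestamps compare correctly as strings.
    if ( $po && $po->getCacheTime() < $title->getTouched() ) {
        // Dirty: htmlCacheUpdate bumped page_touched, so the next view reparses.
    }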
[13:36:47] I was talking about general core behavior
[13:36:48] scratch that last bit *checks the code*
[13:36:51] not sure when wikibase does it
[13:36:51] gotcha
[13:38:25] well, all wikibase does is 1) purge the web cache, 2) refresh links, 3) recent changes injection
[13:38:45] so I'm now unclear when this page_touched thing is bumped, unless it is part of refresh links
[13:39:20] I vaguely remember writing that code, and being very confused about the same thing
[13:39:26] it's quite possible that I got it wrong at the time
[13:39:56] addshore: look for Title::invalidateCache
[13:40:28] not called in Wikibase code at all, so it must be in one of these mediawiki parts
[13:41:05] RefreshLinksJob has an InfoAction::invalidateCache( $title );
[13:41:43] that's an unrelated cache
[13:42:32] bah, I indeed don't see what causes the parser cache to be missed so quickly then
[13:43:24] addshore: at a glance, I think WikiPageUpdater::scheduleRefreshLinks should call Title::invalidateCache(). Note however that it uses a DeferredUpdate internally (not sure why, that seems a bit pointless).
[13:43:47] ...especially since it uses PRESEND, not POSTSEND...
[13:46:29] should but currently doesn't look like it does? or should, and it is probably hidden in the job?
[13:46:45] shouldn't be.
[13:46:55] marking the pages as dirty should be done asap, and it's cheap.
[13:47:11] it could however be done in bulk, with a single update query for all the pages.
[13:52:51] addshore, duesen: the html cache update job does the page_touched bumps, as I said.
[13:53:17] That's the one that recurses very quickly to purge/bump stuff
[13:53:34] The refresh job recurses on its own for reparses and links updates
[13:53:36] okay, so I think all of the current stuff makes sense :)
[13:54:18] https://github.com/wikimedia/mediawiki/blob/master/includes/jobqueue/jobs/HTMLCacheUpdateJob.php#L143
[13:54:43] The code has moved, but overall this has been fairly stable since 2013-ish
[19:08:17] is there a magic way to see when a linksupdate job last ran for a page?
[19:50:51] addshore: debug logs, I suppose. There is also page.page_links_updated in the database
[19:51:10] ooooh, page.page_links_updated might be useful for me, thanks!
[19:51:13] JobExecutor or JobRunner is a channel, but debuglogfile should give you it all
[19:53:03] I'm having to test this in production unfortunately, as we have all failed to find this apparent edge case locally
[20:30:54] Looks like it takes 15 mins for a links update on testwiki.
[20:30:54] I'm still waiting for the one I scheduled on commonswiki to run, will check back on the links update timestamp later
[20:31:57] I'm guessing each wiki has its own queue that is processed (or something) within kafka? Not sure if Pchelolo might be able to help me see into this part a little more.
[20:33:35] no, each mysql shard has its own queue
[20:33:59] aaah okay, each shard, not each wiki!
[20:33:59] so, all wikis on s1 run together, all on s2, etc
[20:34:05] gotcha
[20:34:29] I'm guessing there isn't really any way to see where in the queue my job sits for commons wiki?
[20:34:46] I was using kafkacat to try and see it land in the topic at all for processing, but I couldn't see it
[20:37:27] addshore: is your job refreshLinks or refreshLinksPrioritized?
[20:38:22] How do I know that? I was assuming refreshLinks, but perhaps that's where I went wrong?
[20:38:33] since those 2 have VERY different guarantees: for the prioritized one the mean delay is a few seconds, while for the regular one it's 6 hours
[20:39:11] do you have a ticket where it's discussed more? I'll have a look
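A sketch of the check implied by page.page_links_updated, assuming a shell.php session on the target wiki: page_touched is bumped quickly by the htmlCacheUpdate job, while page_links_updated only moves once the refreshLinks job has actually completed, so comparing the two shows whether a page has been marked dirty but not yet refreshed:

    // If page_links_updated lags far behind page_touched, the links job for
    // this page is still sitting in (or missing from) the queue.
    $row = wfGetDB( DB_REPLICA )->selectRow(
        'page',
        [ 'page_touched', 'page_links_updated' ],
        [ 'page_namespace' => NS_MAIN, 'page_title' => 'PagePropsies' ],
        __METHOD__
    );
    var_dump( $row->page_touched, $row->page_links_updated );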
[20:41:48] Yes, the current investigation ticket is https://phabricator.wikimedia.org/T280627, with a parent ticket holding the older investigation at https://phabricator.wikimedia.org/T233520. I'm also part way through writing another comment, which I'll dump here for now: https://phabricator.wikimedia.org/P15787
[20:42:20] probably the last comment on T280627 and my in-progress comment are a good starting context
[20:42:20] T280627: Investigation of remaining page_props update issue [timebox: 2 days] - https://phabricator.wikimedia.org/T280627
[20:44:12] I was expecting something like `kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eqiad.mediawiki.job.refreshLinks | grep test.wikipedia.org | grep addshore` to show me my job when it was scheduled; I also tried this for the prioritized queue
[20:45:09] kafkacat -b kafka-main1001.eqiad.wmnet
[20:45:46] kafka-jumbo is an analytics cluster where jobs were replicated at some point, but ottomata is in the process of stopping the mirroring
[20:46:18] aaaah! I think I need to go update some docs somewhere ;)
[20:47:18] so, with kafka-main you should see your job in the non-prioritized, non-partitioned topic, e.g. kafkacat -b kafka-main1001.eqiad.wmnet:9092 -p 0 -t eqiad.mediawiki.job.refreshLinks
[20:47:44] if you are sending it via the wikibase WikiPageUpdater like in your paste on phab
[20:49:03] I am indeed
[20:49:52] So I'm trying for testwiki now, but still not able to grep with `addshore` to see my job :/
[20:51:23] At what point in history did these prioritized vs non-prioritized jobs happen?
[20:52:01] before today
[20:52:15] ohai Reedy ;)
[20:52:22] like, 10-15 years ago
[20:52:25] * bd808 hands addshore a trout to wave at Reedy
[20:52:32] before my time in WMF
[20:52:52] Maybe Tim has a story about how that happened
[20:52:53] ahhh right, not 2019 then :P
[20:52:53] cool
[21:00:05] Pchelolo: so I indeed don't appear to be able to see either my htmlCacheUpdate job or the links update using `kafkacat -b kafka-main1001.eqiad.wmnet:9092 -p 0 -t eqiad.mediawiki.job.htmlCacheUpdate | grep testwiki | grep "Propsies"` (where "Propsies" is in the title, so it should appear in the json)
[21:00:53] even though I already confirmed the htmlCacheUpdate is running anyway
[21:04:15] how're you posting your job? https://phabricator.wikimedia.org/P15787 ?
[21:05:56] shell.php for testwiki, and then https://phabricator.wikimedia.org/P15788
[21:07:08] I can see that the job runs in a few seconds, as page_touched ends up being updated, but I don't see it in the kafka stream.
[21:10:49] (I'm guessing you ran the code snippet? as I saw the page_touched change again)
[21:11:35] yeah
[21:11:43] this is so weird
[21:11:54] glad it is not just me :D
[21:15:01] this potentially has something to do with this overall issue; it could also be a giant red herring
[21:18:20] I'm very much not understanding what's happening
[21:21:39] my understanding of the situation right now is that if it isn't in the kafka topic, it shouldn't be getting run?
[21:22:33] yup
[21:23:25] not sure if using shell.php on a debug server might be interacting with this somehow?
[21:25:10] ok, one very weird possible thing is that MW deduplicates root jobs upon submission...
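For context on that deduplication hypothesis: a sketch of what root-job parameters look like, using the Job::newRootJobParams() helper from core (the dedup key string below is made up for illustration; whether the EventBus queue honours these on submission is exactly what was in question, and is answered just below):

    // Root jobs carry a signature and a timestamp; queue backends that support
    // deduplication drop a job whose signature matches a newer root job.
    $title = Title::newFromText( 'PagePropsies' );
    $rootParams = Job::newRootJobParams( "refreshlinks:{$title->getPrefixedDBkey()}" );
    // => [ 'rootJobSignature' => <sha1 of the key>, 'rootJobTimestamp' => <now, TS_MW> ]
    JobQueueGroup::singleton()->push( new RefreshLinksJob( $title, $rootParams ) );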
[21:25:30] and if I do kafkacat -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.htmlCacheUpdate' -o -100000 | grep test.wikipedia.org | grep Propsies, I get that some were submitted at some point
[21:26:55] I just flicked back to my kafkacat and I saw things for the job now
[21:30:07] deduplication seems possible; my assumption would have been that deduplication wouldn't happen, as there is no pending job in the queue to run?
[21:37:44] oh, actually the EventBus jobqueue doesn't do any deduplication on the MW side
[21:41:49] Pchelolo: I made some jobs with different parameters and I see them in your above command now
[21:42:18] maybe there is just too much in the kafkacat firehose for these jobs? and if we waited forever, eventually we would get to it?
[21:44:37] but still, it looks like some things end up missing
[21:45:54] but using -o I can also see some of the other previous jobs that I scheduled from my draft phab comment
[21:50:01] And with `kafkacat -C -b kafka-main1001.eqiad.wmnet -p 0 -t 'eqiad.mediawiki.job.refreshLinks' -o -1500000 | grep commons.wikimedia.org | grep Neunhoffen` I can see a whole bunch of the refresh links ones! woo!
[21:50:28] now I just need to wait for them to eventually get executed
[21:51:17] 2 hours and counting, but I guess they'll have run by my morning time
[21:53:06] kafkacat should be capable of keeping up with job traffic... however, refreshLinks and htmlCacheUpdate are quite numerous, and can be very large, so maybe it doesn't..
[21:59:36] how do we prune kafka for the job topics / how long will things hang around for?
[22:04:35] ahh indeed, kafkacat with no offset starts at the beginning of the topic, which ends up being April 28th right now, so it needs to churn through grepping all of those before it gets to what is currently happening
[22:05:09] a 10000000 offset is 2am today
[22:25:13] hm, that's new. it used to start from the end
[22:25:29] I also seem to remember that
[22:26:32] being able to look for jobs in kafka really helped. I can see that for real cases the htmlCacheUpdate is now being scheduled, but I don't find the links jobs for the cases that I'm investigating, which explains why they never run
[23:02:24] Thanks for the help. I wrote another small book of a comment with what I figured out today; closer to where the issue is, but still not nailed it: https://phabricator.wikimedia.org/T280627#7064252
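As a closing reference, a sketch of the two ways a links refresh can be queued, assuming 1.36-era core APIs: the regular and prioritized variants land in separate Kafka topics (eqiad.mediawiki.job.refreshLinks vs eqiad.mediawiki.job.refreshLinksPrioritized) with the very different delay guarantees quoted above, which is why it mattered which one Wikibase's WikiPageUpdater emits:

    $title = Title::newFromText( 'PagePropsies' );

    // Regular refreshLinks: mean delay quoted above as ~6 hours.
    JobQueueGroup::singleton()->push( new RefreshLinksJob( $title, [] ) );

    // Prioritized variant: separate topic with more concurrency; mean delay
    // of a few seconds.
    JobQueueGroup::singleton()->push( RefreshLinksJob::newPrioritized( $title, [] ) );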