[03:37:59] Krinkle: are you around? [03:38:25] not sure I got your comment about T218456 - you talk about Wikibase failure but original one is WikibaseLexeme one [03:38:25] T218456: mwselenium-quibble-docker test broken for WikibaseLexeme - https://phabricator.wikimedia.org/T218456 [09:37:41] still no gerrit. this is bad. [09:37:51] the attack looks a lot like the one we had on phab a few months back [11:23:15] duesen: Yes, that's a very similar attack [11:23:48] Basically revising existing patches like how it revised existing phab tickets [14:44:55] SMalyshev: Good point, but afaik it was the same issue. [14:45:03] And looks like you've confirmed that meanwhile. [15:05:22] addshore: I feel hopeful the revert is finally landing now. [15:05:32] :D [15:05:35] its been a long week [15:05:44] And it's only tuesday [15:05:50] hehe [16:29:05] Krinkle: now onto more failures, yay https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/40545/console [16:53:42] @Krinkle at your convenience, your thoughts on use of an anonymous class to allow hooks to modify data in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/490531 would be appreciated. It seemed useful, but I don't want to set a precedent that we regret later. [17:12:28] bpirkle: how is that better than a normal class that can be typehinted against? [17:19:09] I mean from an IDE point of view it seems all drawbacks and no benefits [17:37:49] anomie: Regarding "Unknown user" - is this a bug in the page deletion logic, or is this something to be fixed by a maintenance script, or something else? https://phabricator.wikimedia.org/T203913 [17:47:19] @tgr some pros: minimalist, self-contained, focused, utilitarian, the code stays close to the problem it solves. some cons: inconsistent across hooks, perhaps surprising use of the PHP language, less obvious how to document, and (as you say) less IDE-friendly. I started to do some more traditional that that involved more code, then had this idea and decided to give it a spin and see what people thought. [17:48:32] I also briefly toyed with the idea of a CRUD-ish interface that an anonymous class could implement, thereby giving some standardization and consistency of expectation will still retaining (what I perceived as) benefits, but that never became more than a shower thought. [17:48:52] I think "less obvious how to document" is underselling the DX problems it would cause [17:49:32] what would you put in hooks.txt? what would you put on the mw.org hook page? how are people supposed to mock it for their tests? [17:50:17] anonymous classes are cool for code that does not need much coordination, like inside unit tests [17:50:44] in an extension-facing interface, IMO it's a bad idea [17:51:27] Krinkle: The exception may have been fixed by the fact that migrateActors.php created an actor for the empty-string user, or will be once we can switch the actor migration to READ_NEW (which is currently blocked on code review for T215525). But the real fix would be to do something (maintenance script or manual DB update) to replace that empty-string actor with the "Unknown user" actor, and fill in xx_user_text to match if we're still on READ_OLD [17:51:28] by the time someone does that. [17:51:28] T215525: log_search rows with ls_field='target_author_actor' and empty ls_value are created during actor migration - https://phabricator.wikimedia.org/T215525 [17:51:42] if it does implement an interface you'd avoid most of that but you also lose the benefits (at that point why not have a class instead of an interface?) [17:52:04] anomie: (the exception is not fixed) [17:52:26] also I think we are in general skeptical about interfaces these days, they make interface changes very painful [17:52:42] method signature changes, I mean [17:52:54] anomie: OK. so no code changes required. I imagine there's probably other tasks already about fixing the corrupt db entries? [17:53:15] But I also recall a "temporary fix" from a few months ago that made empty string valid for the mean time as "unknown user". [17:53:26] I thought we merged that.. [17:55:27] Krinkle: https://gerrit.wikimedia.org/r/c/435822/ ? [17:55:40] anomie: Right, that looks familiar. [17:55:52] anomie: Should that not have made this a non-fatal problem? [17:56:16] It's basically doing the maint thing at run-time. [17:56:42] Different code path. [17:56:47] Anyhow, what's your suggested path forward to fixing this fatal. What's the easiest to do quickly and/or worthwhile investment? [17:57:00] Ah, so we might be able to apply this fix in another and/or more central place. [17:59:33] I'd be wary of screwing around with User::newFromAnyId(). The real fix is for someone to do that maintenance script (or manually do what the maintenance script would do: set rev_user_text = 'Unknown user' and revactor_actor in the corresponding revision_actor_temp row to match, and the same for other xx_user_text fields in other tables). [18:08:06] Krinkle: I copied our conversation to T203913#5037424, FYI. [18:08:07] T203913: User.php: Cannot create a user with no name, no ID, and no actor ID (Unable to delete certain pages on Chinese Wikisource) - https://phabricator.wikimedia.org/T203913 [18:08:13] Thanks [19:11:26] anomie: I closed it, but looking at Logstash now, I do see instances of that error message. [19:11:37] [XJE8gApAEDIAAI2HKwEAAAGU] /w/api.php?format=json&prop=revisions&revids=21096755&rvslots=main&rvprop=timestamp%7Cuser%7Cuserid%7Ccontentmodel%7Csize%7Ccontent%7Cids%7Ccomment&action=query MediaWiki\Revision\RevisionAccessException from line 1643 of /srv/mediawiki/php-1.33.0-wmf.21/includes/Revision/RevisionStore.php: Main slot of revision 21096755 not found in database! [19:11:51] Seems different from "Failed to load data blob from tt". [19:12:03] Should I reopen, or add to the other one regardless? [19:12:49] Krinkle: "Main slot of revision X not found in database" and "Failed to load data blob from tt:X" are two different tasks. [19:15:35] That XJE8gApAEDIAAI2HKwEAAAGU, BTW, again looks like a race or replica lag. The rev_timestamp is 20190319190120, and the date on the log entry is also 2019-03-19 19:01:20. [19:16:46] anomie: Yeah, but there's thousands of 'em. So seems like a real issue. [19:18:31] If we treat it as unknown rev, we have better ways of failing for that (just like for nonsensical ids). If we have reason to believe it exists, it should be able to find it (e.g. from master or after chronology protector). If we're doing that already but it's still missing, seems like an atomicity or transational issue that is a real bug. [19:18:49] Given it's a GET request, I'm guessing the expected outcome might be to fail sofly as unknown rev. [19:23:36] but yeah, strange that rev exists but slot doesn't. That shouldn't be possible with transactions. [19:24:05] that is probably the issue in this case. Given we already have badrevid detection. [21:06:51] Krinkle: we can fall back to reading from master, but we really want to avoid that - connecting to master in a GET request may mean a cross-DC connection in the future. [21:07:17] This may be acceptable for very rare things, but not for merely not-so-frequent things [21:07:47] duesen_: Agreed. And also, if it's a transaction issue, it might not exist on the master either. [21:08:09] that is true. and that would be disturbing. [21:08:11] Eg. something is being committed in incomplete state. [21:08:40] Krinkle: but that should be easy to check. the error message does have the revision ID, right? [21:08:54] then we can just look whether the slot exists now [21:08:55] duesen_: I don't mean now, I mean at some point in time. [21:08:59] In the past. [21:09:16] If it can be on the replica without the other half, that means by defition it was also on the master incomplete at some point. [21:09:26] Hence, querying the master would just make the problem less likely. Not fix it. [21:09:27] i don't see how that would happen even as a result of a transaction issue, really. [21:09:32] not in a GET request, anyway [21:09:42] the GET request is reading it, not writing it. [21:09:51] maybe withing the request that does a write, and then ends up reading from a different connection before the trx has been committed [21:09:52] It;s looking for a recent rev id, and it's not been fully written yet. [21:10:27] yea, sure. but how did it get that ID from? [21:10:29] which means the problem is presumably that we're writing in a way that isn't atomic. [21:10:34] hm, from a query, possibly, i guess [21:10:40] see task, there's a few thousand each week. [21:10:54] recent changes is one likely source. [21:11:02] or eventbus [21:12:10] or, given we have since 3-4 months ago a storm of very normal and quick requests taking 60s due to deployment issues with hhvm, it's quite likely things just halt arbitrarily for no reason. [21:12:22] 60s is plenty to travel the world and look for it, for the user themselves as well. [21:12:38] Anyhow, my bet is on a trx issue. something comitting too early or not using the same db. [21:13:58] Krinkle: staring at the code now... [21:14:20] If non-trivial to find or fix, might make sense to catch in some form, given it's apparently known behaviour to happen. I mean, I don't know how important it is to make work for those few super fast queries. Making them fail as today seems fine. It's just about making it not fatal and wake up ops if becoming more common as an edge case due to random user activity. [21:14:45] Krinkle: insertRevisionInternal does everything, and it's executed via doAtomicSection. No hooks are fired in the atomic section. [21:15:32] yea, well... how do we make it so we don't treat this as fatal, but still notice *actual* database corruption? [21:15:40] we can't just ignore a missing slot... [21:16:01] i mean, we could, but that seems bad [21:16:26] agreed. but I don't decide on priorisation :) If it's important to catch that, you have to plug the whole. It's not nice to deploy an alarm you ignore most of the time but others can't ignore. [21:16:37] not that you're doing that now, just speculating ahead as a principle. [21:16:59] yes, i agree. i just don't know how to find this [21:17:25] taking a look now [21:17:29] is the stack trace always the same? [21:17:35] maybe the error message is misleading [21:17:35] what was the task again? [21:17:49] ...and the actualy problem is a different one. i'll get the code that emits it [21:17:57] T212428 [21:17:57] T212428: includes/Revision/RevisionStore.php: Main slot of revision (number) not found in database! - https://phabricator.wikimedia.org/T212428 [21:21:31] https://logstash.wikimedia.org/goto/d66cede8c0229e2f46e923f2ca2bfd8a [21:21:47] 131 instances in 12 hours. various wikis and mw servers. [21:22:04] Two different code paths [21:22:18] RevisionStore.php:1643 127x, RevisionStore.php:1641: 4x [21:22:35] all on api.php [21:23:17] two lines difference? [21:23:26] actually, the call spans all three lines. [21:23:56] maybe it's a different version of the file, or a different version of php (??!!!) reporting the line number differently? [21:24:38] what i want to know is whether the offending revision is very new at the time of the error [21:25:49] I#m seeing one triggered by Closure$MovePage::move(Wikimedia\Rdbms\DatabaseMysqli, string) [21:31:02] duesen_: yeah, php 7 and hhvm reporting it slightly differently wouldn't be that weird [21:31:08] remember we have public beta on php 7.2 now [21:31:17] (in preferences) and also via WikimediaDebug. [21:32:28] [XJEapgpAMDkAAGEF-fsAAADN] enwiki 2019-03-19T16:36:54 Main slot of revision 888506410 [21:32:41] https://en.wikipedia.org/?diff=888506410 18:36, 19 March 2019 [21:32:55] UTC and +2 [21:32:58] so, yeah, it's new [21:33:50] the user's edit was not made on php7 (would've created a public change tag) [21:34:08] They did however made 10 edits to the page in 2 minutes. [21:34:17] might be relevant, don't know. [21:34:44] i'm blind, that's a span of 15 minutes, not 1.5 minute [21:36:25] but not always it seems (not always super recent) [21:36:41] in fact, here's a case of someone getting an error for an older revision ID than the previous person getting the error on the same wiki [21:36:55] 2019-03-19T19:48:34 enwiki Main slot of revision 888536299 [21:37:08] 2019-03-19T20:16:06 enwiki Main slot of revision 888540298 not found [21:37:18] crap, still blind. [21:37:37] * Krinkle shuts up [21:37:55] I haven't found a case of this happening on a revision older than a few seconds... [21:38:14] Yeah, so race condition then. Transaction problem or replication problem. [21:38:28] Afaik replication is pretty transaction safe afaik, so bet is on trx. [21:38:57] right. how do we track it down? [21:39:06] don't we log premature commits of atomic sections? [21:39:22] that's the only thing I can think of. but i don't see anything that could trigegr that either [21:39:45] * Krinkle realises Verschieben sounds funny. [21:40:34] I've spoken German almost as long as I have Dutch, but my German mindset is entirely non-digital. so I have a very physical connotation to the phrase "Verschieben". Like I'm kicking a box with pages in it over the floor. [21:40:36] to move it. [21:40:42] verschiebe verschiedene Vertriebene? [21:41:06] :) [21:41:34] I have great trouble talking about anything digital in german without half the words being english [21:41:50] same in spanish [21:41:53] even the german wikipedia translations of the mediawiki concepts sound quaint :P [21:42:11] anyway, I'm at a loss. [21:42:22] So where is the trx normally written (the atomic section anyway) [21:42:27] I'm lost in RevisionStore [21:42:36] what can we do? wait and retry? ignore and return... what? there'S nothing we can return that wouldn't be broken. [21:43:06] Krinkle: look for insertRevisionOn [21:43:17] it has a call to doAtomicSection [21:43:20] that'S the transaction [21:43:31] unless of course we are already in a transaction [21:44:10] duesen_: If we "temporarily" accept this as snafu, the reader side could catch it and return as if the rev doesn't exist. maybe log warning-level which is excluded from most alarm bells and is something you'd monitor on your team or product-level Logstash dashboard. [21:44:21] Right, okay, going down that rabbit hole now. [21:44:30] There's a lot of methods and other classes involved. [21:44:34] * Krinkle likes procedural code [21:44:53] :P [21:45:50] Krinkle: you mean, catch this in ApiQueryRevisions? [21:46:06] Somewhere between there and where its thrown, yeah. [21:46:09] or in there yeah [21:46:40] i see no good place in between [21:51:14] https://etherpad.wikimedia.org/p/j9qaVR_wF9bIWn6BWuIr [22:07:51] duesen_: Yep, I don't see how it can commit one without the other. [22:10:59] duesen_: Re that etherpad, line 24, the comment just above the selectSqlText() explains why. [22:11:46] anomie: thx [22:11:50] (I wrote that) [22:19:06] Krinkle, anomie : https://gerrit.wikimedia.org/r/c/mediawiki/core/+/497650 [22:19:16] I'm not proud of this one, but it should do the trick [22:19:21] haven't tested [22:20:06] sorry i had to move the code to a sepaerate method. i was getting vertigo from all the indent [22:22:04] getting late, my brain is shutting down