[16:15:11] godog: Regarding https://phabricator.wikimedia.org/T228292#5344679 - perhaps the root cause isn't recently introduced, but it affecting real users uploading images is new. perhaps some build-up of traffic is causing a threshold to be reached, or long-term validation finally expiring, or something else unrelated. What can we do to fix it?
[16:15:54] I'm also not very familiar with Swift, but I do wonder why it is doing something with Swift in codfw directly from MW during upload.
[16:16:00] Do we write to both instead of replicate?
[16:19:24] Krinkle: (meeting) will take a look later but yes, we write to both
[16:21:38] Is there a fallback replication as well, or does it mean MW has to deny unless both succeed, so it's effectively down whenever either of them has an (intermittent) issue or random network issue? And MW then also cleans up data from the one that succeeded when the other one failed?
[16:21:46] Probably good reasons for it, just sounds a bit odd to me :)
[16:53:00] Krinkle: yeah a bunch of reasons, yes iirc mw will write to both synchronously and fail if one is unavailable. at the time we expanded swift in codfw we decided to keep the two clusters unaware of each other at the swift level. there's also container replication that at the time IIRC wasn't working very well, though things might have changed now, so yes mediawiki writes to both instead
[16:54:13] to answer your question on what to do, my suggestion would be to understand why mediawiki would try talking to swift with an expired token, and when that happens I'm guessing it should try again to acquire a valid token, possibly in the same upload?
[16:54:22] I gotta run now, might be back later
[17:59:58] AaronSchulz, I asked in releng about this --> From https://integration.wikimedia.org/ci/job/parsoidsvc-parsertests-docker/5815/console .. "Wikimedia\Rdbms\DBUnexpectedError from line 113 of /workspace/src/includes/libs/rdbms/database/DBConnRef.php: Database selection is disallowed to enable reuse."
[18:00:32] greg-g tells me we are the only ones affected by this that he has heard about ... and thought you might know what is going on.
[18:01:02] heard about *so far*
[18:01:46] subbu: A.aron is currently OOO until Aug 7
[18:03:04] subbu: To change the DB connection, go through LB or LBF; doing this directly on the DB object is disallowed as it can cause internal state to become invalid when it comes to re-using connection objects. This caused prod issues several times.
[18:03:35] Krinkle, this is nothing we are doing in parsoid .. this is just running mediawiki parser tests.
[18:04:04] check out mediawiki, run parser tests with parsoid's copy of the test file.
[18:04:15] I don't know in that case. My best guess would be to compare the code to how we normally run them and see what's different
[18:04:17] so, not sure why it is complaining now.
[18:04:27] Given the core build is passing.
[18:04:37] it broke about 30 mins back.
[18:04:48] or somewhere in that range.
[18:04:53] subbu: OK. that's very likely https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/501984/
[18:05:02] which imposes these restrictions on more ways of getting connections
[18:05:14] So as to avoid these errors from slipping into prod over and over again.
[18:05:26] So it's likely a pre-existing bug from the separate parser test runner that is now exposed.
[18:05:57] I can revert that for a few days, but that alone won't keep this from breaking again; it'll require an update to the parser test runner
[18:06:31] but, why is this parsoid specific ... are you saying the jenkins job has a private version that is a fork of the core one?
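The DBConnRef error quoted at [17:59:58] ("Database selection is disallowed to enable reuse") and the advice at [18:03:04] amount to: don't switch databases on a shared connection handle; ask the load balancer (via LBFactory) for a connection already bound to the domain you want. A minimal sketch of the two patterns, assuming the MediaWiki 1.33-era Rdbms API; the 'otherwiki' domain ID and the example query are illustrative only:

```php
use MediaWiki\MediaWikiServices;

// Problematic pattern: wfGetDB() hands back a reusable DBConnRef, and switching
// its database would corrupt state for other callers, so DBConnRef throws
// DBUnexpectedError ("Database selection is disallowed to enable reuse").
// $dbr = wfGetDB( DB_REPLICA );
// $dbr->selectDomain( 'otherwiki' ); // throws

// Suggested pattern: go through the LBFactory / load balancer and request a
// connection that is already scoped to the target domain.
$lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
$lb = $lbFactory->getMainLB( 'otherwiki' );
$dbr = $lb->getConnectionRef( DB_REPLICA, [], 'otherwiki' );
$row = $dbr->selectRow( 'page', '*', [ 'page_id' => 1 ], __METHOD__ );
```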
[18:06:59] i don't need you to revert necessarily if this is going to come back.
[18:07:21] i am trying to understand where the problem is.
[18:07:41] subbu: is this running the parser tests via PHPUnit like we normally do?
[18:07:48] or is it using a separate endpoint? this is the standalone runner, right?
[18:07:51] i think so. as far as i know.
[18:08:02] let me check out the integration repo and see what the definition of that task is.
[18:08:10] That's a very rarely used code path (as in, basically never in CI, except for Parsoid I guess), so likely got out of sync.
[18:08:24] It's indirectly using wfGetDB and doing something with it that is probably harmless in CI, but unacceptable in prod code.
[18:08:27] cscott, ^ see discussion here if you remember anything about the jenkins jobs we use.
[18:08:31] Which we now detect in rdbms and thus throw for.
[18:08:36] I'll revert the check to give it a few days.
[18:08:44] ok. thanks.
[18:09:14] so, i guess we'll have to investigate the jenkins job definition and see why the test run is different from core.
[18:09:41] subbu: Yeah, my guess is that it isn't so much the Jenkins job but rather the parser test runner.
[18:09:49] ok
[18:09:49] Core has two endpoints to run parser tests.
[18:10:10] one that is quick for standalone use, and one that is slower, requires an installed MW db, and uses PHPUnit.
[18:10:39] probably something worth revisiting given the recent progress on making PHPUnit quick
[18:14:00] ok. thanks Krinkle for the temporary revert then, since this looks like it will need some investigation.
[18:17:03] subbu|lunch: can you create a task for tracking? Then I can tag the re-submission as blocked on that.
[18:27:39] will do
[19:14:01] Krinkle: FYI, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/517696 made generateLocalAutoload.php start picking up commented-out class definitions from my LocalSettings.php. I wonder whether making it basically a grep hack is going to cause us other problems in the future.
[19:19:31] anomie: Aye, that's quite possibly the only scenario that would typically trip it. Having such constructs in LocalSettings is quite reasonable indeed, and unlike untracked files in includes/ or in extensions/, it's also unlikely to be useful as a reminder before committing
[19:19:47] I suppose we could exclude $IP/* more generally. I'm actually not sure why it's included
[19:19:57] AutoloadGenerator::initMediaWikiDefault
[19:20:10] seems like we wouldn't want classes to exist in any of those
[19:21:02] Krinkle, T228928
[19:21:03] T228928: Fix parser tests runner to use wfGetDB correctly - https://phabricator.wikimedia.org/T228928
[22:05:59] there's lots of interesting traffic in the new DeferredUpdates log
[22:08:26] I might file tasks for some of it
[22:09:34] * Krinkle notes down to make it log the exception object as 'exception' so that existing filters and queries for 'exception.trace' and 'exception.class' etc. work as expected
[22:09:46] should this log be made more visible, by adding it to some logstash dashboards?
[22:10:20] Yeah, I'm also not sure why it's a separate channel.
[22:10:27] it's just all exceptions thrown by DeferredUpdates
[22:10:30] For any other high-level catches we always log to 'exception' directly.
[22:10:40] we could make it do that
[22:10:40] Yeah
[22:10:50] Krinkle: re deploying Extension:Theme on WMF: nobody proposed that explicitly. But merging it into core implies that it's deployed, right?...
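The concern raised at [19:14:01] is that a scan which matches class declarations textually, rather than only considering code that would actually execute, also sees definitions sitting inside comment blocks. A toy illustration of that failure mode only, not the actual AutoloadGenerator code; the LocalSettings fragment and the regex are made up for the example:

```php
// Toy illustration; AutoloadGenerator's real scanner is more involved than this.
$localSettings = <<<'PHP'
<?php
/*
class MyOldDebugHelper { // commented out long ago, never executed
}
*/
$wgShowExceptionDetails = true;
PHP;

// A purely textual ("grep"-style) pass still matches the class name inside
// the comment block, so it would end up in the generated autoloader.
preg_match_all( '/^\s*(?:abstract\s+|final\s+)?class\s+(\w+)/m', $localSettings, $m );
var_dump( $m[1] ); // array containing "MyOldDebugHelper"
```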
[22:11:09] TimStarling: The information about which update it came from is quite useful though, so that might be a reason to keep it separate.
[22:11:13] For Jobs we do both
[22:11:21] Which might be better, or not.
[22:12:45] lots of RevisionStore exceptions which duesen_ might want to review, e.g. "Deferred update LinksDeletionUpdate failed: The given Title does not belong to page ID 50675301 but actually belongs to 50675294"
[22:14:19] e.g. MWExceptionHandler::rollbackMasterChangesAndLog or MWExceptionHandler::logException
[22:14:51] trying to find where the job code is that does this
[22:15:48] Yeah, it just logs "Failed executing job: {job_type}" to channel:JobExecutor level:ERROR. the trace is only in 'exception'
[22:16:32] I always assumed they were in there already because I've definitely had to debug post-send exceptions before.
[22:16:49] TimStarling: we have a ticket open about this one. apparently caused by a race condition while moving pages. i think it's sitting in the clinic duty backlog now. not easy to investigate
[22:19:11] ah no, I was thinking of a different issue, https://phabricator.wikimedia.org/T205675
[22:19:16] We had a similar one before, https://phabricator.wikimedia.org/T200072
[22:19:45] It's very likely related to page rename, or undeletion...
[22:19:57] TimStarling: best file a ticket. I'm going to bed
[22:22:03] from the past 7 days, the only 'exception' channel messages with restInPeace and doUpdates are: 60s timeouts, 200s timeouts, 'JobQueueEventBus.php: Could not enqueue ' and 'Database.php: Cannot execute query from … while transaction'
[22:22:31] looking at the 20 days before that, though, there are various "normal" exceptions from deferred updates as well. so maybe it's just a recent regression that we lost them
[22:22:40] i gotta go as well.
[22:22:54] Hope that's useful TimStarling :) - will follow up tomorrow, or CC me on a task if there is one.
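The note at [22:09:34] and the "For Jobs we do both" point at [22:11:13]/[22:15:48] suggest a catch block that keeps a channel-specific entry with the exception object under the 'exception' context key (so existing Logstash filters on exception.class / exception.trace keep working) and also routes through the generic handler. A hedged sketch under those assumptions; the $updates loop and the 'deferred_type' field are illustrative, while LoggerFactory and MWExceptionHandler::logException are the entry points named in the log:

```php
use MediaWiki\Logger\LoggerFactory;

foreach ( $updates as $update ) {
	try {
		$update->doUpdate();
	} catch ( Exception $e ) {
		// Channel-specific entry keeps the update class for per-update filtering;
		// passing the exception object as 'exception' lets existing Logstash
		// queries on exception.class / exception.trace work as expected.
		LoggerFactory::getInstance( 'DeferredUpdates' )->error(
			'Deferred update {deferred_type} failed',
			[ 'deferred_type' => get_class( $update ), 'exception' => $e ]
		);
		// "Do both", like the JobExecutor path: also route through the generic
		// handler so the error shows up in the 'exception' channel.
		MWExceptionHandler::logException( $e );
	}
}
```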