[11:22:53] godog: I have broken up the swift refactoring change into smaller PS's now starting here https://gerrit.wikimedia.org/r/c/operations/puppet/+/631155/
[11:59:41] jbond42: \o/ neat, thank you I'll take a look
[12:22:41] hnowlan: unfortunately not, it's a long-standing open wishlist bug
[12:23:26] ah, dang
[12:23:57] I usually just build several trivial variants of the package, but it's annoying
[12:24:05] is this for Envoy?
[12:25:41] I'm migrating some of the bits that restbase hosts need to buster, jvm-tools in this case
[12:25:48] ah
[12:25:54] given that we have a jdk8 component I figured I'd keep them together
[12:30:00] is it specific to Java 8 vs 11 in some way? otherwise you can simply import it to "main" as we did for jessie/stretch
[12:34:07] hnowlan: the current version Depends: on openjdk-7-jre-headless | java7-runtime-headless , while restbase* runs Java 8, I doubt this was actually used in the last few years
[12:34:41] so I think you can simply skip it...
[12:37:42] jbond42: just noticed the puppet-compiler link on https://gerrit.wikimedia.org/r/c/operations/puppet/+/631156 doesn't work without a trailing / (JFYI) i.e. https://puppet-compiler.wmflabs.org/compiler1002/25537 issues a wrong redirect whereas https://puppet-compiler.wmflabs.org/compiler1002/25537/ WAI
[12:39:22] moritzm: aha, good to know
[12:42:09] I think you can simply remove it from Puppet
[12:42:50] Enhancing Cassandra performance was the main reason why Java 8 was backported to Jessie in 2016
[12:42:58] thanks godog i have created T264184
[12:43:00] T264184: Fix puppet-compiler redirects - https://phabricator.wikimedia.org/T264184
[12:43:35] and given the dependency it seems unused since then, so just nuke it entirely from Puppet, if we ever need it again, there's also T178839
[12:43:36] T178839: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839
[12:44:03] jbond42: nice, thank you! FYI I have puppet disabled on two ms-be hosts in codfw (re: the patches in general) but that shouldn't matter
[12:44:30] yeah, I just saw it as a blocker at T264092
[12:44:30] T264092: Restbase on buster preliminary work - https://phabricator.wikimedia.org/T264092
[12:45:29] moritzm: sounds good, I'll just remove it
[12:47:02] I suppose Cassandra on buster will initially use the Java 8 forward port for Buster?
[12:48:04] we should move cassandra to profile::java at some point (T261966), that way picking 8 over 11 is simply setting a Hiera option
[12:48:05] T261966: Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java - https://phabricator.wikimedia.org/T261966
[12:49:14] yeah, it's currently using that. Given that I'm already making changes I'll look at moving
[12:49:19] jbond42: heh - just noticed that the `standard` module in puppet includes `profile::base` :)
[12:49:34] er s/that/Java 8 forward port/
[12:53:47] jbond42: what's the ~ syntax for in hiera values ?
[12:58:00] godog: that resolves to undef
[12:58:17] that's one of the new things with hiera5 :)
[12:58:26] oh neat
[13:00:01] ah thanks, TIL
[13:00:24] that's a yaml "feature"
[13:04:10] godog: yes in yaml ~ === null. i think hiera5 accepts both values and it's just my preference to use '~'.
before hiera5 though yaml null values were ignored
[13:06:32] godog: as to profile::base being in standard, that predates me :) https://github.com/wikimedia/puppet/commit/938ff24a31ee6e1b58027ad3331f739512e92b8f#diff-b6a0e9bd66f2dd17e5bba35a649d94e3
[13:07:34] jbond42: (that was me who pointed it out) that's an entertaining submitter :)
[13:08:19] oh you're both the same color in irssi, i hadn't noticed :)
[13:32:19] I'm no kormat but... color me impressed
[13:32:30] * godog takes coat
[13:33:10] 👏
[13:47:03] ahh. finally, i know who to blame for `$HOME` not behaving normally under sudo. https://gerrit.wikimedia.org/r/c/operations/puppet/+/180506
[13:47:24] * kormat glares at paravoid
[17:01:11] just to note restbase2018-a has joined the Restbase cluster - it's not pooled as a restbase node but it's fully joined to cassandra with a single node and it's the first buster restbase host to join. Nothing wacky expected as the java and cassandra setups are very similar to the old nodes, but it's the first of its kind
[17:05:06] We now have dedicated install servers in the POPs. (install3001, install4001, install5001). They are switched over in 2 steps from the bast hosts, first TFTP (happened yesterday/right now) and then DHCP. So bast[345] won't be TFTP servers anymore. One thing left is to check whether they should also have the squid webproxy on them like in the main DCs.
[17:06:16] testing OS installs are working fine with testvm[345]001
[20:57:27] I got three pages from yesterday?
[20:57:38] just now?
[20:57:45] yeah we silenced it for 24hr I think
[20:57:55] downtimed, excuse me
[20:58:34] but I disabled notifications
[20:58:39] and there are no irc alerts
[20:58:47] is this VO-related only, herron?
[20:59:21] yes the ack in VO lasts only 24h
[20:59:44] can we resolve it then?
[21:00:02] it is quite late in EU
[21:00:52] ok I’ll resolve them now
[21:01:04] thank you
[21:01:19] and https://phabricator.wikimedia.org/T259465 has more info about that 24h expiration
[21:01:23] * marostegui goes back to bed
[21:11:39] rzl: please tell me that the alarming timeseries for cdnPurge_0 is some artifact of the switchover and not something real https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?viewPanel=22&orgId=1&var-site=codfw&var-type=All&from=now-7d&to=now
[21:12:54] uhhhhh.
[21:13:03] https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?viewPanel=22&orgId=1&var-site=codfw&var-type=All&from=now-37d&to=now
[21:13:07] now-37d makes it even more interesting
[21:13:27] haha I just did the same
[21:13:32] well, I picked 40
[21:14:03] so that implicates the stuff we did at T-1 day, not the MW switchover
[21:14:13] I'm also not sure what happened with processMediaModeration, it is possibly related and possibly not
[21:14:18] either the eqiad depool or one of the a/aservices
[21:14:21] *a/a services
[21:14:46] yeah... this might just be "one of the kafka topics isn't getting consumed but the other one is"?
[21:15:09] eqiad is repooled though, so unless that had some kind of lasting effect I think we can rule that out
[21:15:24] yeah, it seems more likely to be one of the services
[21:15:28] yeah I'd believe that, it wouldn't be our first problem shaped sorta like that
[21:16:05] if we're having problems propagating purges to eqiad that would also certainly explain quiddity's thing
[21:16:23] the kafka topics are per-core-DC though, not per-edge
[21:16:32] and I believe the purgeds running on cp nodes read both topics
[21:16:33] I don't know purged super well
[21:16:36] ahh okay
[21:16:44] although, this implies the purgeds aren't reading the codfw topic?
[21:17:01] if that were true, wouldn't we be in way more trouble?
[21:17:08] yes
[21:17:18] that graph basically cannot be true; we'd be getting a ton of user complaints
[21:17:49] https://w.wiki/eYe purged stats look okay too
[21:18:04] (grafana/explore link, be logged in etc)
[21:18:40] this seems vaguely threatening https://usercontent.irccloud-cdn.com/file/t8hyeoN8/image.png
[21:18:49] snrk
[21:18:53] anyway, looking
[21:19:02] so, re: quiddity's trouble report, it actually probably isn't the generic purging infrastructure; it might be async reparses via parsoid??
[21:19:38] imagine I'm nodding slowly like I know what all the nouns mean but I haven't picked up what you're saying yet
[21:19:58] I'm just trying to remember what's special about the officewiki editing flow, and that is one of the things, IIRC
[21:20:08] although it might be VE only?
[21:20:21] *oh* yeah of course, I forgot it's officewiki we're talking about
[21:20:36] (If helpful, I'm happy to video-meet and screenshare some live editing? fwiw I was using wikitext-edits)
[21:21:02] yeah I was about to say, those edits don't have the VE tag
[21:21:33] this *can't* be an edge caching issue
[21:21:40] it's officewiki and quiddity is 100% logged in
[21:22:01] could potentially be something buggy in group0? I know we just rolled back group1
[21:22:11] no I suspect it's um
[21:23:02] https://phabricator.wikimedia.org/T250248
[21:23:12] I suspect parsercache is being slow and this is happening??
[21:23:28] (I tried a long page at testwiki, and cannot reproduce there)
[21:23:44] hm, slow parsercache would explain why it's only big pages
[21:25:19] quiddity: can you edit a long page on officewiki, then force a purge, then view source and Ctrl-F for `NewPP limit report` and pastebin the HTML comments with all the performance data? FTR it looks something like https://phabricator.wikimedia.org/P12861
[21:27:18] cdanis, https://phabricator.wikimedia.org/P12862
[21:27:47] huh okay, that's... that's not actually that bad a page
[21:27:57] a quarter of a second is totally reasonable
[21:28:34] group0 just got rolled back
[21:28:37] hmm, the problem seems to be going away at my end.
[21:28:42] aHA
[21:28:43] 👀
[21:28:51] XD
[21:29:23] I haven't looked yet but we should probably make sure a task gets filed for this, if the issue isn't known already
[21:29:53] yeah, let's chat with twentyafterfour when he has a moment -- I don't know what triggered the rollback originally, this might or might not have the same cause
[21:30:01] hey what's up?
[21:30:33] twentyafterfour: quiddity managed to produce some weird caching behavior on officewiki, which we think was caused by parsercache slowness
[21:30:38] twentyafterfour: quiddity was seeing some, uh, interesting behavior with parsercache on officewiki, until you rolled back group0
[21:30:38] I rolled back because of a massive number of errors
[21:30:45] (specifically the thing where mediawiki eventually gives up and serves a stale page)
[21:30:51] > At officewiki, in long-pages (but not short pages), when I publish an edit, I don't see the changes unless I manually purge. (hard-refresh didn't fix it). E.g. editing my userpage. But another user can see the changes immediately.
[21:30:53] > E.g. I removed the word "test" from the top of https://office.wikimedia.org/wiki/User:Quiddity_(WMF) 2 mins ago, but no amount of hard-refreshing is making it go away.
[21:31:13] hmmm
[21:31:36] well there is this: https://phabricator.wikimedia.org/T263177#6483815
[21:31:43] but I thought it only affects wikidata
[21:31:58] and it passed the tests suggested by lucas
[21:32:04] yeah, and that's a different cache as well -- logged-in users never get CDN-cached data
[21:32:35] there is something REALLY broken in wmf.11 though
[21:32:49] I've never seen so many unrelated and unusual errors - the kind that shouldn't make it past ci
[21:33:13] like static errors, file not found on includes and whatnot
[21:33:15] please don't tell me they were of the flavor "this method exists but the PHP interpreter says it doesn't"
[21:33:34] mostly wikidata type errors actually
[21:33:53] but a bunch of stuff, see https://phabricator.wikimedia.org/T264241
[21:34:01] (see the discussion on ^ )
[21:34:08] er no
[21:34:19] sorry yeah that was the right task link
[21:34:55] oh wow
[21:34:57] thanks, I hate it
[21:35:00] that looks like opcache corruption
[21:35:15] opcache contains interned copies of all string literals
[21:35:24] yeah I had to roll back group0 to get it to stop even though group0 was fine before group1 deployed
[21:35:43] presumably the second rollback cleared the cache?
[21:35:54] yeah
[21:35:59] that's a single bit-flip too, so I hate it even more
[21:36:33] I guess I can stop digging through these errors because they are mostly bogus?
[21:36:35] was it on a few hosts, many hosts, or all hosts?
[21:36:49] if it was on many hosts it's less likely to be opcache and if it was on all hosts it's almost certainly not
[21:36:51] basically everything... link to kibana incoming
[21:36:56] hm, okay
[21:36:58] https://logstash.wikimedia.org/app/kibana#/dashboard/0a9ecdc0-b6dc-11e8-9d8f-dbc23b470465?_g=h@4b3ae48&_a=h@bd9bdfd
[21:37:05] starting to think we should file a task for the parsercache thing just in case, we can always close it if it turns out to be a dupe
[21:37:10] +1
[21:37:17] looks like a few hosts mostly though
[21:37:18] actually
[21:38:07] AFAICT the LathMathML error was 100% mw2319 and 0% anything else
[21:38:13] so, that is opcache corruption
[21:38:42] ok but there are a bunch of others for mw2356 mw2261 mw2333 etc
[21:39:02] I'd say there may be multiple bugs in this one
[21:39:02] yeah, so those are probably different errors
[21:39:59] the error rate still hasn't entirely dropped off to normal even now
[21:40:10] do we need to restart php-fpm anywhere?
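For reference, opcache is per-SAPI state, which is why the conversation keeps coming back to restarting php-fpm. Below is a small diagnostic sketch (not WMF tooling, and assuming you can execute it inside the affected php-fpm pool, e.g. as a throwaway script served by it; running it from the CLI would inspect a separate, unrelated cache). The restart-php7.2-fpm runs later in the log achieve the same reset.

<?php
// Inspect the opcache of the SAPI this script runs in.
$status = opcache_get_status( false );   // false = omit the huge per-script list
if ( $status === false ) {
    exit( "opcache is not enabled in this SAPI\n" );
}

printf(
    "cached scripts: %d  hits: %d  misses: %d  wasted memory: %.1f%%\n",
    $status['opcache_statistics']['num_cached_scripts'],
    $status['opcache_statistics']['hits'],
    $status['opcache_statistics']['misses'],
    $status['memory_usage']['current_wasted_percentage']
);

// Per-host nuclear option, roughly what a php-fpm restart accomplishes
// (again, only meaningful when run inside the FPM pool):
// opcache_reset();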
[21:40:57] mw2356 sure does look unhappy
[21:41:00] yes I think mw2319 is still acting up
[21:41:11] [{exception_id}] {exception_url} TypeError from line 46 of /srv/mediawiki/php-1.36.0-wmf.10/extensions/Wikibase/client/includes/Store/DescriptionLookup.php: Argument 2 passed to Wikibase\Client\Store\DescriptionLookup::__construct() must be an instance
[21:41:16] [{exception_id}] {exception_url} TypeError from line 50 of /srv/mediawiki/php-1.36.0-wmf.10/extensions/Wikibase/lib/includes/Store/LanguageFallbackLabelDescriptionLookupFactory.php: Argument 3 passed to Wikibase\Lib\Store\LanguageFallbackLabelDescriptio
[21:41:24] those two are the overwhelming majority of the errors
[21:41:43] mw2356
[21:41:45] 9,469
[21:41:47] mw2319
[21:41:49] 1,719
[21:41:51] mw2284
[21:41:53] 264
[21:41:55] mw2326
[21:41:57] 262
[21:41:59] mw2217
[21:42:01] 127
[21:42:03] those are the hosts with the most errors
[21:42:17] is there anything on those hosts you want to save for debugging? we can depool rather than restarting
[21:42:27] up to you
[21:42:36] rzl I don't know?
[21:42:46] there are more hosts with errors, those are just the worst offenders
[21:43:24] rzl: I think let's just restart, maybe even depool, scap pull, restart php-fpm, repool
[21:43:29] let's at least restart php-fpm on the top 2 and see if the rate comes down?
[21:43:42] cdanis: you think scap is out of sync?
[21:44:08] all I did was sync-wikiversions but anything is possible
[21:44:10] I am just being paranoid but that kind of error makes me a little suspicious
[21:44:48] mm okay
[21:44:53] I guess I should start an incident report
[21:45:08] eh...
[21:45:10] IMO a task is all we need here
[21:45:28] won't argue with you if you think you'd get value from it though :)
[21:46:07] I don't think I value it but I thought it was standard procedure. I'll gladly avoid it
[21:46:31] I actually don't know what the SOP for train rollbacks is
[21:46:36] okay I'm going to depool-pull-restart-pool on mw2356, objections?
[21:46:39] but I agree that the exercise of writing one will not teach us much here
[21:46:39] should i restart php-fpm on mw2356 (and only that one)?
[21:46:44] ah, ok
[21:46:46] mutante: no please
[21:46:52] ack, stepping back
[21:46:54] well incident because it probably caused an outage on the site
[21:46:58] thanks <3
[21:46:59] or partial outage
[21:47:39] in the case of the opcache errors, a very partial one -- it's not like users are stickily routed to appservers
[21:47:59] yeah we definitely don't need to do an incident writeup for every code bug where we serve some errors
[21:48:06] ok
[21:48:09] I really do appreciate that you thought of it though
[21:48:26] going ahead on mw2356
[21:48:46] I haven't scapped much -- scap pull without sudo, right?
[21:48:50] correct
[21:49:36] done and repooled
[21:49:58] https://www.irccloud.com/pastebin/NUYSh4U6/
[21:50:01] the error rate does look better
[21:51:00] no errors on mw2356 since the repool
[21:51:18] mw2284 is now the #1 error rate
[21:51:29] followed closely by mw2326
[21:51:38] wait, I haven't touched 2319
[21:51:42] why would that be
[21:51:43] I depooled it
[21:51:49] ohh that'll do it
[21:51:49] but didn't touch anything else
[21:51:50] okay cool
[21:52:04] okay I'm just going to do the other four via cumin
[21:52:08] the errors on mw2284 do look potentially opcache-related btw
[21:52:20] rzl: skip the scap pull, I bet it's not necessary
[21:52:30] even better
[21:52:32] (I say this because cumining it will be slightly annoying)
[21:52:47] so we don't have any clue what causes this opcache corruption? just crappy code in php core?
[21:53:00] yeah I was just gonna do it as four cumin runs, but restart-php7.2-fpm automatically depools and repools so now it's just one
[21:53:03] yes, it is some mix of our particular deployment model, plus also race conditions in php-fpm
[21:53:13] there are several long bugs with speculation... somewhere
[21:53:37] cdanis: ok no need to dig them up
[21:53:38] thanks
[21:53:49] (ohhh and because cumin runs as root, now I see your point)
[21:53:53] it's a low-frequency problem, but does pop up.. I guess about once every two months?
[21:54:18] just often enough to be annoying but not often enough to force us to find and fix it
[21:54:25] usually on 1-5 servers, and always with "inexplicable" error messages: misspelled methods, methods don't exist but they definitely do, etc
[21:54:59] yeah, and we've applied some duct tape and baling wire kind of solutions to address a few of the known triggers
[21:55:04] done, and good recoveries in -ops
[21:55:07] yeah I don't think I've actually been hit by this before, somehow I got lucky on the deployment rotation schedule
[21:55:50] not sure about these warnings, looking https://www.irccloud.com/pastebin/zKCnAAlK/
[21:56:01] Opcache corruption was an epic problem with hhvm for a couple of months back in the day
[21:56:14] there is a long tail of more servers with errors: mw2284 mw2326 mw2261 mw2258 mw2217 mw2310
[21:56:31] twentyafterfour: do they have disproportionately more errors than the appserver fleet at large, though?
[21:56:33] bd808: oh yeah I remember the hhvm epics
[21:56:45] if needed we can just do a rolling restart across the fleet
[21:56:58] cdanis: this kibana dashboard should be at nearly zero errors so yeah
[21:57:04] hm
[21:57:07] anything showing up here with high frequency is unusual
[21:57:17] there are negative filters for all the known bugs
[21:57:46] and since I rolled back the train we should be at zero errors other than a couple of known errors that we don't filter due to wanting to keep an eye on them
[21:57:54] it would be I think unprecedented to see opcache corruption on this many servers
[21:58:00] not that I have any objection to a rolling php-fpm restart
[21:58:11] just that it's interesting/potentially troubling
[21:58:18] before the train we had maybe 10 errors per hour
[21:58:30] now I see 1500 in 3 minutes
[21:58:46] so we are up ~500 errors per minute
[21:59:04] most of them seem to be [{exception_id}] {exception_url} ErrorException from line 1398 of /srv/mediawiki/php-1.36.0-wmf.10/includes/parser/ParserOptions.php: PHP Warning: Invalid argument supplied for foreach()
[21:59:21] is it that same error across many servers?
[22:00:06] cdanis I can't tell, looks like each error is mostly from one server but the volume of errors is making that hard to determine
[22:00:21] like each error gets repeated a bunch of times from the same server, that is for sure
[22:00:38] sure, and that second part is what you'd expect: once the bytecode is cached that way, ...
[22:01:17] yeah it does still look like cache corruption because none of these errors would normally make it to production
[22:01:26] (and like I said, I rolled back to a version that should be clean)
[22:01:28] the top 5 errors all have the exact same frequency in the past 10 minutes, and are all ParserOptions.php
[22:03:24] there are about 20 servers returning one of the ParserOptions errors
[22:03:41] frequency ranging from 36 in 3 minutes down to 16 in 3 minutes
[22:04:40] mw2261 (the highest error rate server at present) is 100% [{exception_id}] {exception_url} Error from line 75 of /srv/mediawiki/php-1.36.0-wmf.10/api.php: Undefined class constant 'Wikimedia\ParamValidator\ParamValidator::PARAM_DEPRECATED'
[22:04:47] which is also opcache-corruption-flavored
[22:05:09] suddenly a pile of restbase alerts from icinga, not sure if related
[22:05:19] here is the list of servers with that ParserOptions error: https://phabricator.wikimedia.org/P12864
[22:05:33] was about to report it but also just recovered again, rzl
[22:05:42] so it did
[22:07:30] Error from line 75 of /srv/mediawiki/php-1.36.0-wmf.10/api.php: Undefined class constant 'Wikimedia\ParamValidator\ParamValidator::PARAM_DEPRECATED'
[22:07:38] seems like corruption also
[22:07:42] yeah, on mw2261
[22:07:54] rzl: I think we should php-fpm-restart on mw2261 + the list of servers from twentyafterfour's paste
[22:08:24] which paste -- `mw2284 mw2326 mw2261 mw2258 mw2217 mw2310`?
[22:08:47] yeah those account for the vast majority of the errors..
probably all of the anomalous errors
[22:09:00] rzl: https://phabricator.wikimedia.org/P12864
[22:09:01] sure, doing
[22:09:08] oh sorry missed it
[22:10:13] I'm going to roll these in batches of five
[22:10:24] ok
[22:10:55] cdanis will have to forgive me for constructing this command line in a text editor, instead of piping curl into awk into xargs or something
[22:11:03] curl -s https://phab.wmfusercontent.org/file/data/ecwhuzxsafj26wehbtw3/PHID-FILE-bb2l53jd2wabszifkklr/servers_with_possible_opcache_corruption | cut -d, -f1 | tail -n+2 | sed 's/$/.codfw.wmnet,/g' | tr -d '\n'
[22:11:06] btw
[22:11:08] sdhafjkasdjh
[22:12:14] ahahahah
[22:12:56] I was just feeling guilty for not pasting a simple list
[22:13:00] sudo cumin -b5 'mw[2226,2228,2229,2230,2232,2234,2239,2257,2258,2261,2270,2273,2307,2310,2311,2315,2325,2329,2331,2357,2363].codfw.wmnet' restart-php7.2-fpm
[22:13:14] twentyafterfour: don't, he needs the enrichment
[22:13:15] the csv was what kibana spit out
[22:13:35] I do appreciate a good use of cut + tail
[22:13:50] the |tr -d '\n' was icing on the cake
[22:13:53] :) I'm pretty sure you can do that entire thing in awk, and I'm pretty sure I don't care to try
[22:14:19] anyway, it's running now
[22:14:36] awk is like perl for those who REALLY hate themselves
[22:14:55] or maybe it's the other way around
[22:15:00] 🤔
[22:15:49] done
[22:16:07] hmmm, more warnings shaped like `LB lvs2009:9090 reports pool apaches_80 as disabled/up/not pooled, should be enabled/up/pooled`
[22:16:42] each host in some combination of 80, 443, or both, on the depool, repool, or both
[22:16:51] so the ParserOptions error doesn't seem to have gone away
[22:17:11] rzl: I think that's just an eventual-consistency thing
[22:17:23] yeah that's what I was suspecting
[22:17:29] I just haven't done this enough times to know if it always does that
[22:17:58] the end state seems fine, it would just be a bummer if they were still getting traffic while they restarted
[22:18:44] our story around that isn't amazing anyway, end-to-end
[22:19:26] nod
[22:19:34] it seems like there is a longer tail of these corruptions than we thought?
[22:21:24] mw2225,mw2310,mw2375,mw2240,mw2254,mw2275,mw2307,mw2255,mw2257,mw2316,mw2351,mw2355,mw2241,mw2305,mw2311,mw2325,mw2327,mw2361,mw2367,mw2271
[22:22:01] mw[2225,2310,2375,2240,2254,2275,2307,2255,2257,2316,2351,2355,2241,2305,2311,2325,2327,2361,2367,2271,]
[22:24:16] this is odd
[22:24:37] it's never been this prevalent before, and, it's also very odd to see the same error on so many different machines
[22:26:34] cdanis agreed
[22:27:40] none of them have come back after being restarted though, right?
[22:28:01] it seems not
[22:28:22] I agree it's weird and needs looking into, but also, it's getting late in the day
[22:28:23] no
[22:28:24] I wouldn't mind getting the immediate issues wrapped up by just restarting em all
[22:28:25] that's a lie
[22:28:29] oh no
[22:28:32] mw2226 was restarted in the first round, but is serving errors as of a minute ago
[22:28:47] the five ParserOptions.php errors
[22:28:47] wth
[22:28:58] 2326 was restarted in the first round, 2226 in the second round
[22:29:06] sorry, yes
[22:29:10] but either way it shouldn't still be broken
[22:29:45] so, either something is re-corrupting the opcache, or the ParserOptions.php error is something besides opcache corruption
[22:29:52] the rate didn't really change at all (as far as I can tell -- it's a few events every few minutes, so very hard to be sure)
[22:30:01] no, I think this has to be something else
[22:30:07] ok hmm
[22:30:11] a config change, a change in traffic pattern triggering a latent bug, something
[22:30:25] yeah
[22:30:32] did it start at the time of the rollout?
[22:30:42] it seems like it yeah
[22:30:43] of any rollout I guess, either forward or back
[22:30:54] because I checked error rate right before rolling out to group1
[22:30:58] and everything looked ok
[22:31:12] should I do a full scap sync just to be sure ?
[22:31:19] rzl: https://logstash.wikimedia.org/goto/252a3c9f6e17c36e894d29f2931beecc
[22:31:27] twentyafterfour: I think hold off a sec
[22:31:59] (I'm currently experiencing Dramatic Tension while I wait for kibana to load)
[22:32:06] oh huh! yeah okay
[22:32:16] that's an interesting pattern
[22:32:30] rate is almost down to zero but not quite
[22:32:44] (I mean, relative to where it was)
[22:34:28] rzl: the jobqueue (and/or other methods of deferring work async) itself makes requests to the API, right?
[22:34:41] oh hm but it's mostly *not* apiservers
[22:34:43] uh
[22:35:23] what's really weird is it's all 1.36.0-wmf.10
[22:35:41] and it was wmf.11 that I rolled out (but I assume that could happen with opcode cache corruption)
[22:35:59] still not one occurrence in wmf.11 which was live for a few minutes
[22:39:05] I'm a little curious what that full scap sync would do
[22:39:58] my thought in holding off was, some of what we're looking at *was* cache corruption, and in theory it might come back
[22:40:07] but on the other hand it's not like we can just not scap
[22:40:52] and if it did corrupt the cache in a reproducible way, that would actually be good data
[22:40:58] okay
[22:41:00] okay
[22:41:08] let's talk about this stack trace of one of the errors
[22:41:17] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2020.09.30/mediawiki/?id=AXThKyjHLNRtRo5Xttip
[22:41:43] I haven't deeply investigated the code yet, but I'm suspicious of the opcache corruption that was in the MathML library causing some bad data to have been persisted to parsercache
[22:42:11] ughhh yeah okay that's plausibel
[22:42:12] *ble
[22:42:35] cdanis: seems like a good lead
[22:42:42] oh
[22:43:08] okay, no the `onPageRenderingHash` call returning the exception from `in_array` is _also_ about `$forOptions`
[22:43:16] which is the same name from the ParserOptions.php failures
[22:44:48] is there any way to invalidate parser cache for a time window?
[22:45:01] I was just digging around for docs on that, I'm not sure
[22:45:09] mutante: if you're still lurking, do you know?
[22:46:05] re: scap.
I checked the sha1sum of /srv/mediawiki/php/cache/gitinfo/ on mw2* to check if they all have the same MW "version" and they are the same
[22:46:21] rzl: sorry, don't know about invalidating parser cache
[22:46:35] searching
[22:46:56] https://wikitech.wikimedia.org/wiki/Caching_overview has a bit of documentation about caching in general
[22:46:59] I found wgParserCacheExpireTime but presumably we don't want to blow the entire thing away
[22:47:02] expiration is 22 days
[22:47:10] yeah, no we don't
[22:47:25] at least I think that would be a very bad thing
[22:47:30] it would be a very bad thing
[22:48:41] Obsolete::Parser_cache_extension ..hrmm
[22:48:42] I found 22 also but it looks like a lie, actual expiration is 30
[22:48:43] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/wmf-config/CommonSettings.php#331
[22:49:11] okay
[22:49:15] #2 /srv/mediawiki/php-1.36.0-wmf.10/includes/parser/ParserCache.php(251): ParserOptions->optionsHash(NULL, Title)
[22:49:17] #3 /srv/mediawiki/php-1.36.0-wmf.10/includes/parser/ParserCache.php(283): ParserCache->getKey(WikiFilePage, ParserOptions, integer)
[22:49:24] so the NULL reference is introduced between those two stack frames
[22:49:27] and
[22:49:31] $optionsKey = $this->cache->get(
[22:49:33] $this->getOptionsKey( $wikiPage ), BagOStuff::READ_VERIFIED );
[22:49:35] $usedOptions = $optionsKey->mUsedOptions;
[22:49:46] is the usual code path (excerpting from ParserCache.php)
[22:49:54] so the bad value is persisted in BagOStuff
[22:50:09] damnit
[22:50:09] the bad value(s), I believe there's one per page title
[22:50:20] we're really getting a tour of the caching architecture today, aren't we
[22:50:26] well hmm there weren't that many page titles though right?
[22:50:33] I forget, rzl, is that memcache or redis?
[22:50:35] so maybe just a few bad cache entries?
[22:50:47] I thought parsercache was in mysql
[22:51:01] it's a BagOStuff lookup that is made while computing the ParserCache key
[22:51:04] BagOStuff is the generic interface that can be backed by whatever
[22:51:10] oh
[22:51:24] protected function getOptionsKey( WikiPage $wikiPage ) {
[22:51:25] return $this->cache->makeKey( 'pcache', 'idoptions', $wikiPage->getId() );
[22:51:27] }
[22:51:31] what's 'pcache' type backed by?
[22:51:42] sounds like parsercache
[22:51:53] BagOStuff can be backed by parsercache???
[22:51:53] which is mysql yeah
[22:51:57] very old but: "The parser cache is currently (2011-08-31) backed by MySQL with hot keys in MediaWiki's regular object cache (Memcached). "
[22:52:26] cdanis: I didn't think so either, checking
[22:52:41] https://doc.wikimedia.org/mediawiki-core/master/php/classBagOStuff__inherit__graph.png
[22:52:50] yes, is the answer https://gerrit.wikimedia.org/g/mediawiki/core/+/master/includes/parser/ParserCacheFactory.php#38
[22:52:59] I need an adult
[22:53:19] Krinkle: I don't suppose you're around?
[22:53:48] okay
[22:53:50] let's take a step back
[22:53:54] searching wikitech for pcache brings up a few mentions of 'mysql pcache'
[22:54:08] role(mariadb::parsercache)
[22:54:11] btw there's also a "cache epoch" there, which in principle we could bump to invalidate it
[22:54:34] cdanis: 👂
[22:55:00] I emphatically do not think we should invalidate the entire parsercache for ~600 logged errors per minute (especially when I bet a bunch of these are errors logged multiple times for the same request)
[22:55:06] cdanis: I am.
[22:55:24] ParserCacheFactory is new this week.
[22:55:27] I have no idea what it is
[22:55:29] what broke?
[22:55:31] *oh*
[22:56:11] in_array is not capable of throwing exceptions
[22:56:20] even if it is passed NULL?
[22:56:24] it might be emitting a background warning though
[22:56:25] (as the array)
[22:56:29] does not affect runtime behaviour
[22:56:38] what's the observed problem?
[22:56:41] so
[22:56:42] too much backscroll
[22:56:47] twentyafterfour rolled out .11
[22:56:48] you're right, ParserCacheFactory is brand new, I wonder if we should loop in Pchelolo (not in this channel)
[22:56:50] https://gerrit.wikimedia.org/r/q/message:parsercache
[22:57:00] we had some opcache corruption problems on a handful of servers, Krinkle
[22:57:05] after rolling back .11 we still have a long tail of strange errors
[22:57:05] and a bunch of other errors
[22:57:07] we reverted
[22:57:20] on many servers, too many to be handwaved away as opcache corruption
[22:57:28] we're seeing a bunch of ParserOptions.php-related errors
[22:57:34] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/630261
[22:57:34] around `$forOptions`
[22:57:46] ErrorException from line 1411 of /srv/mediawiki/php-1.36.0-wmf.10/includes/parser/ParserOptions.php: PHP Warning: array_intersect(): Argument #2 is not an array
[22:58:26] `$forOptions` is fetched from BagOStuff, AFAICT
[22:58:48] so my guess was that we persisted some bad data there on some subset of pages
[23:00:39] is there a task?
[23:00:45] is there an operational or end-user observed issue?
[23:00:56] e.g. cache miss, latency, http errors
[23:01:20] probably the refactor for ParserCacheFactory just made a typo
[23:01:30] I'd roll back since we don't know how it is behaving in that case
[23:01:46] re: opcache corruption, so it doesn't happen anymore/everywhere?
[23:02:04] it almost never happened everywhere, Krinkle
[23:02:16] in fact I don't think it ever happened on what you could say was "everywhere"
[23:02:22] if it did, we'd be able to reproduce it
[23:02:32] Krinkle: I rolled back right away
[23:02:35] error persisted
[23:02:53] it re-triggered post-rollback on the old version of similar code?
[23:03:17] I have a wmf10 trace from above: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2020.09.30/mediawiki?id=AXThKyjHLNRtRo5Xttip
[23:03:19] is there a wmf11 one?
[23:03:22] no
[23:03:26] only wmf10 AFAIK
[23:04:07] but it started post-rollout, so we think the wmf11 code poisoned the cache in a way not related to opcache corruption.
[23:04:09] so some bug in wmf.11 or an opcode corruption caused something that rollback didn't resolve
[23:04:15] yes
[23:04:19] interesting
[23:04:34] confirming: any messages containing ParserOptions.php were only logged on wmf.10
[23:04:38] do we have evidence of opcache corruption e.g. an error on wmf11 code that has no sane explanation?
[23:04:43] we have multiple yes
[23:04:44] though I still haven't done that full scap to make sure that all the code is in sync, should I do that?
[23:04:59] Krinkle: https://phabricator.wikimedia.org/T264241
[23:05:09] Krinkle: yes bunches of opcode corruption evidence
[23:05:21] there's another backport window starting in -ops, do we want them to wait?
[23:05:23] but we've restarted the servers and a couple of errors persist
[23:05:33] probably should wait yes
[23:05:39] twentyafterfour: I think "sha1sum /srv/mediawiki/php/cache/gitinfo/* | sha1sum" being the same on all mw* should show it is the same code
[23:05:50] (it is)
[23:05:57] oh my, I haven't seen the string mutation corruption in months.
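To make the "bad data persisted on some subset of pages" hypothesis concrete, here is a simplified PHP 7.2 reconstruction of the suspected failure mode. It is not the actual MediaWiki code; the variable names and the shape of the cached entry are assumptions based on the snippets quoted above, but the PHP warning semantics shown are what the logged messages mean.

<?php
// ParserCache::getKey() loads a 'pcache:idoptions:<pageid>' entry from a
// BagOStuff and reads its ->mUsedOptions; if that entry is missing or was
// persisted in a bad shape, the used-options list ends up null.

$optionsKey  = false;   // BagOStuff::get() returns false on a miss
$usedOptions = is_object( $optionsKey ) ? $optionsKey->mUsedOptions : null;

// "PHP Warning: Invalid argument supplied for foreach()": the body never runs.
foreach ( $usedOptions as $option ) {
    echo $option, "\n";
}

// "in_array() expects parameter 2 to be array, null given": returns null (falsy).
$usesMath = in_array( 'math', $usedOptions );

// "array_intersect(): Argument #2 is not an array": returns null.
$relevant = array_intersect( [ 'math', 'userlang' ], $usedOptions );

// None of these are fatal on PHP 7.2. The request carries on and simply computes
// a parser cache key as if no options were recorded, so lookups can miss (or
// collide) where they should hit, which matches "hit rate slightly less than it
// should be" rather than user-visible failures.
var_dump( $usesMath, $relevant );   // NULL, NULL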
[23:05:58] "LathMathML"
[23:06:00] that's horrible
[23:06:04] yes, single bit error
[23:06:37] the main one we've been plagued by is php rewriting $a->foo(); $b->bar() as $b->foo() and (hopefully) causing an exception.
[23:07:27] better not place any $cache->delete() near $dbw->select() or you'll be truncating a production table
[23:07:32] so back to something you were saying earlier Krinkle -- do these error messages correspond to user-visible issues?
[23:07:42] ErrorException from line 1411 of /srv/mediawiki/php-1.36.0-wmf.10/includes/parser/ParserOptions.php: PHP Warning: Invalid argument supplied for foreach()
[23:07:46] ErrorException from line 116 of /srv/mediawiki/php-1.36.0-wmf.10/extensions/Math/src/MathHooks.php: PHP Warning: in_array() expects parameter 2 to be array, null given
[23:07:55] I need to bail soon and eat something, I can come back if -- yes, exactly that question
[23:08:08] This by itself just means the foreach() acted as if an empty array was passed, no iteration. and similarly in_array() will return false.
[23:08:19] which could be innocent, or could have huge side effects
[23:08:27] so it happens while computing the parsercache key
[23:08:36] does that mean that essentially parsercache lookup fails and we reparse the page?
[23:08:53] no, that's too rational.
[23:09:04] we *don't* seem to bail on the request entirely, because we hit one of these per request (you can tell because there's exactly the same number of them over any given time window, which won't happen by chance)
[23:09:22] this is the line that changed and is related to getting the key, afaict:
[23:09:26] $key = $this->cache->makeKey( $this->name, 'idhash', "{$pageid}-{$renderkey}!{$hash}" );
[23:09:36] yeah, it won't fail. it'll just proceed in a way no one has ever tested or thought about.
[23:09:47] looking at the code now, there's a lot of layers
[23:09:53] yes :)
[23:10:06] RoanKattouw: https://wm-bot.wmflabs.org/logs/%23wikimedia-sre/20200930.txt
[23:10:24] haha I was going to link to that, but with the warning that we're a couple hours deep at this point
[23:10:29] Krinkle: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/630261/5/includes/parser/ParserCache.php#123
[23:10:46] I joined only 10min ago, so maybe start with that :)
[23:11:15] mutante: yeah, the patch is insignificant. no meaningful change there. no mistake either from what I can tell.
[23:11:41] *nod*, ok
[23:11:48] cdanis: was MathML the only corruption we know of?
[23:12:00] I can't find the foreach() null error
[23:12:23] no, there were many
[23:12:44] [{exception_id}] {exception_url} TypeError from line 46 of /srv/mediawiki/php-1.36.0-wmf.10/extensions/Wikibase/client/includes/Store/DescriptionLookup.php: Argument 2 passed to Wikibase\Client\Store\DescriptionLookup::__construct() must be an instance
[23:12:47] [{exception_id}] {exception_url} TypeError from line 50 of /srv/mediawiki/php-1.36.0-wmf.10/extensions/Wikibase/lib/includes/Store/LanguageFallbackLabelDescriptionLookupFactory.php: Argument 3 passed to Wikibase\Lib\Store\LanguageFallbackLabelDescriptio
[23:12:59] those are wmf.10
[23:13:00] interesting
[23:13:14] the LathMathML error occurred on *both* wmf.10 and wmf.11
[23:13:16] ...
what
[23:13:35] okay sorry I need to afk -- I'll check back when I have some food in me, feel free to text if you need me sooner
[23:13:54] rzl thanks for everything
[23:14:09] get food, I'm sure we can manage
[23:14:21] Krinkle: https://logstash.wikimedia.org/goto/cd166c93e590c5b0b15e3310a0bcd935
[23:14:24] https://etherpad.wikimedia.org/p/opcache-%F0%9F%92%A9-2020-10-01
[23:14:29] so, I am not super worried about the fate of production at this point
[23:15:12] the lingering issues aren't that high-rate -- they're affecting something like 30 requests / minute
[23:15:33] er, more like ~50, but still.
[23:16:01] also, they definitely don't cause a fatal error, and probably just mean our parsercache hit rate is slightly less than what it should be
[23:16:25] I am digging in kibana trying to see if I can make a list of the affected urls
[23:16:29] I am going to contribute some notes and then stand down, I think
[23:17:47] If we think any pages are affected, we'll just lazy-purge all of the affected time range. no worries
[23:18:03] Krinkle: I didn't know that was possible, ok
[23:18:23] there's a built-in mechanism to do that without much impact. in wmf-config. (ignores CDN, but will lazy-ignore cache if in the range at runtime)
[23:19:45] most seem to be /w/api.php
[23:21:14] RoanKattouw: do you know if `in_array( 'math', $forOptions )` could normally return true any way in prod? Or is this condition always false?
[23:21:58] I also need to break to eat but I'll be pingable
[23:25:15] the "co-incident with …" entries: are those from the same exact reqIds, i.e. the same request triggers them in a row?
[23:25:30] Krinkle: yeah
[23:25:42] could use a few more permalinks to some of those :)
[23:25:42] I see 4 of them for the same request
[23:26:15] krinkle just look at the new errors dashboard it's full of them at a rate of 1000 per 5 minutes
[23:29:32] ack
[23:29:57] https://logstash.wikimedia.org/goto/f7ee2d35993ca9c28dc3524f3c341915
[23:30:03] linked in the etherpad
[23:35:11] thx
[23:37:25] ok, so yeah, I don't think it's worth going into details, but it seems prudent to invalidate the cache entries from wmf.11 going out and being rolled back.
[23:38:27] looking at a few other errors now
[23:42:23] I'm not seeing how/where the cause persisted
[23:43:38] I don't know either
[23:44:09] I'm not sure this is a corruption after all
[23:44:10] https://phabricator.wikimedia.org/T263851
[23:44:24] CacheTime was changed in ways that can only be classified as a high-risk change.
[23:44:33] incl two reverts in the last week
[23:44:42] ygh
[23:44:48] ugh
[23:45:05] anyway, I'll purge first
[23:45:23] do you have timestamps for when wmf.11 went out and went back?
[23:45:58] hang on let me see
[23:46:38] 20:55:57
[23:47:06] 21:03:17
[23:47:25] I think that's the timeframe
[23:47:35] OK
[23:49:15] wmf.11 was later rolled back on group 0 as well, that happened at 21:23:25
[23:49:25] not sure if it could do harm on group0, probably not?
[23:49:39] I mean, we do have lots of docs on mediawiki.org
[23:49:50] would be confusing to have testwikidata e.g. not work correctly for a month
[23:50:05] and still spew errors 24/7
[23:50:12] yeah ... the time it went out to group0 was earlier though and didn't trigger any errors
[23:50:29] would 19:00 - 21:05 cover it?
[23:50:45] yeah
[23:51:01] 21:25
[23:59:32] twentyafterfour: interesting, filing a subtask using the permalink for that, does not produce an event on the parent task?
[23:59:32] https://phabricator.wikimedia.org/T263177
[23:59:49] it doesn't say I added T264257 to it
[23:59:50] T264257: Fix ParserOutput corruption wmf.10 -> wmf.11 - https://phabricator.wikimedia.org/T264257
[23:59:50] weird
[23:59:57] regression I think?
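For completeness, a sketch of the kind of lazy, time-windowed parser cache invalidation Krinkle describes above ("lazy-purge all of the affected time range"). This assumes the built-in mechanism meant is MediaWiki's RejectParserCacheValue hook; the hook exists in core, but treating it as the exact wmf-config pattern used for this incident is an assumption, and the timestamps are illustrative values taken from the window discussed in the log.

<?php
// Hypothetical snippet for wmf-config (e.g. CommonSettings.php). Cached entries
// saved inside the bad window are rejected on read and the page is reparsed on
// demand; nothing is deleted up front, so there is no mass-purge load spike.
$wgHooks['RejectParserCacheValue'][] = function ( $parserOutput, $wikiPage, $parserOptions ) {
    // Illustrative window (UTC, TS_MW format): roughly wmf.11 reaching group1
    // through the final group0 rollback on 2020-09-30.
    $badStart = '20200930190000';
    $badEnd   = '20200930212500';

    $cachedAt = $parserOutput->getCacheTime();   // TS_MW timestamp of when the entry was saved
    if ( $cachedAt >= $badStart && $cachedAt <= $badEnd ) {
        return false;   // treat as a cache miss; reparse lazily
    }
    return true;        // keep using the cached entry
};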