[00:00:23] ejegg: Also please edit my phab comment to add a link to your phantom script [00:00:34] And remember to actually run it on a timer before you stop working today :) [00:00:51] Perhaps logging to a place where we can read the output [00:01:14] Well, I'm in CST now (New Orleans) and heading to a show at 10... So being awake at 2am is probable, not sure if I'll trust myself with CN admin though! [00:01:33] RoanKattouw: Ahh, I meant to look into a labs instance [00:01:39] I see there's a 'monitoring' project [00:02:54] Your homedir on stat1003 could work too [00:03:06] But yeah labs is good [00:03:10] Probably better [00:03:28] RoanKattouw: ejegg: Krinkle: storm of message blob cache misses around 8:02 yesterday.... https://goo.gl/QSgBjd [00:03:43] (no idea if that could be relevant) [00:04:23] Hmm those are blob misses for a ton of modules [00:04:32] Maybe an objectcache stampede of some sort, causing somehow corrupted data? [00:05:57] Also at 8:02 today https://goo.gl/0Cz6Su [00:08:12] Though actually those aren't significant peaks it seems, if u zoom out [00:09:02] https://goo.gl/iGrYfP [00:10:02] ejegg: so ur script is all set to run? [00:11:00] ahh, no phantomjs up on stat servers [00:11:13] I'll just leave my compy on and post the results in the morning [00:11:44] relocating [00:11:50] ejegg: thx!!!! [00:15:54] RoanKattouw: AndyRussG: getWithSetCallback() in WANObjectCache does not cache false return valus [00:16:07] AndyRussG: However, it seems ChoiceDataProvider currently ignores errors [00:16:18] return values of select() are ignored, wrapper array is created unconditionally [00:16:19] Krinkle: how so? [00:16:43] aaaarg [00:16:50] $stuff = []; $rows = $db->select( .. ); ; foreach( $rows); if !stuff; return stuff; [00:17:10] should have a if $rows == false [00:17:15] and if so, return false too [00:17:31] Krinkle: do we get a log anywhere else when a select fails? [00:17:59] Actually, a DBQueryError is thrown [00:18:24] Which would probably bubble all the way to ResourceLoader [00:18:34] Mmmhmmm [00:18:41] (if the query fails) [00:19:10] Hmmm gonna try locally [00:23:53] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2843217 (Ejegg) PhantomJS script: P4559 [00:26:40] Krinkle: on my local setup if I trigger a DBQueryError in ChoiceDataProvider (just adding som giberish in one of the query join conditions) I don't get an error in the apache error.log [00:27:28] Just some error messages in the module content [00:28:20] And the error also shows in the JS console [00:34:52] In my local setup that also stops JS (for example, search box autocomplete stops working). On prod, we saw evidence that JS was working fine throughout the banner gap (for example, other EL events didn't have a comparable dip) [00:34:57] ejegg: RoanKattouw ^ [00:36:34] oho! [00:37:26] Hmm interesting [00:38:03] I mean, the fact that DBQueryErrors can break JS completely sounds like a bug, but it wouldn't explain this incident as you said [00:39:40] yep [00:40:04] I don't know really what this logstash channel shows, but at least I can say, nothing to see here: https://goo.gl/ESZxCr [00:40:20] (searching on load.php because RL) [00:41:53] Ha [00:42:05] Yeah just some lock timeout query errors for ULS RL modules [00:43:51] We should also log the raw string content of the choiceData module [00:43:56] ejegg: OK, so re step #2 in my phab comment, did you write a thing that dumps choiceData from load.php? [00:43:59] Ha, jinux [00:44:02] *jinx [00:44:22] 8p [00:44:30] yep! I'm just using curl for that bit [00:44:39] cool! [00:44:44] *coorl! [00:45:08] Oh and I see P4559 already has the verison hash [00:45:11] I've got it set up via cron on 2 machines, and I'll try to VPN one of them through the SF office [00:45:37] though this house has slightly shaky wifi [00:46:02] I tried running the script on one of the CI servers since I knew they had phantomjs [00:46:12] but it never got anything back from the request [00:46:15] Hmm [00:46:32] not sure if it's a version issue - those are at 1.9 while i'm using 2.1 locally [00:46:33] I wonder how much effort it would be to install phantomjs on a labs intsance [00:46:47] my VPS was also balking at it, also on 1.9 [00:47:05] Yeah IIRC 2.x was a major breaking change [00:48:18] I can also try on a VPS [00:48:41] At least only the wifi is shakey, not the house [00:49:04] hmm, cron's having trouble with phantomjs too. wonder if i'm missing some essential env vars [00:50:44] let's try DISPLAY [00:53:10] yep, that did it [00:53:14] :) [00:53:56] shoot, what's up with VPN? [00:56:57] Hmm for some reason I can't get to fluorine to see dberror.log (https://wikitech.wikimedia.org/wiki/Logs) [00:57:13] Searching for dberror in logstash, dunno if that's getting the same [01:01:33] phooey, no VPN on wifi or tethered via phone. That could be really inconvenient... [01:02:20] RoanKattouw: As soon as we mentioned potential for db exception bubbling up, I did some thinking.. [01:02:31] Hmm now I can reach fluorine [01:02:37] RoanKattouw: So, as you know, we catch errors from module content generation, and still output the other modules [01:02:46] Yeah [01:02:51] RoanKattouw: But, I broke this. [01:02:55] RoanKattouw: Not exactly [01:02:59] but effectively [01:03:12] getScript throwing is caught in a module response [01:04:12] However, if getScript throws and the module uses getScript in getVersionHash (via enableModuleContentVersion or manually in getDefinitionSummary), then it is not caught by startupmodule [01:04:17] so startupmodule's getScript will throw [01:04:22] which means, no startup module [01:04:27] just a comment about its exception [01:04:33] literally nothing else [01:04:36] no executable code [01:04:42] Which is expected, but bad. [01:04:56] https://gerrit.wikimedia.org/r/#/c/325071/1 [01:04:58] Working on a patch now [01:05:06] this may not be the case here for fundraising, but just doing it anyway [01:06:15] RoanKattouw: See https://gerrit.wikimedia.org/r/#/c/325071/1 for details, [01:06:58] So, if our RL module doesn't explicitly implement a version hash, it'll use getScript [01:10:00] andyrussg@fluorine:~$ gunzip -c /a/mw-log/archive/DBQuery.log-20161201.gz | grep -i choicedata [01:10:03] ejegg: No, that's not the default. [01:10:04] ^ no results [01:10:05] ejegg: https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/master/includes/CNChoiceDataResourceLoaderModule.php#L168 [01:10:24] CN data module getModifiedHash is used - which calls getChoices - which may throw [01:10:25] ah right [01:10:27] this is an RL issue [01:10:34] it shoudl be caught there, don't worry [01:10:52] getChoices is effectively getScript almost [01:11:26] AndyRussG: Good, okay, so it wasn't a db error [01:11:33] Hmmmm [01:12:10] Krinkle: so, we needn't be working ASAP on a patch to handle DB errors on those selects? [01:13:58] Krinkle: Ouch [01:14:25] No startup module is very bad [01:15:40] ^ So it could be instead an error in the generation of some other module? [01:16:00] How would the phantomjs script log the presence of a startup module? [01:17:29] Also, for the CN issue, how might that be reflected in the pattern of no-banner-served that we see? [01:17:48] I guess it might fit, since choicedata varies on lang [01:18:57] Or rather in general RL varies on lang [01:19:21] So I guess it could have been an issue for certain langs at certain DCs [01:20:13] Krinkle: looking now at resourceloader.log on fluorine [01:20:30] (dunno exactly what that'd log) [01:21:36] I'm going to run. I'll be checking emails. [01:23:33] AndyRussG: Hard to detect. Probably by either looking in the response for /modules=startup/ for an exception, or by scripting a RLQ.push function and noticing that it is never called [01:23:55] AndyRussG: Which btw, is most likely how you'll want to execute scripts in phantom given that mw is loaded asynchronously. [01:24:14] RLQ.push() will run immediately if mw is loaded, and otherwise queued for once async startup arrives [01:24:53] evalScriptAsync(function (callback) { RLQ.push(function () { .... callback(..); }); }); [01:24:57] or something like that [01:24:58] Krinkle: ejegg: ^ anyway a RL error in some other module doesn't explain our hour-long outage [01:25:04] RoanKattouw: ^ [01:25:19] hmm [01:25:26] Indeed. If a random other module throw an exception once, and not again that would cause upto 5min of issues. [01:27:11] Mmmm right tho it could throw repeatedly for an hour [01:29:07] Either way, it'd show up in exception.log [01:30:41] Hmmm [01:30:59] Does the objectcache ever get corrupted? Does that log somewhere? [01:31:59] I don't know of any objectcache exception, but if it does throw somehow, it'd bubble to RL and be in exception.og [01:32:17] if not, then it would return flase, bypass cahce and call fetchChoices right away from the callback [01:32:22] graceful fallback [01:32:33] also check meidawiki-errors for non-exceptions [01:33:16] $ gunzip -c /a/mw-log/archive/memcached.log-20161201.gz | grep CentralNoticeChoiceData also gives naught [01:34:31] https://logstash.wikimedia.org/goto/3086b68df3c1c4ea4e742d1fff1fbfa7 [01:34:43] mediawiki-error 08-09 AM url:"load.php" [01:34:48] 08-10 AM [01:35:56] https://logstash.wikimedia.org/goto/4a049e04aff7c52e0aa231fc3e5a56f2 - without DBReplication warnings [01:36:33] https://logstash.wikimedia.org/goto/f06926e432504d73bcaa259a6fd94af3 - without wikitech/silver [01:36:40] leaves 4 database connect exceptions [01:37:21] for 4 random page views [01:37:39] Hmm I don't see it... [01:38:19] scroll down to event list [01:39:47] 2 for load.php on itwiki and plwiki containing centralnotice among others (would've returned the error in a comment, other modules unharmed, and cached for 5 minutes instead of 30 days) [01:39:55] and 2 for for load.php that doesn't contain centralnotice [01:40:00] Don't see stuff at 8:09 or 8:10 [01:40:04] indeed [01:40:11] these were about 20-40 minutes earlier [01:40:46] Hmmm [01:42:15] So... that means...? Maybe another connection error when generating choicedata that we're not detecting because not logged? [01:44:41] No, I don't think we have any reason to believe a query problem happened [01:52:55] mmhmm [01:53:05] Gotta afk, back in about 15m! [01:53:17] should I dump all the other module versions too? [02:30:11] ejegg: Can't hurt too much I suppose [02:30:28] ejegg: But maybe dump choiceData separately [02:42:47] aaaarg much longer to return here, back soon I hope [02:53:38] yep, I'm doing a separate curl request for the startup module and parsing the versions out of that [02:56:44] OK cool [02:56:59] Are you also doing a separate curl for the contents of the choiceData module? [02:57:49] OK I got phantomjs to work on my AWS instance BTW [02:57:54] In case you hadn't gotten it working already [02:59:33] RoanKattouw: yeah, also doing that module request separately [02:59:43] Cool [02:59:46] in case anything makes phantomjs die before I can parse it [03:00:14] I'm running those same things on my AWS instance, every 5 mins starting at :59 just now [03:00:17] or rather, before it defines that stuff in mediaWiki.centralNotice [03:00:23] For redundancy in case your thing dies or whatever [03:00:27] thanks! [03:01:17] * AndyRUssG bets on no outages [03:01:22] A watch outage never boils? [03:01:39] got it running on 2 machines here, one on the 0s & 5s (vpned thru SF) and one on the 3s & 8s that geolocates to New Orleans [03:01:45] Nice [03:01:50] I have one in us-east-1 (Virginia) [03:01:59] So probably like across the street from eqiad or something [03:02:07] Hmmm [03:02:08] cool, so we should at least hit eqiad and ulsfo [03:02:26] Maybe codfw too with your NOLA one [03:02:36] Mine isn't grabbing the entire startup module though [03:02:50] ejegg: RoanKattouw: maybe also run some of these in languages that weren't affected? [03:03:01] Hmm good idea [03:03:03] AndyRussG: ah, could do [03:03:05] What's the set of langs we should run? [03:03:07] es was unaffected [03:03:10] (but that are getting campaigns) [03:03:23] and is targeted by a campaign [03:03:28] also cs [03:03:34] mmm [03:04:19] hehe, that Fla-Vor-Ice article is getting an unearned popularity spike [03:04:39] hope analytics filters on the phantomjs useragent [03:05:06] And curl [03:05:13] Hmm I guess I'm only using curl for load.php so that's OK [03:07:04] OK, so en and es, maybe cs [03:07:08] Any other ones that /were/ affected? [03:07:19] Maybe we should do 2 affected and 2 unaffected ones? [03:08:05] affected: de, en, fr, it, simple [03:08:13] OK [03:08:19] So I'll hit en, de, es, cs [03:09:04] Probably it makes sense to try digging into the data some more [03:09:17] hmm, that script should take a language arg [03:09:25] and hit Special:Random [03:09:40] Comparing yesterday's and today's outages might yield some info [03:10:14] If it were just some random infrastructure problem giving out corrupt data, why would it have hit the same languages both days? [03:11:07] Unless that's just coincidence, or just because of what we're targeting somehow... [03:11:44] thanks for all the help RoanKattouw ! [03:11:54] I'm going to step away from the screens for a while [03:15:10] Alright I've got it all running now [03:15:44] Capturing your phantom script and also the choiceData module and the startup module for en, es, de and cs at the 4s and 9s from Virginia [03:16:36] I'm gonna head home now [03:16:56] We'll see what comes out of this tomorrow [03:48:56] RoanKattouw: ejegg: what about a VPS that we force to use IPv6? [03:52:44] AndyRussG: ooh great idea [03:53:05] When I get home I'll check if my EC2 instance uses v6 [04:02:05] (PS1) Cdentinger: Restrict financial_txn rows [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/325085 (https://phabricator.wikimedia.org/T152238) [04:52:54] RoanKattouw: I think I have that option at DigitalOcean [04:56:53] AndyRussG: I'll check if mine is IPv6 but I can also give you the command I'm running [04:58:02] while true; do for lang in en es de cs; do curl https://$lang.wikipedia.org/w/load.php?modules=ext.centralNotice.choiceData -o choiceData-$lang-$(date +%H%M); QT_QPA_PLATFORM=offscreen phantomjs ../phantom-test.js $lang > phantom-$lang-$(date +%H%M); done; sleep 300; done; [04:58:11] Where ../phantom-test.js is https://phabricator.wikimedia.org/paste/edit/4559/ [04:59:56] Hmph no I'm using v4 [05:00:16] I can run curl with -6 though [05:00:42] AndyRussG: But maybe you can just run the same command as me on your IPv6-only VPS [05:01:03] An easy way to check which one is being used is curl 'https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&format=json' [05:01:27] Which, ironically I have IPv6 at my house and not on my VPS [05:17:13] RoanKattouw: Nice script and neat. I guess the way you beat the async problem is by re-checking whenever a resource is received. [05:17:22] Hm? [05:17:37] Oh you mean the timestamping? [05:17:44] RoanKattouw: the phantom script [05:17:48] Not mine [05:17:57] That's all ejegg|away [05:18:16] ah, okay [05:18:32] Oh I forgot to add my modification that substitutes args[1] for 'en' [05:18:55] AndyRussG: --^^ you will need a small modification to phantom-test.js if you are indeed going to run it [05:21:08] I would've probably done something like evaluate() -> RLQ.push -> mw.loader.using -> callPhantom(data) [05:21:30] which can time out or be idle in case of errors [06:04:06] Fundraising-Backlog, Analytics, MediaWiki-extensions-CentralNotice, Performance-Team, Spike: Spike: CentralNotice: Investigate new ways to implement cross-domain banner close-button functionality, to end the Special:HideBanners cookie storm - https://phabricator.wikimedia.org/T117433#1774073 (... [06:38:28] Fundraising-Backlog, Analytics, MediaWiki-extensions-CentralNotice, Performance-Team, Spike: Spike: CentralNotice: Investigate new ways to implement cross-domain banner close-button functionality, to end the Special:HideBanners cookie storm - https://phabricator.wikimedia.org/T117433#1774073 (... [06:53:49] Fundraising-Backlog, Analytics, MediaWiki-extensions-CentralNotice, Performance-Team, Spike: Spike: Investigate alternatives to Special:HideBanners cookie storm for cross-domain banner close-button - https://phabricator.wikimedia.org/T117433#2843761 (Krinkle) [06:54:06] (now the message doesn't get cut off) [07:19:20] Krinkle: know how to force calls from a VPS to WMF prod to go thru ipv6 for this script? [08:17:36] No dropoff so far.... [08:18:02] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480507894109&to=1480753064257&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_CentralNoticeBannerHistory [08:52:53] watched pot, huh? [14:25:53] nope no outage! /me celebrates? [15:59:38] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2844144 (Ejegg) A watched pot never boils. No banner dip observed December... [16:13:32] (CR) Ejegg: [C: 2] Restrict financial_txn rows [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/325085 (https://phabricator.wikimedia.org/T152238) (owner: Cdentinger) [16:17:48] (Merged) jenkins-bot: Restrict financial_txn rows [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/325085 (https://phabricator.wikimedia.org/T152238) (owner: Cdentinger) [18:00:02] fr-tech: The Great Movie Posters: [18:00:03] A mis-spawned murderous abomination from the nether reaches of an [18:00:03] unimaginable hell. [18:00:03] -- The Killer of Castle Brood (1967) [18:00:03] NEW -- SICKENING HORROR to make your STOMACH TURN and FLESH CRAWL! [18:00:03] -- Frankenstein's Bloody Terror (1968) [18:00:03] LUST-MAD MEN AND LAWLESS WOMEN IN A VICIOUS AND SENSUOUS ORGY OF [18:00:04] SLAUGHTER! [18:00:04] -- Five Bloody Graves (1969) [18:00:05] The family that slays together stays together. [18:00:05] -- Bloody Mama (1970) [18:00:06] -- discuss. [23:28:30] Fundraising-Analysis, Fundraising-Backlog: Make Github wmf-fr repository public - https://phabricator.wikimedia.org/T151983#2834119 (Peachey88) It should probably be transitioned over to our Gerrit or Diffusion repos, I see no reason for this to be on GitHub in the first placeā€¦