[00:00:04] (which broke my Gerrit access because it tried to use v6 and failed!) [00:00:11] I haven't heard anything, and also don't have v6 to test with [00:00:32] What I'm suggesting is, maybe some strange external factor happened, like a US ISP doing a one-hour IPv6 test at 3am Eastern time [00:01:05] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2840352 (Ejegg) Ahh, ok. So my burning question is now "How was the geotarg... [00:01:16] RoanKattouw: ^ yeah neat hypothesis [00:01:36] Which suggests another line of inquiry: add GROUP BY country to that query [00:01:37] RoanKattouw: This year we updated our GeoIP database and supposedly it has no trouble w/ IPv6 now [00:01:52] To see if the drop is uniform across Big English countries or biased towards US/CA or another country [00:02:16] (Getting Adam to do that now) [00:02:20] We no longer force IPv6 users to make an ajax call to an IPv4-only geoiplookup service. It's all just looked up on Varnish now regardless [00:02:42] Or from varnish [00:02:50] Right [00:02:52] On, from, via and using [00:03:20] * AndyRussG tosses excess coffee on dying plant [00:03:58] all the countries seem to be affected equally, from eyeballing the dash x-by-y graph [00:04:20] maybe CA a little bit worse than the rest [00:04:40] i'm still a little unclear about what's expected to happen if geoip fails to return a result, do we fail to a generic campaign or show no campaign at all? [00:05:38] Jeff_Green: no, no campaign in that case [00:06:51] in the future could we at least collect the fact that they got no country code, to help with debugging? [00:07:22] The list of possible campaigns sent to clients typically includes a whole bunch of campaigns that don't apply to them, and the JS filters according to GeoIP, device, logged-in status [00:07:48] Jeff_Green: for GeoIP debugging, yes, great idea. That per se should be part of geoip infrastructure [00:08:35] We could eventually implement a small-sample-size logging call for users not selected for any campaign at all, though it's not currently in the system... [00:08:36] seems like something we'd want with banner stats [00:09:03] IIRC we used to call the equivalent of beacon/impression on every pageview, but turned it way down because performance [00:10:26] yeah [00:11:03] is the decision to load a banner or not all made client side now? [00:11:10] Jeff_Green: yep [00:11:45] so it wouldn't be too awful to slow down the small percentage who don't get geocoded [00:12:03] certainly less painful than loading a banner if they got a campaign [00:13:01] Jeff_Green: yeah. we are actually hoping to spin off geolocation into a separate component. So that could be part of a geoip-specific monitoring infrastructure [00:13:17] server just sends choices filtered on project and language [00:13:24] And we could also add some kind of more general CN monitoring for users not selected for a campaign [00:13:26] https://www.mediawiki.org/wiki/Extension:CentralNotice/Campaign_and_banner_selection [00:14:16] so another theory I was wondering about--what if somehow we cached a bad copy of the client javascript for an hour [00:14:26] like it got truncated and cached or something [00:14:34] hmmm! [00:14:40] across all DC's? [00:14:54] Wouldn't we see other effects?
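(For reference, a rough sketch of the per-country breakdown suggested above. This is not the exact query Adam was asked to run; the table, partition, and column names (wmf.webrequest, webrequest_source, dt, geocoded_data, uri_path) are assumptions based on the standard analytics Hive schema for beacon/impression requests.)

    -- Per-country, per-minute counts of beacon/impression calls around the dip window.
    -- (Schema and column names are assumed; see the note above.)
    SELECT
      geocoded_data['country_code'] AS country,
      SUBSTR(dt, 1, 16)             AS minute_utc,    -- e.g. '2016-12-01T08:12'
      COUNT(*)                      AS impression_calls
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 12 AND day = 1
      AND hour BETWEEN 7 AND 10
      AND uri_path = '/beacon/impression'
    GROUP BY geocoded_data['country_code'], SUBSTR(dt, 1, 16)
    ORDER BY country, minute_utc
    LIMIT 100000;

If the drop were a GeoIP problem, this breakdown would skew toward particular countries; a uniform drop across the Big English countries points elsewhere.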
[00:15:02] i don't know how many copies we end up caching ultimately [00:15:23] and whether those copies get copied around [00:15:29] Also mentioned before, how many clients arrive at wp with no JS cached up at all? would we get such a huge drop? [00:15:58] Unless, I suppose, there was a RL cache invalidation that made everyone load bad JS [00:16:25] That might show up if we looked for similar effects of other JS-dependent stuff, such as non-banner-history eventlogging calls [00:16:49] didn't someone say there was no comparable drop in other eventlogging? [00:17:44] ejegg i wasn't able to find anything [00:19:01] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480531567255&to=1480620351298&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging-client-side [00:19:15] ah right... Yeah for example here, no comparable dip [00:19:38] Vs this that Jeff_Green sent earlier.... https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480531567255&to=1480620351298&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_CentralNoticeBannerHistory [00:20:14] Oh hey we have tmux on stat1002 [00:20:57] Hmm, there's a /small/ dip in that first one [00:21:01] starting at 8:10 [00:21:54] ejegg: maybe that's the sum total of all client-side eventlogging, so it would include our personal dip? [00:22:13] the special money dip? [00:22:13] yeah, that makes sense--the dip looks like at most 7-10K [00:23:14] we need a sad dollar sign emoticon on that graph [00:23:26] hmm, no dip in eventlogging-geofeatures, whatever that is: [00:23:35] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480531567254&to=1480620351297&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_GeoFeatures [00:23:47] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480531567255&to=1480620351298&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_NavigationTiming [00:24:24] Eh don't worry, we'll make it up by visiting those missed donors in the door-to-door campaign [00:24:54] expiry coming up... [00:28:51] awww [00:29:45] ah, geofeatures is just analysing how people are using the geohack tool labs stuff, so it's not connected with geoIP [00:30:03] ? [00:30:42] it was the only thing I could find in grafana that had geo in the name [00:31:31] hmmm [00:32:19] Looks like even the donations queue went practically down to 0 during this outage? https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Fundraising%20eqiad&h=frqueue1001.frack.eqiad.wmnet&r=hour&z=small&jr=&js=&st=1480379199&event=hide&ts=0&v=395&m=Donations%20Queue&vl=messages&z=large [00:32:29] (looking at the 4th graph) [00:32:42] yah, that time of night it's cleared out at the end of each import run [00:33:01] Hmmm [00:40:30] ooh, definite language correlation [00:40:39] oooooooooh! :D [00:41:03] Another great reason to promote Esperanto [00:41:05] en, de, it, fr, and simple are all affected [00:41:19] hm! [00:41:34] the rest look totally unaffected [00:41:51] or have sample sizes too small to tell [00:42:05] could that be just an effect of those being the languages we're targeting greatly? [00:42:15] are we getting anything through on the affected langauges? [00:42:41] yeah, it's not 100% for anything [00:42:52] hmm [00:44:36] ejegg: any comments on the Hive create table query at the end of the 'therpad? 
[00:45:17] looking [00:46:06] maybe add 'ion' AND year=2016 AND month=12 AND day=1 AND hour>6 and hour < 10 [00:46:20] err, ignore that 'ion' [00:46:32] just meant add the hour restriction [00:49:35] ejegg: yeah I was thinking of that, but if it's gonna be fast to query, maybe sometimes we'd want to get the whole day? [00:49:47] k [00:49:55] I'm gonna add in pre-parsed fields for campaign, banner and statusCode [00:50:06] sounds good [00:50:45] anything else we should pre-parse out of the uri_query? [00:50:57] Ah no statusCode no, that'd be from beacon/impression [00:51:02] Actually, maybe I should try that [00:51:05] instead [00:51:10] of BannerLoader [00:51:17] Well, let's do both I guess [00:51:37] yeah, let's [00:52:01] What about comparing discrepancies between the country_code in webrequest vs. country url param received from beacon/impression? The latter would show what we're getting via the Varnish cookie [00:52:35] hmm, could be interesting [00:53:09] would tell us if geoip was mislocating rather than off [00:53:24] at least, for the ppl that got through [00:53:40] yea [00:53:46] k, I need to relocate. back soon [00:53:53] K thx much!!!! [01:03:43] AndyRussG: K4-713 ejegg|afk XenoRyet I'm thinking of heading out. I will be checking email. let me know if you need anything [01:04:35] * K4-713 salutes [01:30:57] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2840516 (Ejegg) Looks like it's language-specific: en, de, fr, it, and simp... [01:32:00] AndyRussG: any thoughts on the language-specificity? [01:33:22] and did you see any mismatches on the countries? [01:38:49] Wow, that got weird. [01:40:50] ejegg, AndyRussG: Good catch with the language specificity. [01:41:55] (CR) Ejegg: [C: -1] "receive_date is the time the user was on the payments box, so when the queue's backed up that query will actually give you zero. How about" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [01:42:31] Not sure what to do with that, though [01:44:03] ejegg, AndyRussG: Ah, in the interests of not running yourselves into the ground, I am going to officially advise you to reach an endpoint for tonight and pick this up in the AM. And when we get to the AM, we should strongly solicit others for help. [01:44:04] well, I'm going to do non-computer things for a bit. Maybe something will come to me [01:44:18] heh, thanks K4-713 [01:44:25] have a good evening! [01:44:35] Hey, you too. [01:44:39] Good work today. :) [01:44:46] thanks! [01:45:24] (CR) Eileen: "Hmm - I thought that because the receive_date doesn't exactly match the date in contribution_tracking it would be the time the contributio" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [01:52:46] ejegg|away: haven't dug in to either yet, creating m'own Hive table now, tho! [01:54:07] K4-713: thx! I do have to break again for domestic duties in a bit... Appreciate the suggestion... :) Mysteries are fun, tho ;p [01:55:30] AndyRussG: They sure are! [01:55:57] If you're having fun, by all means: Continue when you're done with your break. [01:56:11] Otherwise, please do feel free to get some rest. [01:56:41] In the meantime, I'll see if I can get more eyes on this ticket. We left some great clues. 
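(A hypothetical sketch of the kind of pre-parsed Hive helper table discussed above; the table AndyRussG actually created may differ. The database and table names here are placeholders, and the regexp_extract() parsing of uri_query is an assumption; extracted values stay URL-encoded.)

    -- Pre-parse campaign, banner and statusCode out of uri_query so later queries stay cheap.
    -- statusCode is only present on beacon/impression calls; BannerLoader calls carry campaign/banner.
    -- scratch_db.cn_requests_20161201 is a placeholder name.
    CREATE TABLE scratch_db.cn_requests_20161201 AS
    SELECT
      dt,
      geocoded_data['country_code']                      AS country,
      uri_path,
      regexp_extract(uri_query, 'campaign=([^&]*)', 1)   AS campaign,
      regexp_extract(uri_query, 'banner=([^&]*)', 1)     AS banner,
      regexp_extract(uri_query, 'statusCode=([^&]*)', 1) AS status_code,
      uri_query
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 12 AND day = 1
      AND (uri_path = '/beacon/impression' OR uri_query LIKE '%Special:BannerLoader%');

Restricting on day=1 keeps the whole day queryable, per the discussion above; adding ejegg's hour > 6 AND hour < 10 filter would make the build faster at the cost of dropping the rest of the day.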
[01:57:09] Somebody around this place must know something about what we're seeing here. [01:59:22] K4-713: yea hopefullly.... 8p [01:59:56] I'll try to at least make sure all most significant conclusions/leads so far are on the task :) [02:03:38] AndyRussG: That would be great. Increases the chances that someone else will put the pieces together while we're off the clock. [02:03:41] I love it when that happens. [02:04:19] Closest thing to magic I've ever seen. [02:17:24] (PS7) Eileen: Add option to early-exit dedupe jobs based on the volume of contributions being processed [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) [02:36:22] (CR) Eileen: "ejegg - have updated with your suggestion" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [05:13:17] Wikimedia-Fundraising, MediaWiki-extensions-FundraiserLandingPage: DonateWiki - Special:FundraiserRedirector - Social Media Text Scrape - https://phabricator.wikimedia.org/T152181#2840844 (Jseddon) [05:13:37] Wikimedia-Fundraising, MediaWiki-extensions-FundraiserLandingPage: DonateWiki - Special:FundraiserRedirector - Social Media Text Scrape - https://phabricator.wikimedia.org/T152181#2840856 (Jseddon) p:Triage>Unbreak! [06:14:29] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2840884 (AndyRussG) Made two partly-processed tables in Hive to be able to... [09:11:28] (CR) Hashar: "check experimental" [wikimedia/fundraising/stats] - https://gerrit.wikimedia.org/r/324774 (owner: Awight) [09:11:53] (CR) Hashar: "Did a basic CI config https://gerrit.wikimedia.org/r/#/c/324863/ which let us trigger a tox run by commenting "check experimental"." [wikimedia/fundraising/stats] - https://gerrit.wikimedia.org/r/324774 (owner: Awight) [09:12:33] (CR) Hashar: "check experimental" [wikimedia/fundraising/stats] - https://gerrit.wikimedia.org/r/324774 (owner: Awight) [09:41:43] (CR) Hashar: Add gitignore and tox support (5 comments) [wikimedia/fundraising/stats] - https://gerrit.wikimedia.org/r/324774 (owner: Awight) [10:53:22] Wikimedia-Fundraising, MediaWiki-extensions-FundraiserLandingPage: DonateWiki - Special:FundraiserRedirector - Social Media Text Scrape - https://phabricator.wikimedia.org/T152181#2841196 (Pcoombe) [10:53:24] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Thank You page to Facebook error - https://phabricator.wikimedia.org/T152026#2841199 (Pcoombe) [11:16:03] Wikimedia-Fundraising: Improve Facebook sharing process on Thank You page - https://phabricator.wikimedia.org/T152195#2841226 (Pcoombe) [11:17:59] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841242 (Pcoombe) p:Triage>High [15:05:28] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841662 (DStrine) I successfully posted a few days ago. I'm posting the expeced result below [1]. Also props to whoever setup the url for the link... 
[15:27:47] Wikimedia-Fundraising: Facebook preview looks pretty bad - https://phabricator.wikimedia.org/T152210#2841759 (TJones) [15:29:08] Wikimedia-Fundraising: Facebook preview looks pretty bad - https://phabricator.wikimedia.org/T152210#2841759 (TJones) [15:34:21] that face book preview bot is infuriating [15:34:41] a sizeable % of traffic on the web is that robot [15:35:05] feeding people the internet through face book [15:35:38] * cwd slides soap box under desk [15:36:53] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841783 (Pcoombe) @DStrine The problem is that Facebook scrapes links, and caches what it finds to show in previews. When no country is specified... [15:43:25] Wikimedia-Fundraising: Facebook preview looks pretty bad - https://phabricator.wikimedia.org/T152210#2841788 (Pcoombe) [15:43:28] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841791 (Pcoombe) [15:47:39] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841807 (Ejegg) Ah, I misunderstood the problem. So maybe we need to change the logic that shows the "can't find your country" message to take nav... [16:10:32] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841847 (Pcoombe) Okay, I've made the https://donate.wikimedia.org/?country=XX page have the same appeal content as a normal donation page, moving... [16:10:40] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841848 (Pcoombe) Open>Resolved [16:17:38] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2841854 (K4-713) I'm seeing signs that this happened again last night at th... [16:19:30] K4-713: http://i.imgur.com/hmz1rro.gif [16:19:47] cwd: :) [16:20:39] I'm sorry are facebook and internet different? TIL [16:20:50] (said that just to trigger you all, have a good weekend!) [16:21:05] X| [16:21:19] The facebook isn't letting me dial in. [16:21:42] Guess I'll go make a sandwich. [16:21:58] i think this means that AOL won, conceptually [16:22:23] Damn it. I was always a Prodigy Star supporter myself. [16:23:02] that too [16:23:11] here's your app with a little window to the internet [16:23:39] Anyway, I'm sort of relieved that we have the identical problem at the same hour as yesterday. That's huge. [16:23:51] Particularly when there's similar stuff in the server admin log... [16:23:53] yep [16:24:47] Anyway, I'm not really here yet. I just wanted to run some numbers before coffee. [16:25:28] K4-713: what about the total drop out between 10 and 11?? [16:25:44] Heh. [16:25:56] That's an interference pattern between banners and an email campaign. [16:26:07] Email is the first spike. [16:26:16] What are you looking at? [16:26:41] doh ignore me [16:26:43] on the wrong day [16:27:22] ha, okay. 
[16:29:28] there are a lot of things about the data that are confusing [16:31:40] if i go back to the 28th it turns to solid blocks [16:31:53] did we dial down the frequency of reports? [16:33:47] e.g. https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480220823103&to=1480398391189&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_CentralNoticeBannerHistory [16:35:08] I'm definitely not the one to ask about all that. [16:35:36] But. [16:36:10] The campaign officially started on the 29th. [16:36:40] Maybe we did a Zoom In and Enhance. [16:36:51] hehe [16:37:02] wfm [17:12:47] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2842030 (DStrine) adding graph for tracking {F4941895} [17:53:47] so, we've got another 0800 dip this morning [17:54:18] yup [17:54:21] weird wild stuff [17:54:32] ah, and K4 says similar stuff in the server admin logs? [17:54:34] looking [17:55:28] hmm, similar meaning 'depool XXX' ? [17:56:09] 'cause those seem to be scattered all over the logs [17:57:48] ejegg: on the new table I made we could filter on db [17:57:58] Just trying to see how to do this in ipython [17:59:35] K4-713: cwd ejegg: also, looking at one of the banner history reports cps sent, we might have a similar dip on Nov. 30, though maybe it was shorter so it didn't show up on some graphs as easily? [17:59:52] oh hey, the-wub: When is the mobile campaign going to launch? [17:59:59] Gonna afk for about 15 min [18:00:02] fr-tech: They are called computers simply because computation is [18:00:02] the only significant job that has so far been given to them. [18:00:03] -- discuss. [18:00:18] lol, slander. [18:00:37] That reminds me of Talkie Toaster. [18:01:01] K4-713: Tuesday. we may run some pre-tests on Monday (in fact we had planned to run some this week, but have been too busy) [18:04:59] the-wub: Cool, thanks. :) [18:10:27] ok, I'm going to set up a phantomjs thing to request a wiki page every 10 minutes and see what Geo cookie and campaign choices we get [18:11:55] ejegg: what do you think about dash being a "submodule" of crm in vagrant puppet? [18:12:15] sure, that makes sense [18:12:23] relies on all the crm infrastructure [18:12:31] yeah that's what i was thinking [18:12:33] also, let's not worry about the oauth stuff [18:12:41] since you can run it locally with -d [18:12:41] cool [18:12:44] yeah [18:13:59] nice to be working in vagrant actually cause i am having mad network trouble today [18:19:34] ejegg: great idea [18:31:13] fr-tech K4-713 maybe it's a daily cron job that's killing us? [18:36:21] the timing seems to indicate something like that but i can't really imagine the mechanism [18:43:04] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, FR-Adyen, Unplanned-Sprint-Work: Adyen: setup a form for Canada - https://phabricator.wikimedia.org/T152123#2842285 (DStrine) update: advancement would like to take out the city and province/territory. when that form is ready, let's discu... [19:18:56] hmm, phantomjs keeps telling me window.mediaWiki isn't defined. looking inside a page.evaluate() function inside the page.open callback. [19:19:09] Anybody familiar with phantomjs? [20:05:22] AndyRussG: have you done much phantomjs stuff? [20:05:35] ejegg: sorry, nope 8/ [20:05:41] k [20:05:56] asked for help in -dev, but nothing yet [20:07:30] ejegg: ah ohwellz.... 
sorry I don't know much... Mmm question also 4 u: quick reminder of how to run queries on pgehres sampled impressions db? [20:07:34] (pointer 2 doc) [20:08:35] https://www.mediawiki.org/wiki/Fundraising_tech/Database_schema#pgehres.bannerimpressions [20:09:23] ejegg: also, how were you pulling info on datacenter yesterday? [20:09:39] thx in advance! [20:09:56] Sorry, cafe wireless flaked out [20:12:42] answers in otherchan [20:15:40] MBeat: did you get Anna sorted out? [20:16:58] we’re close, ty eileen1 [20:18:35] ok - cool - let me know if you need help [20:22:48] Did you guys figure out the impression drop issue yet? I heard a rumor that it had happened again [20:23:15] Looking at SAL makes me wonder why ops is depooling database servers on a Friday during the fundraiser <_< [20:23:56] yah same thing again, confusing! [20:25:08] RoanKattouw: still trying to figure it out [20:25:22] we know it only affected certain languages [20:25:36] en, de, fr, it, and simple [20:26:53] I'm trying to set up a phantomjs job to load a page and dump some cn vars every 10 minutes, but i'm a n00b and can't get at window.mediaWiki: https://phabricator.wikimedia.org/P4559 [20:27:55] Oh, interesting [20:28:14] Hmm I don't know how you get to the window object in phantom [20:29:16] http://phantomjs.org/api/webpage/method/evaluate.html suggests that what you're doing should work [20:29:17] yeah, I can get at document.title just fine inside page.evaluate [20:29:53] Maybe that function runs too early, before the code that defines window.mediaWiki has run? [20:30:03] maybe the page.open callback is firing before window.mediaWiki is set up? [20:30:05] Perhaps you need to wait for document ready or something [20:30:23] yeah, let me try putting it in a different callback [20:30:35] You could hack it with http://phantomjs.org/api/webpage/method/evaluate-java-script.html as well [20:30:50] Not sure how you'd get the data out in that case though [20:30:57] so yeah finding a different callback is probably better [20:32:15] aha, yep! [20:32:36] can totally find it in the later invocations of page.onResourceReceived [20:36:23] Awesome [20:36:32] That'll be a useful diagnosis tool [20:37:02] Also tomorrow should give us a good data point as to whether this phenomenon happens when no one is messing with DB servers (or anything else to do with the site) [20:37:20] RoanKattouw: could be maybe a daily cron job? [20:37:29] Yeah but at 8am UTC? [20:37:39] I dunno.... [20:37:42] I guess we'll have more info tomorrow [20:37:59] aha, looks like exactly the same languages were affected this morning [20:38:11] ejegg: ah interesting! [20:38:22] so... what makes en, de, fr, it, and simple different from the rest? [20:38:43] Couldn't be just that that's where we have campaigns? [20:38:47] I'd think it was related to the set of banners we're running, [20:38:56] but 'simple' isn't targeted anyplace [20:39:10] Or which datacenters people reading in those languages target most? [20:39:41] the difference between languages is way starker than the difference between datacenters though [20:40:32] huh ok [20:41:37] I feel like we should be running some automatic analysis on this, like some algorithm that could troll through all these columns and find which ones have a stronger relationship to variation in impression counts [20:41:57] I guess there aren't that many options, either, tho [20:42:50] yeah, correlation finding algorithm! [20:43:02] does analytics have any fun tools for that?
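(Roughly what the phantomjs probe described above ends up looking like once the timing problem is worked around. This is a sketch, not the actual script in P4559: window.mediaWiki is usually not defined yet inside the page.open callback because ResourceLoader loads modules asynchronously, so this version polls until mediaWiki.centralNotice shows up instead of hooking page.onResourceReceived; the centralNotice property names are assumptions about the client-side API.)

    // Sketch of a phantomjs probe that loads a wiki page and dumps the Geo cookie
    // and CentralNotice choice data once the modules have arrived.
    var page = require('webpage').create();
    var url = 'https://en.wikipedia.org/wiki/Special:Random';

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('load failed');
            phantom.exit(1);
            return;
        }
        var tries = 0;
        var timer = setInterval(function () {
            // Runs inside the page context; only serializable data comes back.
            var state = page.evaluate(function () {
                if (!window.mediaWiki || !mediaWiki.centralNotice) {
                    return null; // RL modules not loaded yet
                }
                var geo = document.cookie.match(/GeoIP=([^;]*)/);
                return {
                    geoCookie: geo ? geo[1] : null,
                    // Property names below are assumptions about the CN client API.
                    choiceData: mediaWiki.centralNotice.choiceData,
                    data: mediaWiki.centralNotice.data
                };
            });
            tries += 1;
            if (state !== null || tries > 30) {
                clearInterval(timer);
                console.log(new Date().toISOString() + ' ' + JSON.stringify(state));
                phantom.exit(0);
            }
        }, 1000);
    });

Run from cron every 10 minutes, the timestamped JSON lines give a record of the Geo cookie and the campaign choices a fresh anonymous client was offered before, during and after the 08:00 window.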
[20:44:20] another argument against datacenter is that nl, es, cs, and ro are unaffected while de, it, and fr are affected [20:45:09] no, sk, sv, hu also unaffected euro languages [20:45:24] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2839139 (Catrope) >>! In T152122#2841854, @K4-713 wrote: > I'm seeing signs... [20:49:24] Interesting re languages [20:49:29] ejegg: I updated that dedupe thing to contact.created_date as you suggested [20:49:47] And that there's no apparent relationship to which banners those langs are getting [20:50:10] ejegg: Hmm, have you looked at which DB shard each language is on? [20:50:15] * RoanKattouw finds the shard map [20:51:06] https://noc.wikimedia.org/conf/s1.dblist through to https://noc.wikimedia.org/conf/s7.dblist not sure if we have anything better [20:52:01] Hmm sectionsByDB in https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php is more succinct, if you add that s3 is the "everything else" shrad [20:52:03] *shard [20:52:40] Hmm no that doesn't lead anywhere [20:53:00] it is affected but no and sv aren't, and they're all s2 [20:53:58] thanks eileen1, I'll review again [21:33:45] dang, running phantomjs on my rentaserver and I never see the response callback run [21:34:07] though I can curl the same article just fine [21:34:42] guess I can leave my puter running all night [21:39:37] err, or get me a labs thing [21:39:53] anyway, feels like food time [21:43:01] That sounds like a good idea [21:57:05] RoanKattouw: so you think the issue could be limited to a given shard? [21:57:24] AndyRussG: Not any more because there's no correlation [21:57:41] Hmmm [21:58:07] RoanKattouw: do you know who owns the GeoIP lookup infrastructure? Is that bblack and ema (Traffic)?9 [21:58:11] ? [22:06:37] (CR) Ejegg: [C: 2] Add option to early-exit dedupe jobs based on the volume of contributions being processed [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [22:11:25] (Merged) jenkins-bot: Add option to early-exit dedupe jobs based on the volume of contributions being processed [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [22:23:41] (PS2) Ejegg: Revert "Link to the correct configuration form" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324757 (owner: Awight) [22:23:47] (CR) Ejegg: [C: 2] Revert "Link to the correct configuration form" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324757 (owner: Awight) [22:27:19] (Merged) jenkins-bot: Revert "Link to the correct configuration form" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324757 (owner: Awight) [22:33:29] AndyRussG: I would guess traffic yes [22:33:36] Not totally sure [22:34:17] RoanKattouw: K cool, thx! Yeah currently picking bblack's brain over on -operations... More or less seems there's no way for servers to coordinatedly fail GeoIP lookup for cookie-setting [22:35:02] ...which would follow with what ejegg found earlier about geo- vs. non-geo-taregeted campaigns [22:35:43] yeah, and there was no correlation with the servers mentioned in the x-cache headers [22:39:13] RoanKattouw: ejegg: what about logged-in status? 
Is it possible that some cluster-side issue would make a bunch of anon users appear to be logged in from the point of view of client-side JS? [22:39:30] ooh, that'd be weird [22:39:41] this phantomjs script will pick that up [22:39:48] That could prevent a lot of users from being selected for a campaign... [22:40:13] that would cause a lot of other weird behavior too though, wouldn't it [22:40:16] ? [22:41:34] oh man, i finally figured out how to force my google fi phone to switch networks [22:42:04] it always chose this super slow one for some reason, like 100k and very flaky [22:42:35] ejegg: truly I dunno... anons get HTML cached from Varnishes, so I don't think they'd see most logged-ins' features [22:42:48] but the 3G sprint connect has fewer bars or whatever but is a pretty consistent 1-2mb [22:42:59] feels like lightning [22:43:57] cwd: cool! How do u like the Google fi? [22:44:52] well i am very ambivalent about google but they're certainly no worse than verizon [22:44:57] at least they innovate [22:45:13] the service is good and cheap [22:46:35] I think ejegg's script should capture a bunch of stuff [22:46:58] ejegg: Does choiceData also include which choice ended up being made, or only the data that is being used to make a choice? [22:47:16] I'll post the updated script - there's another property with the data being used [22:47:36] Hmm I guess since you're logging this from requests that you're sending yourself, not other people's requests, it doesn't matter [22:48:21] if there was something that made cn think it wasn't anon, that data would change [22:48:57] RoanKattouw: choice data is just the base set of possible campaigns and banners set to a client. Yeah, the other data ejegg is pulling should show all the criteria used to make a decision [22:49:51] RoanKattouw: Do we have any other existing client-side-JS status-checking instrumentation going? [22:50:57] I don't think we do [22:51:00] Hmm navtiming I guess [22:51:35] RoanKattouw: what does that tell us? [22:51:49] Whether frontend things in general went sideways [22:51:59] Specifically, it's data about how quickly pages loaded [22:52:02] RoanKattouw: K so where could we check that? [22:52:21] ejegg: RoanKattouw: Here's a data point that I think may indicate a RL issue: The dropoff seems to take about 10 minutes from when it starts to go down to when it's as low as it gets [22:52:33] Also recall choice data is cached using objectcache [22:52:46] Right [22:52:57] Yeah the gradual decrease and increase suggests RL, I agree [22:53:04] What objectcache does choice data use exactly? [22:55:00] https://grafana.wikimedia.org/dashboard/db/resourceloader and https://grafana.wikimedia.org/dashboard/db/navigation-timing are the dashboards I was referring to [22:55:43] AndyRussG: What was the time window of last night's dip? [22:55:55] I know it was 8:15 till 9:04 or something the night before last [22:59:28] Checking... 
[23:01:18] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: update civi record whit actual audit USD totals when we get them - https://phabricator.wikimedia.org/T152249#2843051 (DStrine) [23:04:15] RoanKattouw: almost identical timing: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480550399000&to=1480680000000&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_CentralNoticeBannerHistory [23:04:28] (you can't really see the gradual dropoff there, but it's visible in more detailed data) [23:06:27] Hmph that's pretty sparse data :/ [23:06:38] But yeah you can see it's about the same time [23:07:01] RoanKattouw: yeah there is better data in Hive [23:07:45] https://phabricator.wikimedia.org/T152122#2840884 [23:08:07] Partially pre-munged [23:08:31] Also faster and decent data in a db we have filled by sampling beacon/impression calls [23:08:49] here's the updated phantom script that pulls cookie, cn.data, and cn.choiceData: https://phabricator.wikimedia.org/P4559 [23:09:39] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: update civi record whit actual audit USD totals when we get them - https://phabricator.wikimedia.org/T152249#2843080 (DStrine) p:Triage>Normal [23:12:37] AndyRussG: So I was looking at https://grafana.wikimedia.org/dashboard/db/resourceloader?from=1480665600000&to=1480672800000 and the INM and 304 graphs have noticeable dips at 08:10 and 09:10 [23:12:46] Zooming out now to see if that's just an hourly pattern [23:13:38] Hmm while there are multiple dips, they are not hourly [23:13:47] Maybe they're just another symptom [23:13:51] RoanKattouw: hum!!! what is INM? [23:14:34] INM = If-None-Match, it's the % of incoming RL requests where the browser indicates that it still has the resource in cache [23:14:52] So sends us a hash of the cached version that it has, and its request is really just asking if the resource changed [23:14:54] weird [23:15:14] So INM% is a proxy for "how many clients are hitting us with hot caches" [23:15:39] huh... there's another blip around 10:50 [23:16:01] RoanKattouw: so maybe RL startup module cache stampede? [23:16:06] There are more pronounced brief drops in the 304 satisfaction ratio at the same time [23:16:23] and an even bigger thing at 16:00 [23:16:41] That is the % of INM requests to which we respond "you're fine, go ahead and use your cached version" [23:17:02] nothing in the server admin logs to correlate with either 10:50 or 16:05 [23:17:31] It's normal for 304 satisfaction to drop really low and slowly recover when a resource changes [23:17:43] (Which can happen due to on-wiki edits BTW, for e.g. user and site JS) [23:17:53] hmm [23:18:16] It's a bit odd to me that INM drops coincide with those, but I don't have as much experience interpreting this data as the perf people do [23:18:47] * AndyRussG summons Krinkle.... aww not on IRC [23:18:52] (I did help come up with some of these metrics, and of course I designed the underlying caching system in the first place, but I am no longer the RL whisperer, Krinkle understands these graphs much better than I do) [23:19:11] Ah rather he's elsewhere [23:19:18] Yeah he's not in this channel [23:20:42] RoanKattouw: I'm gonna quote u in in -dev [23:21:31] o/ [23:21:39] ejegg: Blips in which data? [23:21:40] Krinkle: [23:21:50] ejegg: Banner impressions, or the RL dashboard? 
[23:21:52] 17:12 AndyRussG: So I was looking at https://grafana.wikimedia.org/dashboard/db/resourceloader?from=1480665600000&to=1480672800000 and the INM and 304 graphs have noticeable dips at 08:10 and 09:10 [23:21:54] 17:12 Zooming out now to see if that's just an hourly pattern [23:21:56] that RL dashboard [23:21:56] 17:13 Hmm while there are multiple dips, they are not hourly [23:21:58] 17:13 Maybe they're just another symptom [23:22:00] 17:13 RoanKattouw: hum!!! what is INM? [23:22:01] Oh yeah [23:22:02] 17:14 INM = If-None-Match, it's the % of incoming RL requests where the browser indicates that it still has the resource in cache [23:22:04] 17:14 So sends us a hash of the cached version that it has, and its request is really just asking if the resource changed [23:22:06] 17:14 weird [23:22:08] 17:15 So INM% is a proxy for "how many clients are hitting us with hot caches" [23:22:10] 17:15 huh... there's another blip around 10:50 [23:22:12] 17:16 RoanKattouw: so maybe RL startup module cache stampede? [23:22:14] 17:16 There are more pronounced brief drops in the 304 satisfaction ratio at the same time [23:22:16] 17:16 and an even bigger thing at 16:00 [23:22:18] 17:16 That is the % of INM requests to which we respond "you're fine, go ahead and use your cached version" [23:22:20] 17:17 nothing in the server admin logs to correlate with either 10:50 or 16:05 [23:22:22] 17:17 It's normal for 304 satisfaction to drop really low and slowly recover when a resource changes [23:22:24] 17:17 (Which can happen due to on-wiki edits BTW, for e.g. user and site JS) [23:22:36] Krinkle: So, let me catch you up on what we're investigating [23:22:49] T152122 [23:22:49] T152122: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122 [23:23:16] https://phabricator.wikimedia.org/T152122#2839196 shows banner impressions in the early hours of Dec 1 [23:23:24] (about 36h ago) [23:24:06] There's a ~20x drop that lasted about an hour, between ~08:10 and ~09:03 UTC yesterday (Dec 1) [23:24:38] Notice how there's a gradual ramp-down/up at each end, over the course of ~3 minutes [23:24:46] Krinkle: it seems to only affect 5 languages [23:24:53] en, de, it, fr, and simple [23:24:55] A similar drop happened at about the same time today (Dec 2) [23:25:09] same language set both days [23:25:18] No gradual data for that one on the task yet; ejegg can you tell us start and end times for that one? [23:25:27] (I know it's in Hive but I'm not sure I have access to that) [23:25:39] I'll get the stats [23:25:41] RoanKattouw: Where is the rampdown/up - which graph? [23:25:52] So there have been various theories thrown around, from RL weirdness to geoip hiccups to correlations with SAL actions [23:26:02] https://phabricator.wikimedia.org/T152122#2839196 , not a graph but an SQL result [23:26:08] ejegg: What is the first graph in task description of? It shows a dip in e-mails too. Is that response/confirm, or proactive reach mails? [23:26:23] Look at 8:05-8:10, and 9:01-9:05 [23:26:51] Krinkle: That's $/hr coming in from various sources [23:27:31] As you can see there's a natural (circadian rhythm) dip around 08:00 (presumably because that's when the west coast goes to sleep) but banners went down to basically zero at that time on Dec 1 and didn't on Nov 30 [23:27:57] Krinkle: e-mail stuff should be really unrelated. 
E-mails are sent to previous donors in batches, which take them straight to the payments cluster [23:28:24] https://phabricator.wikimedia.org/T152122#2839159 shows hourly totals of banner impressions, and https://phabricator.wikimedia.org/T152122#2839196 is the same but per minute [23:28:51] So it seems like every morning between just after 8am and just after 9am UTC, we all but stop serving banners to people [23:28:59] As in a ~95% decrease [23:29:04] And then after an hour it recovers [23:29:19] We don't know why yet, theories are: [23:29:25] 1. something something ResourceLoader [23:29:31] 2. something something GeoIP [23:29:54] 3. SAL entries about depooling and repooling DB hosts, which appeared in the SAL around those times on both days, but the correlation is not perfect [23:29:57] hmm, rampdown might start just before 8:00 this morning, let me expand that window [23:30:04] 4. Maybe a cron job or something? But who runs a cron at 08:00? [23:30:40] 8.10 and 9.02 RL dips in INM are most likely due to a resource changing (either file, on wiki page, or generated content from a RL subclass, such as banner data module). I can confirm this as css/js minification cache hit is down at the same time. [23:30:52] Should be harmless. [23:30:56] One of my wilder theories from yesterday was that maybe a US ISP was doing a one-hour IPv6 test around 3am EST, but there's no evidence for that and I don't think they'd do it twice in a row like this [23:31:19] I see [23:31:25] RoanKattouw: I don't see the natural rhythm you mention of it happening daily around 8am. Which graph is that? [23:31:32] I see it once [23:31:33] OK, so that suggests that there was a resource change both at the start and the end of the window [23:31:55] which could be the choiceData... [23:31:56] Krinkle: That graph is old and was generated before the second occurrence happened [23:32:09] It's only happened twice, and one of the occurrences was today [23:32:21] So that graph shows Nov 30 and Dec 1, and the events happened on Dec 1 and Dec 2 [23:32:28] ejegg: Indeed. See also #wikimedia-perf-bots which detects when module version hashes change on nlwiki [23:32:29] though how that would change without anything showing in the CN logs is difficult to imagine [23:32:44] ejegg: Yeah if something caused choiceData to change and then change back, you would expect to see that in the RTL graph, because that's a resource change [23:32:44] ooh [23:32:47] Darn, the bot is offline since 2 days [23:32:48] Krinkle: Oooh good point [23:32:52] DAMMIT [23:33:03] just nlwiki? [23:33:18] nl was one of the unaffected languages tho [23:33:33] I was so excited that I was gonna have a use for my chan log of perf-bots [23:33:37] s/RTL/RL [23:33:45] ejegg: What's that about nlwiki? [23:34:09] Krinkle said "...detects when module version hashes change on nlwiki" [23:34:40] Oh, right [23:34:53] I guess that was because almost all modules are identical across wikis [23:34:56] CN very much isn't of course [23:35:02] yah [23:35:08] Krinkle: Would it be possible for you to restart that bot but point it at enwiki instead?
[23:35:35] If the dip happens again tonight, a log of module hash changes on enwiki will be valuable [23:36:40] One thing noted w/ bblack on -operations was that the objectcache ttl for choicedata is 1hr [23:36:47] Aha [23:37:04] So if something poisoned the choicedata cache at 8am, that would match all symptoms we've seen so far [23:37:13] It fixing itself after an hour, two drops in the RL graph [23:37:30] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2843120 (Ejegg) Dip for 8/2: | 07:57:00 AM | 21295 | | 07:58:00 AM | 21267... [23:37:39] yeah [23:37:50] Krinkle: It's a bit puzzling to me that a resource change would result in a lower INM% (I understand lower 304%). Not a big deal, just curious why that happens if you happen to know [23:37:56] Today's dip was almost perfectly aligned on the hour [23:38:08] Interesting [23:38:14] I'll write this theory on the task [23:38:31] RoanKattouw: Because we have versioned module urls [23:38:39] Oooh [23:38:41] Right [23:38:47] Our cache hit isn't a request at all [23:38:56] it's a local http cache hit in the browser, which we don't count [23:39:13] The 304 dip is for the startup module and the INM dip is for requests to the changed module [23:39:13] that's why our "chacheable url cache hit ratio" (long, 304) is nearly 0 [23:39:20] which is a good thing [23:39:26] oh wait, does that mean we can get the versions from load.php webrequests? [23:39:33] RoanKattouw: restarted the bot, and pointed at enwiki [23:39:39] ejegg: ^ yep [23:39:40] ejegg: Yes kind of [23:40:01] ejegg: But if you are thinking of your Phantom script, then you can get it more easily from the RL registry in JS [23:40:09] Maybe we should just put it in there [23:40:17] Then we don't have to rely on perflogbot [23:41:03] what phantom script? 
[23:41:48] Krinkle: ejegg wrote a phantomjs script that he's going to run every 10 mins overnight [23:41:53] https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=ext.centralNotice.bannerHistoryLogger%2CchoiceData%2Cdisplay%2CgeoIP%2CimpressionDiet%2CkvStore%2CkvStoreMaintenance%2ClargeBannerLimit%2ClegacySupport%2CstartUp%7Cext.centralauth.centralautologin%7Cext.eventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.eventLogging.subscriber%7Cext.gadget.DRN-wizard%2CReferenceTooltips%2Cchar [23:41:54] It dumps mw.cn.choiceData [23:41:55] insert%2Cextra-toolbar-buttons%2Cfeatured-articles-links%2CrefToolbar%2Cswitcher%2Cteahouse%2Cwatchlist-notice%7Cext.quicksurveys.init%2Clib%7Cext.uls.common%2Ceventlogger%2Cinit%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.supportCheck%2CtargetLoader%2Ctrack%2Cve%7Cjquery.byteLength%2CcheckboxShiftClick%2Ccookie%2CgetAttrs%2ChighlightText [23:41:57] %2CmakeCollapsible%2Cmw-jump%2Cplaceholder%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.uls.data%7Cmediawiki.Title%2CUri%2Capi%2Ccldr%2Ccookie%2Cexperiments%2CjqueryMsg%2Clanguage%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cviewport%7Cmediawiki.action.view.postEdit%7Cmediawiki.api.options%2Cuser%7Cmediawiki.language.data%2Cinit%7Cmediawiki.libs.pluralruleparser%7Cmedi [23:41:59] awiki.page.ready%2Cstartup%7Cmediawiki.template.regexp%7Cmediawiki.ui.button%2Cicon%7Cmmv.bootstrap%2Chead%7Cmmv.bootstrap.autostart%7Coojs%2Csite%7Cschema.NavigationTiming%2CQuickSurveyInitiation%2CSaveTiming%2CUniversalLanguageSelector%7Cskins.vector.js%7Cuser.defaults&skin=vector&version=18t5k42 [23:42:01] Oops sorry [23:42:09] ejegg: ^ version hash [23:42:19] ah, but that's the whole bundle [23:42:24] RoanKattouw: ejegg: Right, in that case you can also call mw.loader.getVersion(moduleName) and log that longside [23:42:26] Hmmm maybe it works in mysterious ways [23:42:26] AndyRussG: Yeah although that's a compound hash, it's a hash of all the hashes of the modules in the bunde [23:42:38] Krinkle: thanks! good call [23:42:48] Oh there's a function for that now? Nice! [23:42:59] RoanKattouw: you wrote that in 2010 ;-) [23:43:07] LOL [23:43:10] * RoanKattouw -> bash [23:43:30] It was a good year [23:44:06] Hehe, yeah [23:46:01] hehe [23:46:25] ejegg: In addition to your phab script maybe you can also request load.php?modules=ext.centralNotice.choiceData every 5 mins [23:46:40] ^ ah great idea too [23:46:56] I'm writing up this theory and our proposed course of action on phab [23:46:59] ejegg: Krinkle: RoanKattouw: at least for the first outage, we saw no changes to CN config anywhere near the beginnings or the ends of it [23:46:59] right on [23:47:20] So if something poisoned the RL cache, it'd be some error in the DB call, no? [23:47:30] AndyRussG: The banner choice data changes on its own though, right? Based on preconfigured timestamps and some other factors? [23:47:36] (as well as translations potentially) [23:47:54] That'd be in logstash [23:47:56] * Krinkle checks [23:47:57] Krinkle: hmmm true, it does change based on start and end times, lemme try to check that [23:48:09] Though presumably the cache method is written in such a way to not cache errors [23:48:23] It would cache at the RL layer very shortly only (5min, startup module) [23:48:27] translations, no that wouldnt' change stuff [23:48:31] and not be written to memcached [23:48:42] Hmmm I hope so. 
Let's see [23:48:59] whew, 12k of choiceData [23:49:10] wonder if we should be abbreviating some of those properties [23:49:30] It's a call within cache->getWithSetCallback() [23:49:36] Krinkle: The caching exptime for choiceData is 3600 and that fits perfectly with the duration of the bug [23:49:38] I dunno how that handles errors [23:50:00] Hence why we're currently theorizing that something poisons the data at 08:00, and then at 09:00 it expires and is refreshed with correct data [23:50:16] ejegg: ^ huh that's an idea! Also we should Geolocate on the server, no? [23:50:43] Yeah, now that we don't have to force the ipv4 lookup for some users, we can filter by country [23:51:08] Fundraising-Backlog: Caption on Facebook links should be something better - https://phabricator.wikimedia.org/T152250#2843126 (atgo) [23:51:09] i'll make a phab task [23:51:34] Yeah if you can cut down on that 12k that would make a number of us happy [23:52:55] ^ in the millions [23:53:18] Fundraising-Backlog, MediaWiki-extensions-CentralNotice: CentralNotice choiceData should be filtered by country server-side - https://phabricator.wikimedia.org/T152251#2843138 (Ejegg) [23:53:48] I don't see any campaigns currently configured with start or end times coinciding with the beginning of the outage. However, there was one that started at 2016-12-01 09:00 (C16_WMCH_Test01) [23:55:00] And another one that started at 2016-12-02 09:00 [23:56:16] hmm, if someone's up around at 0800 and sees the same thing, they could try saving a test campaign to invalidate that cache [23:56:51] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2843150 (Catrope) Current theory: 1. Something somewhere poisons (=writes b... [23:57:02] ejegg: Yeah I'm thinking about staying up [23:57:09] 08:00 UTC is midnight my time [23:57:32] do you have CN admin? [23:57:34] Problem is, I slept from like 2am till 7am last night so I may get tired before then [23:57:42] Probably not? I don't know how to use it anyway [23:57:49] ah, don't lose sleep over it [23:57:58] Also I'm only on day 2 of coming back from the east coast jet lag, so who knows if my body will think it's midnight or 3am [23:58:15] we could email pcoombe and ask him to check it when he wakes up [23:58:21] Night 1 was awesome (10pm - 7am), night 2 sucked (2am - 7:30am) [23:58:25] Hmm I guess [23:58:29] It'll be 8am Saturday for him [23:59:00] heh, who's got CN admin in Asia? [23:59:53] I mean the main people I'd want to be around for this are AndyRussG and either you or awight, but 08:00 UTC is 3am EST :/
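(A sketch of the extra logging suggested above: record the ResourceLoader version hash of the CentralNotice modules alongside the choice data itself, so that if the choiceData cache really is being poisoned at 08:00 and expiring at 09:00, the hash should change at both edges of the window. Assumes it runs in the page context, for instance inside the phantomjs probe's page.evaluate; the module names are from the load.php registry seen earlier, but mw.centralNotice.choiceData as used here is an assumption.)

    // Log RL version hashes next to the choice data (run in the browser / page context).
    var modules = [
        'ext.centralNotice.choiceData',
        'ext.centralNotice.display',
        'ext.centralNotice.geoIP'
    ];
    var versions = {};
    modules.forEach(function (name) {
        // mw.loader.getVersion() returns the registered version hash, or null if unknown.
        versions[name] = mw.loader.getVersion(name);
    });
    console.log(JSON.stringify({
        time: new Date().toISOString(),
        versions: versions,
        // Size of the choice data blob (the "12k of choiceData" mentioned above).
        choiceDataLength: JSON.stringify(mw.centralNotice.choiceData || []).length
    }));

A companion request to load.php?modules=ext.centralNotice.choiceData every few minutes, as Krinkle suggested, would catch the same change from outside the browser.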