[00:00:04] (which broke my Gerrit access because it tried to use v6 and failed!) [00:00:11] I haven't heard anything, and also don't have v6 to test with [00:00:32] What I'm suggesting is, maybe some strange external factor happened, like a US ISP doing a one-hour IPv6 test at 3am Eastern time [00:01:05] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2840352 (Ejegg) Ahh, ok. So my burning question is now "How was the geotarg... [00:01:16] RoanKattouw: ^ yeah neat hypothesis [00:01:36] Which suggests another line of inquiry: add GROUP BY country to that query [00:01:37] RoanKattouw: This year we updated our GeoIP database and supposedly it has no trouble w/ IPv6 now [00:01:52] To see if the drop is uniform across Big English countries or biased towards US/CA or another country [00:02:16] (Getting Adam to do that now) [00:02:20] We no longer force IPv6 users to make an ajax call to an IPv4-only geoiplookup service. It's all just looked up on Varnish now regardless [00:02:42] Or from varnish [00:02:50] Right [00:02:52] On, from, via and using [00:03:20] * AndyRussG tosses excess coffee on dying plant [00:03:58] all the countries seem to be affected equally, from eyeballing the dash x-by-y graph [00:04:20] maybe CA a little bit worse than the rest [00:04:40] i'm still a little unclear about what's expected to happen if geoip fails to return a result, do we fail to a generic campaign or show no campaign at all? [00:05:38] Jeff_Green: no, no campaign in that case [00:06:51] in the future could we at least collect the fact that they got no country code, to help with debugging? [00:07:22] The list of possible campaigns sent to clients typically includes a whole bunch of campaigns that don't apply to them, and the JS filters according to GeoIP, device, logged-in status [00:07:48] Jeff_Green: for GeoIP debugging, yes, great idea. That per se should be part of geoip infrastructure [00:08:35] We could eventually implement a small-sample-size logging call for users not selected for any campaign at all, though it's not currently in the system... [00:08:36] seems like something we'd want with banner stats [00:09:03] IIRC we used to call the equivalent of beacon/impression on every pageview, but turned it way down because performance [00:10:26] yeah [00:11:03] is the decision to load a banner or not all made client side now? [00:11:10] Jeff_Green: yep [00:11:45] so it wouldn't be too awful to slow down the small percentage who don't get geocoded [00:12:03] certainly less painful than loading a banner if they got a campaign [00:13:01] Jeff_Green: yeah. we are actually hoping to spin off geolocation into a separate component. So that could be part of a geoip-specific monitoring infrastructure [00:13:17] server just sends choices filtered on project and language [00:13:24] And we could also add some kind of more general CN monitoring for users not selected for a campaign [00:13:26] https://www.mediawiki.org/wiki/Extension:CentralNotice/Campaign_and_banner_selection [00:14:16] so another theory I was wondering about--what if somehow we cached a bad copy of the client javascript for an hour [00:14:26] like it got truncated and cached or something [00:14:34] hmmm! [00:14:40] across all DC's? [00:14:54] Wouldn't we see other effects?
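(For reference, a rough sketch of the per-country breakdown suggested above. This is not the exact query Adam was asked to run; the table, partition, and column names (wmf.webrequest, webrequest_source, dt, geocoded_data, uri_path) are assumptions based on the standard analytics Hive schema for beacon/impression requests.)

    -- Per-country, per-minute counts of beacon/impression calls around the dip window.
    -- (Schema and column names are assumed; see the note above.)
    SELECT
      geocoded_data['country_code'] AS country,
      SUBSTR(dt, 1, 16)             AS minute_utc,    -- e.g. '2016-12-01T08:12'
      COUNT(*)                      AS impression_calls
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 12 AND day = 1
      AND hour BETWEEN 7 AND 10
      AND uri_path = '/beacon/impression'
    GROUP BY geocoded_data['country_code'], SUBSTR(dt, 1, 16)
    ORDER BY country, minute_utc
    LIMIT 100000;

If the drop were a GeoIP problem, this breakdown would skew toward particular countries; a uniform drop across the Big English countries points elsewhere.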
[00:15:02] i don't know how many copies we end up caching ultimately [00:15:23] and whether those copies get copied around [00:15:29] Also mentioned before, how many clients arrive at wp with no JS cached up at all? would we get such a huge drop? [00:15:58] Unless, I suppose, there was a RL cache invalidation that made everyone load bad JS [00:16:25] That might show up if we looked for similar effects of other JS-dependent stuff, such as non-banner-history eventlogging calls [00:16:49] didn't someone say there was no comparable drop in other eventlogging? [00:17:44] ejegg i wasn't able to find anything [00:19:01] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480531567255&to=1480620351298&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging-client-side [00:19:15] ah right... Yeah for example here, no comparable dip [00:19:38] Vs this that Jeff_Green sent earlier.... https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480531567255&to=1480620351298&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_CentralNoticeBannerHistory [00:20:14] Oh hey we have tmux on stat1002 [00:20:57] Hmm, there's a /small/ dip in that first one [00:21:01] starting at 8:10 [00:21:54] ejegg: maybe that's the sum total of all client-side eventlogging, so it would include our personal dip? [00:22:13] the special money dip? [00:22:13] yeah, that makes sense--the dip looks like at most 7-10K [00:23:14] we need a sad dollar sign emoticon on that graph [00:23:26] hmm, no dip in eventlogging-geofeatures, whatever that is: [00:23:35] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480531567254&to=1480620351297&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_GeoFeatures [00:23:47] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480531567255&to=1480620351298&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_NavigationTiming [00:24:24] Eh don't worry, we'll make it up by visiting those missed donors in the door-to-door campaign [00:24:54] expiry coming up... [00:28:51] awww [00:29:45] ah, geofeatures is just analysing how people are using the geohack tool labs stuff, so it's not connected with geoIP [00:30:03] ? [00:30:42] it was the only thing I could find in grafana that had geo in the name [00:31:31] hmmm [00:32:19] Looks like even the donations queue went practically down to 0 during this outage? https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Fundraising%20eqiad&h=frqueue1001.frack.eqiad.wmnet&r=hour&z=small&jr=&js=&st=1480379199&event=hide&ts=0&v=395&m=Donations%20Queue&vl=messages&z=large [00:32:29] (looking at the 4th graph) [00:32:42] yah, that time of night it's cleared out at the end of each import run [00:33:01] Hmmm [00:40:30] ooh, definite language correlation [00:40:39] oooooooooh! :D [00:41:03] Another great reason to promote Esperanto [00:41:05] en, de, it, fr, and simple are all affected [00:41:19] hm! [00:41:34] the rest look totally unaffected [00:41:51] or have sample sizes too small to tell [00:42:05] could that be just an effect of those being the languages we're targeting greatly? [00:42:15] are we getting anything through on the affected langauges? [00:42:41] yeah, it's not 100% for anything [00:42:52] hmm [00:44:36] ejegg: any comments on the Hive create table query at the end of the 'therpad? 
[00:45:17] looking [00:46:06] maybe add 'ion' AND year=2016 AND month=12 AND day=1 AND hour>6 and hour < 10 [00:46:20] err, ignore that 'ion' [00:46:32] just meant add the hour restriction [00:49:35] ejegg: yeah I was thinking of that, but if it's gonna be fast to query, maybe sometimes we'd want to get the whole day? [00:49:47] k [00:49:55] I'm gonna add in pre-parsed fields for campaign, banner and statusCode [00:50:06] sounds good [00:50:45] anything else we should pre-parse out of the uri_query? [00:50:57] Ah no statusCode no, that'd be from beacon/impression [00:51:02] Actually, maybe I should try that [00:51:05] instead [00:51:10] of BannerLoader [00:51:17] Well, let's do both I guess [00:51:37] yeah, let's [00:52:01] What about comparing discrepancies between the country_code in webrequest vs. country url param received from beacon/impression? The latter would show what we're getting via the Varnish cookie [00:52:35] hmm, could be interesting [00:53:09] would tell us if geoip was mislocating rather than off [00:53:24] at least, for the ppl that got through [00:53:40] yea [00:53:46] k, I need to relocate. back soon [00:53:53] K thx much!!!! [01:03:43] AndyRussG: K4-713 ejegg|afk XenoRyet I'm thinking of heading out. I will be checking email. let me know if you need anything [01:04:35] * K4-713 salutes [01:30:57] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2840516 (Ejegg) Looks like it's language-specific: en, de, fr, it, and simp... [01:32:00] AndyRussG: any thoughts on the language-specificity? [01:33:22] and did you see any mismatches on the countries? [01:38:49] Wow, that got weird. [01:40:50] ejegg, AndyRussG: Good catch with the language specificity. [01:41:55] (CR) Ejegg: [C: -1] "receive_date is the time the user was on the payments box, so when the queue's backed up that query will actually give you zero. How about" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [01:42:31] Not sure what to do with that, though [01:44:03] ejegg, AndyRussG: Ah, in the interests of not running yourselves into the ground, I am going to officially advise you to reach an endpoint for tonight and pick this up in the AM. And when we get to the AM, we should strongly solicit others for help. [01:44:04] well, I'm going to do non-computer things for a bit. Maybe something will come to me [01:44:18] heh, thanks K4-713 [01:44:25] have a good evening! [01:44:35] Hey, you too. [01:44:39] Good work today. :) [01:44:46] thanks! [01:45:24] (CR) Eileen: "Hmm - I thought that because the receive_date doesn't exactly match the date in contribution_tracking it would be the time the contributio" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [01:52:46] ejegg|away: haven't dug in to either yet, creating m'own Hive table now, tho! [01:54:07] K4-713: thx! I do have to break again for domestic duties in a bit... Appreciate the suggestion... :) Mysteries are fun, tho ;p [01:55:30] AndyRussG: They sure are! [01:55:57] If you're having fun, by all means: Continue when you're done with your break. [01:56:11] Otherwise, please do feel free to get some rest. [01:56:41] In the meantime, I'll see if I can get more eyes on this ticket. We left some great clues. 
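(A hypothetical sketch of the kind of pre-parsed Hive helper table discussed above; the table AndyRussG actually created may differ. The database and table names here are placeholders, and the regexp_extract() parsing of uri_query is an assumption; extracted values stay URL-encoded.)

    -- Pre-parse campaign, banner and statusCode out of uri_query so later queries stay cheap.
    -- statusCode is only present on beacon/impression calls; BannerLoader calls carry campaign/banner.
    -- scratch_db.cn_requests_20161201 is a placeholder name.
    CREATE TABLE scratch_db.cn_requests_20161201 AS
    SELECT
      dt,
      geocoded_data['country_code']                      AS country,
      uri_path,
      regexp_extract(uri_query, 'campaign=([^&]*)', 1)   AS campaign,
      regexp_extract(uri_query, 'banner=([^&]*)', 1)     AS banner,
      regexp_extract(uri_query, 'statusCode=([^&]*)', 1) AS status_code,
      uri_query
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2016 AND month = 12 AND day = 1
      AND (uri_path = '/beacon/impression' OR uri_query LIKE '%Special:BannerLoader%');

Restricting on day=1 keeps the whole day queryable, per the discussion above; adding ejegg's hour > 6 AND hour < 10 filter would make the build faster at the cost of dropping the rest of the day.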
[01:57:09] Somebody around this place must know something about what we're seeing here. [01:59:22] K4-713: yea hopefullly.... 8p [01:59:56] I'll try to at least make sure all most significant conclusions/leads so far are on the task :) [02:03:38] AndyRussG: That would be great. Increases the chances that someone else will put the pieces together while we're off the clock. [02:03:41] I love it when that happens. [02:04:19] Closest thing to magic I've ever seen. [02:17:24] (PS7) Eileen: Add option to early-exit dedupe jobs based on the volume of contributions being processed [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) [02:36:22] (CR) Eileen: "ejegg - have updated with your suggestion" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [05:13:17] Wikimedia-Fundraising, MediaWiki-extensions-FundraiserLandingPage: DonateWiki - Special:FundraiserRedirector - Social Media Text Scrape - https://phabricator.wikimedia.org/T152181#2840844 (Jseddon) [05:13:37] Wikimedia-Fundraising, MediaWiki-extensions-FundraiserLandingPage: DonateWiki - Special:FundraiserRedirector - Social Media Text Scrape - https://phabricator.wikimedia.org/T152181#2840856 (Jseddon) p:Triage>Unbreak! [06:14:29] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2840884 (AndyRussG) Made two partly-processed tables in Hive to be able to... [09:11:28] (CR) Hashar: "check experimental" [wikimedia/fundraising/stats] - https://gerrit.wikimedia.org/r/324774 (owner: Awight) [09:11:53] (CR) Hashar: "Did a basic CI config https://gerrit.wikimedia.org/r/#/c/324863/ which let us trigger a tox run by commenting "check experimental"." [wikimedia/fundraising/stats] - https://gerrit.wikimedia.org/r/324774 (owner: Awight) [09:12:33] (CR) Hashar: "check experimental" [wikimedia/fundraising/stats] - https://gerrit.wikimedia.org/r/324774 (owner: Awight) [09:41:43] (CR) Hashar: Add gitignore and tox support (5 comments) [wikimedia/fundraising/stats] - https://gerrit.wikimedia.org/r/324774 (owner: Awight) [10:53:22] Wikimedia-Fundraising, MediaWiki-extensions-FundraiserLandingPage: DonateWiki - Special:FundraiserRedirector - Social Media Text Scrape - https://phabricator.wikimedia.org/T152181#2841196 (Pcoombe) [10:53:24] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Thank You page to Facebook error - https://phabricator.wikimedia.org/T152026#2841199 (Pcoombe) [11:16:03] Wikimedia-Fundraising: Improve Facebook sharing process on Thank You page - https://phabricator.wikimedia.org/T152195#2841226 (Pcoombe) [11:17:59] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841242 (Pcoombe) p:Triage>High [15:05:28] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841662 (DStrine) I successfully posted a few days ago. I'm posting the expeced result below [1]. Also props to whoever setup the url for the link... 
[15:27:47] Wikimedia-Fundraising: Facebook preview looks pretty bad - https://phabricator.wikimedia.org/T152210#2841759 (TJones) [15:29:08] Wikimedia-Fundraising: Facebook preview looks pretty bad - https://phabricator.wikimedia.org/T152210#2841759 (TJones) [15:34:21] that face book preview bot is infuriating [15:34:41] a sizeable % of traffic on the web is that robot [15:35:05] feeding people the internet through face book [15:35:38] * cwd slides soap box under desk [15:36:53] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841783 (Pcoombe) @DStrine The problem is that Facebook scrapes links, and caches what it finds to show in previews. When no country is specified... [15:43:25] Wikimedia-Fundraising: Facebook preview looks pretty bad - https://phabricator.wikimedia.org/T152210#2841788 (Pcoombe) [15:43:28] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841791 (Pcoombe) [15:47:39] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841807 (Ejegg) Ah, I misunderstood the problem. So maybe we need to change the logic that shows the "can't find your country" message to take nav... [16:10:32] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841847 (Pcoombe) Okay, I've made the https://donate.wikimedia.org/?country=XX page have the same appeal content as a normal donation page, moving... [16:10:40] Fundraising-Backlog, Wikimedia-Fundraising, Patch-For-Review: Improve text scraped by Facebook from donatewiki - https://phabricator.wikimedia.org/T152026#2841848 (Pcoombe) Open>Resolved [16:17:38] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2841854 (K4-713) I'm seeing signs that this happened again last night at th... [16:19:30] K4-713: http://i.imgur.com/hmz1rro.gif [16:19:47] cwd: :) [16:20:39] I'm sorry are facebook and internet different? TIL [16:20:50] (said that just to trigger you all, have a good weekend!) [16:21:05] X| [16:21:19] The facebook isn't letting me dial in. [16:21:42] Guess I'll go make a sandwich. [16:21:58] i think this means that AOL won, conceptually [16:22:23] Damn it. I was always a Prodigy Star supporter myself. [16:23:02] that too [16:23:11] here's your app with a little window to the internet [16:23:39] Anyway, I'm sort of relieved that we have the identical problem at the same hour as yesterday. That's huge. [16:23:51] Particularly when there's similar stuff in the server admin log... [16:23:53] yep [16:24:47] Anyway, I'm not really here yet. I just wanted to run some numbers before coffee. [16:25:28] K4-713: what about the total drop out between 10 and 11?? [16:25:44] Heh. [16:25:56] That's an interference pattern between banners and an email campaign. [16:26:07] Email is the first spike. [16:26:16] What are you looking at? [16:26:41] doh ignore me [16:26:43] on the wrong day [16:27:22] ha, okay. 
[16:29:28] there are a lot of things about the data that are confusing [16:31:40] if i go back to the 28th it turns to solid blocks [16:31:53] did we dial down the frequency of reports? [16:33:47] e.g. https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480220823103&to=1480398391189&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_CentralNoticeBannerHistory [16:35:08] I'm definitely not the one to ask about all that. [16:35:36] But. [16:36:10] The campaign officially started on the 29th. [16:36:40] Maybe we did a Zoom In and Enhance. [16:36:51] hehe [16:37:02] wfm [17:12:47] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2842030 (DStrine) adding graph for tracking {F4941895} [17:53:47] so, we've got another 0800 dip this morning [17:54:18] yup [17:54:21] weird wild stuff [17:54:32] ah, and K4 says similar stuff in the server admin logs? [17:54:34] looking [17:55:28] hmm, similar meaning 'depool XXX' ? [17:56:09] 'cause those seem to be scattered all over the logs [17:57:48] ejegg: on the new table I made we could filter on db [17:57:58] Just trying to see how to do this in ipython [17:59:35] K4-713: cwd ejegg: also, looking at one of the banner history reports cps sent, we might have a similar dip on Nov. 30, though maybe it was shorter so it didn't show up on some graphs as easily? [17:59:52] oh hey, the-wub: When is the mobile campaign going to launch? [17:59:59] Gonna afk for about 15 min [18:00:02] fr-tech: They are called computers simply because computation is [18:00:02] the only significant job that has so far been given to them. [18:00:03] -- discuss. [18:00:18] lol, slander. [18:00:37] That reminds me of Talkie Toaster. [18:01:01] K4-713: Tuesday. we may run some pre-tests on Monday (in fact we had planned to run some this week, but have been too busy) [18:04:59] the-wub: Cool, thanks. :) [18:10:27] ok, I'm going to set up a phantomjs thing to request a wiki page every 10 minutes and see what Geo cookie and campaign choices we get [18:11:55] ejegg: what do you think about dash being a "submodule" of crm in vagrant puppet? [18:12:15] sure, that makes sense [18:12:23] relies on all the crm infrastructure [18:12:31] yeah that's what i was thinking [18:12:33] also, let's not worry about the oauth stuff [18:12:41] since you can run it locally with -d [18:12:41] cool [18:12:44] yeah [18:13:59] nice to be working in vagrant actually cause i am having mad network trouble today [18:19:34] ejegg: great idea [18:31:13] fr-tech K4-713 maybe it's a daily cron job that's killing us? [18:36:21] the timing seems to indicate something like that but i can't really imagine the mechanism [18:43:04] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, FR-Adyen, Unplanned-Sprint-Work: Adyen: setup a form for Canada - https://phabricator.wikimedia.org/T152123#2842285 (DStrine) update: advancement would like to take out the city and province/territory. when that form is ready, let's discu... [19:18:56] hmm, phantomjs keeps telling me window.mediaWiki isn't defined. looking inside a page.evaluate() function inside the page.open callback. [19:19:09] Anybody familiar with phantomjs? [20:05:22] AndyRussG: have you done much phantomjs stuff? [20:05:35] ejegg: sorry, nope 8/ [20:05:41] k [20:05:56] asked for help in -dev, but nothing yet [20:07:30] ejegg: ah ohwellz.... 
sorry I don't know much... Mmm question also 4 u: quick reminder of how to run queries on pgehres sampled impressions db? [20:07:34] (pointer 2 doc) [20:08:35] https://www.mediawiki.org/wiki/Fundraising_tech/Database_schema#pgehres.bannerimpressions [20:09:23] ejegg: also, how were you pulling info on datacenter yesterday? [20:09:39] thx in advance! [20:09:56] Sorry, cafe wireless flaked out [20:12:42] answers in otherchan [20:15:40] MBeat: did you get Anna sorted out? [20:16:58] we’re close, ty eileen1 [20:18:35] ok - cool - let me know if you need help [20:22:48] Did you guys figure out the impression drop issue yet? I heard a rumor that it had happened again [20:23:15] Looking at SAL makes me wonder why ops is depooling database servers on a Friday during the fundraiser <_< [20:23:56] yah same thing again, confusing! [20:25:08] RoanKattouw: still trying to figure it out [20:25:22] we know it only affected certain languages [20:25:36] en, de, fr, it, and simple [20:26:53] I'm trying to set up a phantomjs job to load a page and dump some cn vars every 10 minutes, but i'm a n00b and can't get at window.mediaWiki: https://phabricator.wikimedia.org/P4559 [20:27:55] Oh, interesting [20:28:14] Hmm I don't know how you get to the window object in phantom [20:29:16] http://phantomjs.org/api/webpage/method/evaluate.html suggests that what you're doing should work [20:29:17] yeah, I can get at document.title just fine inside page.evaluate [20:29:53] Maybe that function runs too early, before the code that defines window.mediaWiki has run? [20:30:03] maybe the page.open callback is firing before window.mediaWiki is set up? [20:30:05] Perhaps you need to wait for document ready or something [20:30:23] yeah, let me try putting it in a different callback [20:30:35] You could hack it with http://phantomjs.org/api/webpage/method/evaluate-java-script.html as well [20:30:50] Not sure how you'd get the data out in that case though [20:30:57] so yeah finding a different callback is probably better [20:32:15] aha, yep! [20:32:36] can totally find it in the later invocations of page.onResourceReceived [20:36:23] Awesome [20:36:32] That'll be a useful diagnosis tool [20:37:02] Also tomorrow should give us a good data point as to whether this phenomenon happens when no one is messing with DB servers (or anything else to do with the site) [20:37:20] RoanKattouw: could be maybe a daily cron job? [20:37:29] Yeah but at 8am UTC? [20:37:39] I dunno.... [20:37:42] I guess we'll have more info tomorrow [20:37:59] aha, looks like exactly the same languages were affected this morning [20:38:11] ejegg: ah interesting! [20:38:22] so... what makes en, de, fr, it, and simple different from the rest? [20:38:43] Couldn't be just that that's where we have campaigns? [20:38:47] I'd think it was related to the set of banners we're running, [20:38:56] but 'simple' isn't targeted anyplace [20:39:10] Or which datacenters people reading in those languages target most? [20:39:41] the difference between languages is way starker than the difference between datacenters though [20:40:32] huh ok [20:41:37] I feel like we should be running some automatic analysis on this, like some algorithm that could troll through all these columns and find which ones have a stronger relationship to variation in impression counts [20:41:57] I guess there aren't that many options, either, tho [20:42:50] yeah, correlation finding algorithm! [20:43:02] does analytics have any fun tools for that?
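(Roughly what the phantomjs probe described above ends up looking like once the timing problem is worked around. This is a sketch, not the actual script in P4559: window.mediaWiki is usually not defined yet inside the page.open callback because ResourceLoader loads modules asynchronously, so this version polls until mediaWiki.centralNotice shows up instead of hooking page.onResourceReceived; the centralNotice property names are assumptions about the client-side API.)

    // Sketch of a phantomjs probe that loads a wiki page and dumps the Geo cookie
    // and CentralNotice choice data once the modules have arrived.
    var page = require('webpage').create();
    var url = 'https://en.wikipedia.org/wiki/Special:Random';

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('load failed');
            phantom.exit(1);
            return;
        }
        var tries = 0;
        var timer = setInterval(function () {
            // Runs inside the page context; only serializable data comes back.
            var state = page.evaluate(function () {
                if (!window.mediaWiki || !mediaWiki.centralNotice) {
                    return null; // RL modules not loaded yet
                }
                var geo = document.cookie.match(/GeoIP=([^;]*)/);
                return {
                    geoCookie: geo ? geo[1] : null,
                    // Property names below are assumptions about the CN client API.
                    choiceData: mediaWiki.centralNotice.choiceData,
                    data: mediaWiki.centralNotice.data
                };
            });
            tries += 1;
            if (state !== null || tries > 30) {
                clearInterval(timer);
                console.log(new Date().toISOString() + ' ' + JSON.stringify(state));
                phantom.exit(0);
            }
        }, 1000);
    });

Run from cron every 10 minutes, the timestamped JSON lines give a record of the Geo cookie and the campaign choices a fresh anonymous client was offered before, during and after the 08:00 window.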
[20:44:20] another argument against datacenter is that nl, es, cs, and ro are unaffected while de, it, and fr are affected [20:45:09] no, sk, sv, hu also unaffected euro languages [20:45:24] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2839139 (Catrope) >>! In T152122#2841854, @K4-713 wrote: > I'm seeing signs... [20:49:24] Interesting re languages [20:49:29] ejegg: I updated that dedupe thing to contact.created_date as you suggested [20:49:47] And that there's no apparent relationship to which banners those langs are getting [20:50:10] ejegg: Hmm, have you looked at which DB shard each language is on? [20:50:15] * RoanKattouw finds the shard map [20:51:06] https://noc.wikimedia.org/conf/s1.dblist through to https://noc.wikimedia.org/conf/s7.dblist not sure if we have anything better [20:52:01] Hmm sectionsByDB in https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php is more succinct, if you add that s3 is the "everything else" shrad [20:52:03] *shard [20:52:40] Hmm no that doesn't lead anywhere [20:53:00] it is affected but no and sv aren't, and they're all s2 [20:53:58] thanks eileen1, I'll review again [21:33:45] dang, running phantomjs on my rentaserver and I never see the response callback run [21:34:07] though I can curl the same article just fine [21:34:42] guess I can leave my puter running all night [21:39:37] err, or get me a labs thing [21:39:53] anyway, feels like food time [21:43:01] That sounds like a good idea [21:57:05] RoanKattouw: so you think the issue could be limited to a given shard? [21:57:24] AndyRussG: Not any more because there's no correlation [21:57:41] Hmmm [21:58:07] RoanKattouw: do you know who owns the GeoIP lookup infrastructure? Is that bblack and ema (Traffic)?9 [21:58:11] ? [22:06:37] (CR) Ejegg: [C: 2] Add option to early-exit dedupe jobs based on the volume of contributions being processed [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [22:11:25] (Merged) jenkins-bot: Add option to early-exit dedupe jobs based on the volume of contributions being processed [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324655 (https://phabricator.wikimedia.org/T152072) (owner: Eileen) [22:23:41] (PS2) Ejegg: Revert "Link to the correct configuration form" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324757 (owner: Awight) [22:23:47] (CR) Ejegg: [C: 2] Revert "Link to the correct configuration form" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324757 (owner: Awight) [22:27:19] (Merged) jenkins-bot: Revert "Link to the correct configuration form" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/324757 (owner: Awight) [22:33:29] AndyRussG: I would guess traffic yes [22:33:36] Not totally sure [22:34:17] RoanKattouw: K cool, thx! Yeah currently picking bblack's brain over on -operations... More or less seems there's no way for servers to coordinatedly fail GeoIP lookup for cookie-setting [22:35:02] ...which would follow with what ejegg found earlier about geo- vs. non-geo-taregeted campaigns [22:35:43] yeah, and there was no correlation with the servers mentioned in the x-cache headers [22:39:13] RoanKattouw: ejegg: what about logged-in status? 
Is it possible that some cluster-side issue would make a bunch of anon users appear to be logged in from the point of view of client-side JS? [22:39:30] ooh, that'd be weird [22:39:41] this phantomjs script will pick that up [22:39:48] That could prevent a lot of users from being selected for a campaign... [22:40:13] that would cause a lot of other weird behavior too though, wouldn't it [22:40:16] ? [22:41:34] oh man, i finally figured out how to force my google fi phone to switch networks [22:42:04] it always chose this super slow one for some reason, like 100k and very flaky [22:42:35] ejegg: truly I dunno... anons get HTML cached from Varnishes, so I don't think they'd see most logged-ins' features [22:42:48] but the 3G sprint connect has fewer bars or whatever but is a pretty consistent 1-2mb [22:42:59] feels like lightning [22:43:57] cwd: cool! How do u like the Google fi? [22:44:52] well i am very ambivalent about google but they're certainly no worse than verizon [22:44:57] at least they innovate [22:45:13] the service is good and cheap [22:46:35] I think ejegg's script should capture a bunch of stuff [22:46:58] ejegg: Does choiceData also include which choice ended up being made, or only the data that is being used to make a choice? [22:47:16] I'll post the updated script - there's another property with the data being used [22:47:36] Hmm I guess since you're logging this from requests that you're sending yourself, not other people's requests, it doesn't matter [22:48:21] if there was something that made cn think it wasn't anon, that data would change [22:48:57] RoanKattouw: choice data is just the base set of possible campaigns and banners set to a client. Yeah, the other data ejegg is pulling should show all the criteria used to make a decision [22:49:51] RoanKattouw: Do we have any other existing client-side-JS status-checking instrumentation going? [22:50:57] I don't think we do [22:51:00] Hmm navtiming I guess [22:51:35] RoanKattouw: what does that tell us? [22:51:49] Whether frontend things in general went sideways [22:51:59] Specifically, it's data about how quickly pages loaded [22:52:02] RoanKattouw: K so where could we check that? [22:52:21] ejegg: RoanKattouw: Here's a data point that I think may indicate a RL issue: The dropoff seems to take about 10 minutes from when it starts to go down to when it's as low as it gets [22:52:33] Also recall choice data is cached using objectcache [22:52:46] Right [22:52:57] Yeah the gradual decrease and increase suggests RL, I agree [22:53:04] What objectcache does choice data use exactly? [22:55:00] https://grafana.wikimedia.org/dashboard/db/resourceloader and https://grafana.wikimedia.org/dashboard/db/navigation-timing are the dashboards I was referring to [22:55:43] AndyRussG: What was the time window of last night's dip? [22:55:55] I know it was 8:15 till 9:04 or something the night before last [22:59:28] Checking... 
[23:01:18] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: update civi record whit actual audit USD totals when we get them - https://phabricator.wikimedia.org/T152249#2843051 (DStrine) [23:04:15] RoanKattouw: almost identical timing: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1480550399000&to=1480680000000&panelId=6&fullscreen&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=eventlogging_CentralNoticeBannerHistory [23:04:28] (you can't really see the gradual dropoff there, but it's visible in more detailed data) [23:06:27] Hmph that's pretty sparse data :/ [23:06:38] But yeah you can see it's about the same time [23:07:01] RoanKattouw: yeah there is better data in Hive [23:07:45] https://phabricator.wikimedia.org/T152122#2840884 [23:08:07] Partially pre-munged [23:08:31] Also faster and decent data in a db we have filled by sampling beacon/impression calls [23:08:49] here's the updated phantom script that pulls cookie, cn.data, and cn.choiceData: https://phabricator.wikimedia.org/P4559 [23:09:39] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: update civi record whit actual audit USD totals when we get them - https://phabricator.wikimedia.org/T152249#2843080 (DStrine) p:Triage>Normal [23:12:37] AndyRussG: So I was looking at https://grafana.wikimedia.org/dashboard/db/resourceloader?from=1480665600000&to=1480672800000 and the INM and 304 graphs have noticeable dips at 08:10 and 09:10 [23:12:46] Zooming out now to see if that's just an hourly pattern [23:13:38] Hmm while there are multiple dips, they are not hourly [23:13:47] Maybe they're just another symptom [23:13:51] RoanKattouw: hum!!! what is INM? [23:14:34] INM = If-None-Match, it's the % of incoming RL requests where the browser indicates that it still has the resource in cache [23:14:52] So sends us a hash of the cached version that it has, and its request is really just asking if the resource changed [23:14:54] weird [23:15:14] So INM% is a proxy for "how many clients are hitting us with hot caches" [23:15:39] huh... there's another blip around 10:50 [23:16:01] RoanKattouw: so maybe RL startup module cache stampede? [23:16:06] There are more pronounced brief drops in the 304 satisfaction ratio at the same time [23:16:23] and an even bigger thing at 16:00 [23:16:41] That is the % of INM requests to which we respond "you're fine, go ahead and use your cached version" [23:17:02] nothing in the server admin logs to correlate with either 10:50 or 16:05 [23:17:31] It's normal for 304 satisfaction to drop really low and slowly recover when a resource changes [23:17:43] (Which can happen due to on-wiki edits BTW, for e.g. user and site JS) [23:17:53] hmm [23:18:16] It's a bit odd to me that INM drops coincide with those, but I don't have as much experience interpreting this data as the perf people do [23:18:47] * AndyRussG summons Krinkle.... aww not on IRC [23:18:52] (I did help come up with some of these metrics, and of course I designed the underlying caching system in the first place, but I am no longer the RL whisperer, Krinkle understands these graphs much better than I do) [23:19:11] Ah rather he's elsewhere [23:19:18] Yeah he's not in this channel [23:20:42] RoanKattouw: I'm gonna quote u in in -dev [23:21:31] o/ [23:21:39] ejegg: Blips in which data? [23:21:40] Krinkle: [23:21:50] ejegg: Banner impressions, or the RL dashboard? 
[23:21:52] 17:12 AndyRussG: So I was looking at https://grafana.wikimedia.org/dashboard/db/resourceloader?from=1480665600000&to=1480672800000 and the INM and 304 graphs have noticeable dips at 08:10 and 09:10 [23:21:54] 17:12 Zooming out now to see if that's just an hourly pattern [23:21:56] that RL dashboard [23:21:56] 17:13 Hmm while there are multiple dips, they are not hourly [23:21:58] 17:13 Maybe they're just another symptom [23:22:00] 17:13 RoanKattouw: hum!!! what is INM? [23:22:01] Oh yeah [23:22:02] 17:14 INM = If-None-Match, it's the % of incoming RL requests where the browser indicates that it still has the resource in cache [23:22:04] 17:14 So sends us a hash of the cached version that it has, and its request is really just asking if the resource changed [23:22:06] 17:14 weird [23:22:08] 17:15 So INM% is a proxy for "how many clients are hitting us with hot caches" [23:22:10] 17:15 huh... there's another blip around 10:50 [23:22:12] 17:16 RoanKattouw: so maybe RL startup module cache stampede? [23:22:14] 17:16 There are more pronounced brief drops in the 304 satisfaction ratio at the same time [23:22:16] 17:16 and an even bigger thing at 16:00 [23:22:18] 17:16 That is the % of INM requests to which we respond "you're fine, go ahead and use your cached version" [23:22:20] 17:17 nothing in the server admin logs to correlate with either 10:50 or 16:05 [23:22:22] 17:17 It's normal for 304 satisfaction to drop really low and slowly recover when a resource changes [23:22:24] 17:17 (Which can happen due to on-wiki edits BTW, for e.g. user and site JS) [23:22:36] Krinkle: So, let me catch you up on what we're investigating [23:22:49] T152122 [23:22:49] T152122: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122 [23:23:16] https://phabricator.wikimedia.org/T152122#2839196 shows banner impressions in the early hours of Dec 1 [23:23:24] (about 36h ago) [23:24:06] There's a ~20x drop that lasted about an hour, between ~08:10 and ~09:03 UTC yesterday (Dec 1) [23:24:38] Notice how there's a gradual ramp-down/up at each end, over the course of ~3 minutes [23:24:46] Krinkle: it seems to only affect 5 languages [23:24:53] en, de, it, fr, and simple [23:24:55] A similar drop happened at about the same time today (Dec 2) [23:25:09] same language set both days [23:25:18] No gradual data for that one on the task yet; ejegg can you tell us start and end times for that one? [23:25:27] (I know it's in Hive but I'm not sure I have access to that) [23:25:39] I'll get the stats [23:25:41] RoanKattouw: Where is the rampdown/up - which graph? [23:25:52] So there have been various theories thrown around, from RL weirdness to geoip hiccups to correlations with SAL actions [23:26:02] https://phabricator.wikimedia.org/T152122#2839196 , not a graph but an SQL result [23:26:08] ejegg: What is the first graph in task description of? It shows a dip in e-mails too. Is that response/confirm, or proactive reach mails? [23:26:23] Look at 8:05-8:10, and 9:01-9:05 [23:26:51] Krinkle: That's $/hr coming in from various sources [23:27:31] As you can see there's a natural (circadian rhythm) dip around 08:00 (presumably because that's when the west coast goes to sleep) but banners went down to basically zero at that time on Dec 1 and didn't on Nov 30 [23:27:57] Krinkle: e-mail stuff should be really unrelated. 
E-mails are sent to previous donors in batches, which take them straight to the payments cluster [23:28:24] https://phabricator.wikimedia.org/T152122#2839159 shows hourly totals of banner impressions, and https://phabricator.wikimedia.org/T152122#2839196 is the same but per minute [23:28:51] So it seems like every morning between just after 8am and just after 9am UTC, we all but stop serving banners to people [23:28:59] As in a ~95% decrease [23:29:04] And then after an hour it recovers [23:29:19] We don't know why yet, theories are: [23:29:25] 1. something something ResourceLoader [23:29:31] 2. something something GeoIP [23:29:54] 3. SAL entries about depooling and repooling DB hosts, which appeared in the SAL around those times on both days, but the correlation is not perfect [23:29:57] hmm, rampdown might start just before 8:00 this morning, let me expand that window [23:30:04] 4. Maybe a cron job or something? But who runs a cron at 08:00? [23:30:40] 8.10 and 9.02 RL dips in INM are most likely due to a resource changing (either file, on wiki page, or generated content from a RL subclass, such as banner data module). I can confirm this as css/js minification cache hit is down at the same time. [23:30:52] Should be harmless. [23:30:56] One of my wilder theories from yesterday was that maybe a US ISP was doing a one-hour IPv6 test around 3am EST, but there's no evidence for that and I don't think they'd do it twice in a row like this [23:31:19] I see [23:31:25] RoanKattouw: I don't see the natural rhythm you mention of it happening daily around 8am. Which graph is that? [23:31:32] I see it once [23:31:33] OK, so that suggests that there was a resource change both at the start and the end of the window [23:31:55] which could be the choiceData... [23:31:56] Krinkle: That graph is old and was generated before the second occurrence happened [23:32:09] It's only happened twice, and one of the occurrences was today [23:32:21] So that graph shows Nov 30 and Dec 1, and the events happened on Dec 1 and Dec 2 [23:32:28] ejegg: Indeed. See also #wikimedia-perf-bots which detects when module version hashes change on nlwiki [23:32:29] though how that would change without anything showing in the CN logs is difficult to imagine [23:32:44] ejegg: Yeah if something caused choiceData to change and then change back, you would expect to see that in the RTL graph, because that's a resource change [23:32:44] ooh [23:32:47] Darn, the bot is offline since 2 days [23:32:48] Krinkle: Oooh good point [23:32:52] DAMMIT [23:33:03] just nlwiki? [23:33:18] nl was one of the unaffected languages tho [23:33:33] I was so excited that I was gonna have a use for my chan log of perf-bots [23:33:37] s/RTL/RL [23:33:45] ejegg: What's that about nlwiki? [23:34:09] Krinkle said "...detects when module version hashes change on nlwiki" [23:34:40] Oh, right [23:34:53] I guess that was because almost all modules are identical across wikis [23:34:56] CN very much isn't of course [23:35:02] yah [23:35:08] Krinkle: Would it be possible for you to restart that bot but point it at enwiki instead?
[23:35:35] If the dip happens again tonight, a log of module hash changes on enwiki will be valuable [23:36:40] One thing noted w/ bblack on -operations was that the objectcache ttl for choicedata is 1hr [23:36:47] Aha [23:37:04] So if something poisoned the choicedata cache at 8am, that would match all symptoms we've seen so far [23:37:13] It fixing itself after an hour, two drops in the RL graph [23:37:30] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2843120 (Ejegg) Dip for 8/2: | 07:57:00 AM | 21295 | | 07:58:00 AM | 21267... [23:37:39] yeah [23:37:50] Krinkle: It's a bit puzzling to me that a resource change would result in a lower INM% (I understand lower 304%). Not a big deal, just curious why that happens if you happen to know [23:37:56] Today's dip was almost perfectly aligned on the hour [23:38:08] Interesting [23:38:14] I'll write this theory on the task [23:38:31] RoanKattouw: Because we have versioned module urls [23:38:39] Oooh [23:38:41] Right [23:38:47] Our cache hit isn't a request at all [23:38:56] it's a local http cache hit in the browser, which we don't count [23:39:13] The 304 dip is for the startup module and the INM dip is for requests to the changed module [23:39:13] that's why our "chacheable url cache hit ratio" (long, 304) is nearly 0 [23:39:20] which is a good thing [23:39:26] oh wait, does that mean we can get the versions from load.php webrequests? [23:39:33] RoanKattouw: restarted the bot, and pointed at enwiki [23:39:39] ejegg: ^ yep [23:39:40] ejegg: Yes kind of [23:40:01] ejegg: But if you are thinking of your Phantom script, then you can get it more easily from the RL registry in JS [23:40:09] Maybe we should just put it in there [23:40:17] Then we don't have to rely on perflogbot [23:41:03] what phantom script? 
[23:41:48] Krinkle: ejegg wrote a phantomjs script that he's going to run every 10 mins overnight [23:41:53] https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=ext.centralNotice.bannerHistoryLogger%2CchoiceData%2Cdisplay%2CgeoIP%2CimpressionDiet%2CkvStore%2CkvStoreMaintenance%2ClargeBannerLimit%2ClegacySupport%2CstartUp%7Cext.centralauth.centralautologin%7Cext.eventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.eventLogging.subscriber%7Cext.gadget.DRN-wizard%2CReferenceTooltips%2Cchar [23:41:54] It dumps mw.cn.choiceData [23:41:55] insert%2Cextra-toolbar-buttons%2Cfeatured-articles-links%2CrefToolbar%2Cswitcher%2Cteahouse%2Cwatchlist-notice%7Cext.quicksurveys.init%2Clib%7Cext.uls.common%2Ceventlogger%2Cinit%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.supportCheck%2CtargetLoader%2Ctrack%2Cve%7Cjquery.byteLength%2CcheckboxShiftClick%2Ccookie%2CgetAttrs%2ChighlightText [23:41:57] %2CmakeCollapsible%2Cmw-jump%2Cplaceholder%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.uls.data%7Cmediawiki.Title%2CUri%2Capi%2Ccldr%2Ccookie%2Cexperiments%2CjqueryMsg%2Clanguage%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cviewport%7Cmediawiki.action.view.postEdit%7Cmediawiki.api.options%2Cuser%7Cmediawiki.language.data%2Cinit%7Cmediawiki.libs.pluralruleparser%7Cmedi [23:41:59] awiki.page.ready%2Cstartup%7Cmediawiki.template.regexp%7Cmediawiki.ui.button%2Cicon%7Cmmv.bootstrap%2Chead%7Cmmv.bootstrap.autostart%7Coojs%2Csite%7Cschema.NavigationTiming%2CQuickSurveyInitiation%2CSaveTiming%2CUniversalLanguageSelector%7Cskins.vector.js%7Cuser.defaults&skin=vector&version=18t5k42 [23:42:01] Oops sorry [23:42:09] ejegg: ^ version hash [23:42:19] ah, but that's the whole bundle [23:42:24] RoanKattouw: ejegg: Right, in that case you can also call mw.loader.getVersion(moduleName) and log that longside [23:42:26] Hmmm maybe it works in mysterious ways [23:42:26] AndyRussG: Yeah although that's a compound hash, it's a hash of all the hashes of the modules in the bunde [23:42:38] Krinkle: thanks! good call [23:42:48] Oh there's a function for that now? Nice! [23:42:59] RoanKattouw: you wrote that in 2010 ;-) [23:43:07] LOL [23:43:10] * RoanKattouw -> bash [23:43:30] It was a good year [23:44:06] Hehe, yeah [23:46:01] hehe [23:46:25] ejegg: In addition to your phab script maybe you can also request load.php?modules=ext.centralNotice.choiceData every 5 mins [23:46:40] ^ ah great idea too [23:46:56] I'm writing up this theory and our proposed course of action on phab [23:46:59] ejegg: Krinkle: RoanKattouw: at least for the first outage, we saw no changes to CN config anywhere near the beginnings or the ends of it [23:46:59] right on [23:47:20] So if something poisoned the RL cache, it'd be some error in the DB call, no? [23:47:30] AndyRussG: The banner choice data changes on its own though, right? Based on preconfigured timestamps and some other factors? [23:47:36] (as well as translations potentially) [23:47:54] That'd be in logstash [23:47:56] * Krinkle checks [23:47:57] Krinkle: hmmm true, it does change based on start and end times, lemme try to check that [23:48:09] Though presumably the cache method is written in such a way to not cache errors [23:48:23] It would cache at the RL layer very shortly only (5min, startup module) [23:48:27] translations, no that wouldnt' change stuff [23:48:31] and not be written to memcached [23:48:42] Hmmm I hope so. 
Let's see [23:48:59] whew, 12k of choiceData [23:49:10] wonder if we should be abbreviating some of those properties [23:49:30] It's a call within cache->getWithSetCallback() [23:49:36] Krinkle: The caching exptime for choiceData is 3600 and that fits perfectly with the duration of the bug [23:49:38] I dunno how that handles errors [23:50:00] Hence why we're currently theorizing that something poisons the data at 08:00, and then at 09:00 it expires and is refreshed with correct data [23:50:16] ejegg: ^ huh that's an idea! Also we should Geolocate on the server, no? [23:50:43] Yeah, now that we don't have to force the ipv4 lookup for some users, we can filter by country [23:51:08] Fundraising-Backlog: Caption on Facebook links should be something better - https://phabricator.wikimedia.org/T152250#2843126 (atgo) [23:51:09] i'll make a phab task [23:51:34] Yeah if you can cut down on that 12k that would make a number of us happy [23:52:55] ^ in the millions [23:53:18] Fundraising-Backlog, MediaWiki-extensions-CentralNotice: CentralNotice choiceData should be filtered by country server-side - https://phabricator.wikimedia.org/T152251#2843138 (Ejegg) [23:53:48] I don't see any campaigns currently configured with start or end times coinciding with the beginning of the outage. However, there was one that started at 2016-12-01 09:00 (C16_WMCH_Test01) [23:55:00] And another one that started at 2016-12-02 09:00 [23:56:16] hmm, if someone's up around at 0800 and sees the same thing, they could try saving a test campaign to invalidate that cache [23:56:51] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Unplanned-Sprint-Work: Central Notice: possible CN issue early on December 1st UTC - https://phabricator.wikimedia.org/T152122#2843150 (Catrope) Current theory: 1. Something somewhere poisons (=writes b... [23:57:02] ejegg: Yeah I'm thinking about staying up [23:57:09] 08:00 UTC is midnight my time [23:57:32] do you have CN admin? [23:57:34] Problem is, I slept from like 2am till 7am last night so I may get tired before then [23:57:42] Probably not? I don't know how to use it anyway [23:57:49] ah, don't lose sleep over it [23:57:58] Also I'm only on day 2 of coming back from the east coast jet lag, so who knows if my body will think it's midnight or 3am [23:58:15] we could email pcoombe and ask him to check it when he wakes up [23:58:21] Night 1 was awesome (10pm - 7am), night 2 sucked (2am - 7:30am) [23:58:25] Hmm I guess [23:58:29] It'll be 8am Saturday for him [23:59:00] heh, who's got CN admin in Asia? [23:59:53] I mean the main people I'd want to be around for this are AndyRussG and either you or awight, but 08:00 UTC is 3am EST :/
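(A sketch of the extra logging suggested above: record the ResourceLoader version hash of the CentralNotice modules alongside the choice data itself, so that if the choiceData cache really is being poisoned at 08:00 and expiring at 09:00, the hash should change at both edges of the window. Assumes it runs in the page context, for instance inside the phantomjs probe's page.evaluate; the module names are from the load.php registry seen earlier, but mw.centralNotice.choiceData as used here is an assumption.)

    // Log RL version hashes next to the choice data (run in the browser / page context).
    var modules = [
        'ext.centralNotice.choiceData',
        'ext.centralNotice.display',
        'ext.centralNotice.geoIP'
    ];
    var versions = {};
    modules.forEach(function (name) {
        // mw.loader.getVersion() returns the registered version hash, or null if unknown.
        versions[name] = mw.loader.getVersion(name);
    });
    console.log(JSON.stringify({
        time: new Date().toISOString(),
        versions: versions,
        // Size of the choice data blob (the "12k of choiceData" mentioned above).
        choiceDataLength: JSON.stringify(mw.centralNotice.choiceData || []).length
    }));

A companion request to load.php?modules=ext.centralNotice.choiceData every few minutes, as Krinkle suggested, would catch the same change from outside the browser.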