[15:10:31] Gasp, just saw honest-to-god GlobalCollect
[16:05:03] I'm updating precise-security wget 1.13.4-2ubuntu1.1 -> 1.13.4-2ubuntu1.2 on all servers
[17:28:55] morning awight, ejegg :)
[17:35:14] AndyRussG: hey!
[17:35:25] hey ho :)
[17:35:45] We have a little meltdown of the donations queue we're trying to clean up...
[17:35:53] ohno
[17:36:07] can I help?
[17:37:10] I have a "little" addition to make on the bucket proposal that I'd like to run by folks, following some calls this morning
[17:38:04] See specifically the new Requirement 6 on page 3 of the bucket Google doc
[17:40:27] mmm neway anytime such minor stuff might arrive on the to-do radar pls let me know, thanks in advance :)
[17:40:50] sure thing! This will probably take a couple of hours...
[17:41:38] cool, thanks! again, if I can help w/ anything I'm here...
[17:52:02] Someone would like a translation (in French, of the message received after donation)
[17:55:42] Frakir: that should already be translated--it sounds like this is a bug. cf. https://wikimediafoundation.org/wiki/Thank_You/fr
[17:57:03] Ok
[17:58:28] Frakir: if you have the chance, please encourage this person to contact donate@wikimedia.org with a short description of the issue. Thanks!
[17:58:42] (it is on OTRS)
[17:58:55] aha, great
[17:59:07] (PS1) Awight: cheesy solution to prevent null explosion [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170739
[18:01:12] ejegg: ^^ some CR to laugh at
[18:02:45] I'm looking at the fredge issue, now. Believe it or not, it seems to be completely unrelated!
[18:03:16] wow, we're getting plenty of serendipity lately
[18:04:21] OK, null->empty makes sense in a log. Don't want the flight recorder to be the cause of the crash.
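A minimal sketch of the null->empty idea from the 18:04:21 message, written in Python rather than the crm module's own language. The function name `format_log_fields` and the field names are hypothetical and are not taken from the patch at r/170739; the only point illustrated is that a missing value is logged as an empty string so the logging code itself cannot be what fails.

```python
def format_log_fields(fields):
    """Render a dict of log fields as space-separated key=value pairs.

    None values are written as empty strings, so a malformed message
    (for example one with no subscription id) can still be logged.
    """
    parts = []
    for key in sorted(fields):
        value = fields[key]
        # Hypothetical rule: null becomes empty, everything else is stringified.
        text = "" if value is None else str(value)
        parts.append(f"{key}={text}")
    return " ".join(parts)
```

Sorting the keys is only there to make the output stable; it is an assumption of this sketch, not something discussed in the channel.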
[18:04:40] (CR) Ejegg: [C: 2] cheesy solution to prevent null explosion [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170739 (owner: Awight)
[18:04:42] (Merged) jenkins-bot: cheesy solution to prevent null explosion [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170739 (owner: Awight)
[18:04:47] U missed it last week, when I deployed a payments thing that smelled a bit, and suddenly two different groups in the office converged on Katie with different banner bugs!
[18:05:28] oh man, hope we're shaking 'em all out now
[18:06:16] It may be time to apply more flea medication
[18:06:26] we can kick that subscription message out of the queue and restart the consumer, right?
[18:06:32] (PS1) Awight: Merge remote-tracking branch 'origin/master' into HEAD [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/170745
[18:06:35] ejegg: I'd like to leave the msg in there
[18:06:43] OK, test the patch, right
[18:06:46] it should be consumable...
[18:06:57] I'm not totally sure the message is still there, though...
[18:07:05] was last I checked
[18:07:22] ok yeah, "unrecoverable error"
[18:07:29] yep, correlation id with the guid sticks out like a sore thumb
[18:07:34] hehe
[18:07:37] ok here goes nothing
[18:07:45] deploying?
[18:07:50] yep
[18:08:06] Cool, I was about to do that for the recurring GC patch and the email change thing
[18:08:17] those were rolled in...
[18:08:22] thanks!
[18:08:26] yah, https://gerrit.wikimedia.org/r/#/c/170745/
[18:08:32] (CR) Awight: [C: 2] Merge remote-tracking branch 'origin/master' into HEAD [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/170745 (owner: Awight)
[18:08:40] (CR) Awight: [V: 2] Merge remote-tracking branch 'origin/master' into HEAD [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/170745 (owner: Awight)
[18:09:09] !log update crm from f47ed6f7e55946388db1dde787ca458c27a57c5a to b8a1fa98b5d9252d708090c99b61fd22ebe8d2be
[18:09:18] Logged the message, Master
[18:09:28] !log restarting donations queue consumer
[18:09:33] Logged the message, Master
[18:09:50] looks healthy.
[18:10:35] hrm, it choked on that subscr_signup still, but correctly kicked it into a failmail
[18:16:52] the-wub: just wondering... if we turned on an Amazon banner last night or something.
[18:19:17] utm_source was fr-redir.default~default~default~default~control.ramazon, so that's not a banner link, right?
[18:20:37] I cannot remember where fr-redir comes from
[18:20:58] let me check the utm_source documentation
[18:21:00] j/k
[18:21:32] heh
[18:21:58] awight: heh. did it have a utm_campaign?
[18:22:25] nope
[18:22:54] oh, but the referrer was from wikimediafoundation.org/wiki/Home!
[18:25:09] ah, found it. there are links to https://donate.wikimedia.org/wiki/Special:FundraiserRedirector right at the bottom of that page. I will fix those to have some tracking info
[18:26:21] no idea how someone managed to make a recurring Amazon transaction though. they would have just seen our normal landing page.
[18:27:45] Mysterious JS glitch that leaves the 'amazon' button up with 'monthly' selected? who knows...
[18:28:26] There might be a way to change it within the Amazon workflow, also...
[18:28:42] We should... probably just fix our recurring Amazon code.
[18:29:45] huh, can't see a way to change it from amazon's site.
but yeah, the amazon gateway shouldn't bounce that through in the first place
[18:31:11] It's the listener that created that message... which is fine, but I guess it needs to land in the recurring queue.
[18:37:55] ejegg: oh I lied, the two problems were caused by the same message
[18:43:35] oh, cool
[18:47:11] ejegg: if you're still logged into the queue box?
[18:47:18] Please kill amazon-9263064609...
[18:47:43] sorry, outside for a moment
[18:47:51] one sec
[18:48:04] ejegg: oh, don't worry, then!
[18:55:32] !log restarting fredge consumer
[18:55:38] Logged the message, Master
[18:57:47] ejegg|brb: kicked the corrupt message out and restarted fredge... looks good for now.
[19:00:43] I guess standup gets amazoned to e-mail today?
[19:00:45] good stuff
[19:01:48] Weren't we gonna move those back now that nobody's continental?
[19:02:19] yah let's start that today?
[19:02:26] 2:33 PST?
[19:02:33] sounds good to me!
[19:03:11] Oh yeah I forgot 8p blrrggg
[19:03:39] thanks :)
[19:04:06] I forgot too!
[19:11:18] AndyRussG: ok, I looked at requirement 6, sounds correct! What was the impact on yr technical solution?
[19:11:34] Or, which part u wanna discuss?
[19:11:43] awight: mwahahahahahha
[19:11:44] :)
[19:11:53] halloweeny!
[19:12:35] The impact on the technical solution is that the server will send spooky per-campaign snippets of javascript together with the list of possible campaigns and banners together with the banner controller on the first request
[19:12:43] Mwaaaaaaahahahahaheheheeee
[19:12:50] hrrrm
[19:12:59] what about the simpler "variant" thing...
[19:13:04] Thereby eliminating the no-impression bump
[19:13:07] Ah yes that is still there
[19:13:30] I think that was enuf to solve for the top vs fullscreen thing
[19:13:42] Sadly almost but not quite
[19:13:49] I'm all for code-reviewed javascript, though!
[19:14:12] Yeah it can indeed be separate from other banner stuff
[19:14:22] ah I see--cos cross-campaign
[19:14:26] exactly
[19:14:56] did you have that conversation with meganhernandez about using campaign to contain exactly one experiment?
[19:14:58] some campaign-specific JS is needed if you want your full-scream-banner-already-seen cookie value to be interpreted
[19:15:02] yes
[19:15:22] yeah we talked this morning, and then again briefly with the-wub
[19:15:28] agreement?
[19:16:10] on the product side, re: campaigns/tests... it seems they do want the button to reshuffle buckets at some designated time within a single campaign
[19:16:27] rather than having so many campaigns
[19:16:37] Thinking about the per-campaign js, I'm not fully understanding something--cos we had realized that inlining the banner/campaign list would not have an effect on the first impression
[19:16:40] since in the US due to the numbers, some tests can even last as little as an hour
[19:16:49] urrrgh
[19:16:55] nooooooooooooooooo
[19:16:57] boo!
[19:16:59] sorry
[19:17:48] reshuffling buckets manually is such a bad idea... they will have to click that between *every* test
[19:18:26] well it could also be like a thing where you set a time for it to happen and that's recorded somewhere...
[19:18:37] I think it's not the most essential bit
[19:18:57] the per-campaign js, basically pre-banner js, is for the use case of an already-had-fullscreeeen being present at the start of another campaign that has the same fullscreen-and-then-top-banner layout
[19:19:16] It's just that there's almost no use case for not resetting buckets between tests...
[19:19:41] you need something to run before the first banner is fetched, to decide on the very first go not to show the fullscreen again
[19:19:51] AndyRussG: yes, so what will it do, on the first impression of the second campaign?
[19:20:08] Sounds like we still need compounded top/fullscreen banner html?
[19:21:14] in that case you'll set some per-campaign pre-load js and that'll check for the full-screen cookie, and if you have it, put you in phase 2 right away and fetch the appropriate banner based on that
[19:21:32] so we are accepting the additional round-trip?
[19:21:34] or rather, the "fetch the appropriate banner based on that" will just be the normal bannerController code
[19:21:36] no
[19:21:44] We're avoiding the additional round-trip, exactly
[19:21:54] I don't get it
[19:22:00] mmm OK
[19:22:16] the preload js is in the response from BannerRandom
[19:22:21] no
[19:22:23] in which we already have banner html...
[19:22:39] Step 0: a campaign with full and then top banners is run and 60% of the users have a cookie that says "already saw full-screen"
[19:22:44] k
[19:23:17] Step 1: start a new campaign that also does the same full then top thing. As before full is phase 1, and top is phase 2
[19:23:26] ook
[19:23:40] So user goes to enwiki
[19:23:48] if it helps... https://www.mediawiki.org/wiki/Extension:CentralNotice/Banner_loading
[19:25:57] enwiki responds with an HTML page with the normal bannercontroller js (as currently is the case), + a list of campaigns and banners for the client to choose from (as already set out in the proposal) + some pre-load per-campaign js snippets if the possible campaigns happen to have any (this is the new bit)
[19:26:18] so there is no additional round-trip there, just some new small JS bits getting sent in with the initial response
[19:26:19] wait
[19:26:26] we cannot add anything to the banner controller response
[19:26:37] it's served thru ResourceLoader...
[19:26:52] where are we putting the list of campaigns to choose from?
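A rough sketch of the pre-load check described at 19:21:14 and in Steps 0 and 1, in Python for illustration only (the real thing would be per-campaign JavaScript). The cookie name `fullscreen_seen`, the phase constants and the `banners_by_phase` structure are all hypothetical; nothing here is taken from CentralNotice itself.

```python
FULLSCREEN_PHASE = 1  # phase 1: full-screen banner
TOP_BANNER_PHASE = 2  # phase 2: top banner


def choose_phase(cookies, seen_cookie_name="fullscreen_seen"):
    """Decide the phase before any banner is fetched.

    A user who already carries the (hypothetical) already-saw-full-screen
    cookie from an earlier campaign starts directly in phase 2.
    """
    if cookies.get(seen_cookie_name) == "1":
        return TOP_BANNER_PHASE
    return FULLSCREEN_PHASE


def choose_banner(campaign, cookies):
    """Return the banner name for the phase selected by the pre-load check."""
    phase = choose_phase(cookies)
    return campaign["banners_by_phase"][phase]
```

Because the decision uses only data already on the client, it costs no extra round-trip, which is the property being argued about above.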
[19:27:17] I was proposing, it's inlined with bannerhtml, so it's available on the client for the second and following requests
[19:27:32] maybe we could have a parallel request to get the list
[19:27:56] it would require a skin change however, to load any sooner than the current BannerRandom
[19:28:33] I was gonna send it directly in output from efCentralNoticeLoader, in CentralNotice.hooks.php
[19:29:15] It'd still be deterministic, based on language and project only
[19:29:18] This is a skin change... which is doable, but a bit drastic
[19:29:34] these take about two weeks to propagate
[19:29:58] can't we stick it in with the hook?
[19:30:06] that's in the skin, no? /me looks
[19:30:29] yeah
[19:30:40] The hook is BeforePageDisplay
[19:30:58] you're right that we can have language and project, cos those are known at the time of skin render
[19:31:19] this html is then cached for 15 days, and the PHP doesn't run again... fwiw
[19:31:45] huh, 15 days?
[19:32:09] I totally encourage this sort of drasticness, I'm just mentioning that the consequences expand, and we need to plan for risks
[19:32:10] how do wiki page changes get thru then?
[19:32:31] ah yeah I thought it was clear this was drastik
[19:33:39] AndyRussG: I'm not seeing the cache stuff in headers... it's not cached on the client, but on the varnish front-end server
[19:34:51] awight: If I go to an enwiki page I get wiki content in the HTML of the initial HTTP response
[19:35:02] so that's gotta refresh faster than 15 days...
[19:35:12] Yes that's what I mean about not caching on the browser
[19:35:14] it doesn't necessarily have to go in the header
[19:35:26] no, 15 days is how long it will take for pages that have not been changed to update
[19:35:33] there is a manual purge we can do, but it's really expensive
[19:35:41] only for emergencies, like a broken skin...
[19:35:55] fwiw I'm looking at http://www.mediawiki.org/wiki/Manual:Varnish_caching
[19:36:00] aaaaaaaaaaaaaaahhhh blarg!
[19:36:12] They say, when a page is edited, mediawiki makes a purge request to varnish.
[19:36:24] (PS1) Ejegg: Avoid some trouble in non-USD contribution recalc [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170771
[19:36:27] (CR) jenkins-bot: [V: -1] Avoid some trouble in non-USD contribution recalc [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170771 (owner: Ejegg)
[19:37:39] well that is an issue........
[19:37:51] AndyRussG: Anyway, I like the idea of loading the banner+campaign list from the skin.
[19:38:00] I think.
[19:38:04] (PS2) Ejegg: Avoid some trouble in non-USD contribution recalc [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170771
[19:38:14] why the skin?
[19:38:26] == the efCentralLoader hook
[19:38:54] still it'd get the same long-delay varnish issue you're talking about, no?
[19:38:56] sorry, "body" is probably less confusing
[19:38:58] yah
[19:39:22] not a blocker. but it does raise the stakes.
[19:39:26] what about like a rolling day-long refresh?
[19:40:20] I think that's what effectively happens in a purge, cos pages are requested in random order
[19:40:28] As in, when a campaign starts somewhere, we gradually refresh all the varnish cache for that project/language over the course of a day?
[19:40:29] but -operations would have more to say about this...
[19:40:57] oh, we can't go around purging more than once a month or something
[19:41:06] drat
[19:41:09] it takes maybe a week for the load to settle, IIRC
[19:41:10] I think we're back to square 1
[19:41:38] why? we add a line to the body, which loads the banner/campaign list
[19:41:50] we can set the cache lifetime on that list independently of the main page content.
[19:42:13] all I'm pointing out is that we should be careful not to break the site :)
[19:42:25] ah right point well taken on that one
[19:42:56] We could even introduce the skin change now, before we've implemented the endpoint...
it can respond with ""
[19:43:11] and will be ready to use by the time you've made the real changes
[19:44:17] awight: hmm OK just to see if I'm understanding...
[19:45:12] banner controller is loaded via resource loader
[19:45:36] so the idea would be to send in the list... as a separate js/json thing to pull in in parallel?
[19:46:15] yeah, then we have some kind of thing which synchronizes banner loading once both the controller and list are present.
[19:48:28] hehe I'm getting trout-slapped in -operations for suggesting this
[19:50:17] awight: I see now hmmm
[19:51:11] but for ejegg's benefit, they're saying that skin changes will take until "wednesday plus a week"
[19:51:22] OK well making sure something is doable is really a prerequisite to moving forward on anything like this solution
[19:51:39] Looks totally doable
[19:51:44] let me finally study Varnish and the setup a bit more, I've been putting it off
[19:51:52] And, we should make that skin change today or tomorrow...
[19:52:09] I don't understand what you mean by a skin change
[19:52:13] * AndyRussG shows his ignorance
[19:52:38] Maybe this is not a shorthand people actually use. But I think that's what mediawikiots call changes to the HTML surrounding each page
[19:52:45] even if it isn't strictly in the Skin module
[19:53:04] so, loading anything beyond resourceloader modules?
[19:53:14] even RL would be a skin change
[19:53:33] cos we would be adding the HTML to load the module...
[19:53:35] awight: ah OK I see
[19:53:48] And that's cached differently somehow? in memcache?
[19:54:03] oh wow, I thought we got rl modules updated with extensions most of the time
[19:54:12] we do!
[19:54:56] wouldn't it still need to wait the 15 days for varnish?
[19:55:05] I mean the "skin" change
[19:55:41] And we're still talking about adding another $out->addHeadItem() call in efCentralNoticeLoader?
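The "synchronizes banner loading once both the controller and list are present" idea from 19:46:15 could be sketched roughly as below. This is Python with asyncio standing in for two parallel browser requests; the function names, the return values and the rule that an empty list means "keep the current behaviour" are assumptions made for the sketch, not anything agreed in the channel.

```python
import asyncio


async def load_controller():
    # Stand-in for the banner controller arriving via ResourceLoader.
    await asyncio.sleep(0)
    return "controller"


async def load_choices():
    # Stand-in for the separately cached campaign/banner list.
    # An empty list models the placeholder endpoint that responds with "".
    await asyncio.sleep(0)
    return []


async def start_banner_loading():
    """Wait for both pieces, then decide how to pick a banner."""
    controller, choices = await asyncio.gather(load_controller(), load_choices())
    if not choices:
        # Hypothetical fallback: nothing in the list, so use the existing path.
        return "existing-path"
    return "client-side-choice"
```

`asyncio.gather` only returns once both coroutines have finished, which is the synchronization point; running `asyncio.run(start_banner_loading())` with the placeholder list takes the fallback branch.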
[19:56:18] no, RL modules are fine to update cos they are only one URL
[19:56:30] for the skin, it is 50M URLs or something
[19:58:50] I see... basically since varnish is a giant lookup table of URLs <-> content
[19:58:59] yeah
[19:59:05] changing 1 URL is a piece of pie
[19:59:25] and changing a million is the biggest donut in the history of the universe
[19:59:40] well, it just means that the PHP has to actually run for every page
[19:59:49] and PHP is a dirty dog
[19:59:57] that likes to poop in its own bed.
[20:00:08] hmff
[20:00:23] though it does run every page that is visited by a logged-in user, no?
[20:00:36] (CR) Awight: [C: 2] "well spotted!" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170771 (owner: Ejegg)
[20:00:38] (Merged) jenkins-bot: Avoid some trouble in non-USD contribution recalc [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170771 (owner: Ejegg)
[20:00:43] AndyRussG: I think so
[20:00:49] OK
[20:01:01] thanks, awight! Gonna try to get that onto staging
[20:01:07] But since that's such a massively small percentage of visits, it's like nothing, no?
[20:01:12] yep
[20:01:35] OK I think I got it
[20:01:40] So, you probably saw that the ops said "rollback your skin change next wednesday" :)
[20:01:50] in case of emergency, carefully set glass on the floor
[20:02:51] awight: OK I think I get it
[20:03:10] skin change = the same as the whole cache refreshing and PHP running for every page?
[20:03:33] err yeah but there is a controlled process for introducing skin changes
[20:03:59] so we're free to make skin changes, we just can't dictate how they are propagated... we "ride the train" :)
[20:04:38] awight: ahhh OK I see
[20:05:21] and so if we introduce a change in the header to call another URL, even if it's gonna be blank for now, we still have to set up the URL to answer it, no?
[20:05:28] yes
[20:05:34] I guess it can just be a CN special page or something
[20:05:57] or API!
[20:06:20] And...
if we're even remotely considering this, we should probably jump on the parts of the system which interact with the skin, and make those changes... today or something!
[20:07:01] otherwise, I would worry that inertia will kill this initiative, cos it would then be > 2 weeks before we can start testing.
[20:07:38] It's probably very prudent to roll out a null responder, too... so that we only change one variable at a time.
[20:07:41] awight: right...
[20:11:52] I think an actual rollback in functionality won't be too rough, though, since nothing happens until we start changing bannerController.
[20:12:10] And so rolling back will just involve rolling back bannerController, too
[20:12:14] Does that make sense?
[20:12:34] well, there will be a second endpoint we'd have to rollback as well
[20:12:52] but yeah, I think we can manage it, just have to keep in mind that all clients will be hammering our second endpoint
[20:13:20] and that old and new styles have to be able to coexist, so that if we do roll back, we don't disable people from seeing banners for a week
[20:14:15] awight: exactly
[20:14:39] OK I'm gonna study to try to understand all of the infrastructure that affects this
[20:15:02] rad. I'm excited!
[20:15:07] and then, if you like, I can ask again on operations to make sure I've understood and the plan is, heh, "sound"
[20:15:46] only if you want to feel the sting of cold trout :p
[20:16:14] I'd rather be trouted than bring down the site or big Eng fundraizer, heh
[20:16:22] yep!
[20:16:39] Just don't tell katie or anne we're doing this ;)
[20:16:45] katie who?
[20:16:50] <_<
[20:16:51] >_>
[20:17:00] you must know her by her robot name
[20:17:53] heh yeah
[20:18:08] OK I think we have a plan, or at least a "plan"
[20:18:42] awight: thanks so much for bringing me up to speed on these... "details" ;p
[20:18:42] it's still tickling me that this is the architecture we abandoned 2 years ago.
[20:19:00] ?
[20:19:14] The only issue at the time was that loading the banner list was a second round-trip
[20:21:48] ah hmmm right
[20:22:00] K thanks! back in a short little bit
[21:21:01] awight: I'd like to test the recurring globalcollect fix before I turn the job back on. So just "drush rg --batch=10" ?
[21:24:45] ejegg: I think so. u can peek at how the job is configured in Jenkins for the exact commandline
[21:25:31] ejegg: I remember that module has some really annoying semantics, there is a "maximum" batch size which you can set through the GUI, a default batch size (also in the GUI), then the CLI batch size just overrides all that.
[21:25:35] urrrgh
[21:26:12] heehee, what fun
[21:26:26] yeah, drush help rg gave me that example
[21:26:45] oof
[21:26:51] can't trust it
[21:27:00] ok, checking the code!
[21:29:16] argh, my ssh-agent is a zombie
[21:29:36] that is so three days ago
[21:29:57] ejegg: yep, --batch
[21:30:06] and you probably want "drush -v -v"
[21:30:20] to print all info-severity messages
[21:30:24] good call, thanks!
[21:30:41] thanks for fixing that!
[21:31:33] ooh, did a batch of 1 and got failure. not sure if we're fixed just yet. lemme decode all this
[21:33:12] Strange, I thought it was the auditor that was choking?
[21:33:21] oh, you're testing that we get the real ct_id
[21:34:05] yeha
[21:34:06] yeah
[21:34:39] hmm, that one seems to have worked right on our side, just got a 'Not authorized' result back from GC
[21:34:51] suppose i'll try another
[21:35:48] and... Card expired
[21:35:50] heh
[21:36:14] baa
[21:36:57] Not auth again! is it just catching all the ones that failed earlier and retrying?
[21:36:58] ejegg: you know I ran into the same thing last week, the issue is that the rgc job also has this annoying workflow, where it tries to charge batch# of previously successful subscriptions, then batch# of previously failed
[21:37:05] so yeah what u said :)
[21:37:17] except you figured it out about an hour faster than I did
[21:37:54] heh.
Was just hoping all our recurring donors didn't turn deadbeat at once
[21:38:38] yeah I was testing the DI unfork, so... my anxiety was that I had converted all cards to deadbeat
[21:39:07] ok, got 4 successes!
[21:39:20] let's see what's in the comms log
[21:39:32] oof. Lemme know if that still works
[21:49:06] awight: yeah, not finding the XML for those in any of the spots I've looked so far. But the arrays that do make it into the logs have legit ct ids in them
[21:50:20] ejegg: was the old RGC processor really dumping into the comm logs? I thought those were only coming from payments boxes.
[21:50:32] actually, probably not
[21:50:53] but I just found the bit of the string GC sends back that corresponds to merchant id
[21:51:07] excellent...
[21:55:15] OK, looks like they've got legit IDs. Turning the job back on
[21:55:56] !log enabled recurring globalcollect processor
[21:56:02] Logged the message, Master
[22:27:07] (PS1) AndyRussG: WIP API for campaign choices [extensions/CentralNotice] - https://gerrit.wikimedia.org/r/170843
[22:32:51] AndyRussG: ejegg: https://plus.google.com/hangouts/_/wikimedia.org/fr-tech-daily
[22:32:54] if ya want
[22:33:17] One sec :)
[22:48:01] (PS3) Awight: WIP Customized LYBUNT report [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/170268
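For reference, the recurring-job batch semantics described at 21:25:31 and 21:36:58 can be modelled roughly as follows. This is an illustrative Python sketch, not the module's code: the function names are hypothetical, and capping the default by the maximum when no CLI value is given is an assumption (the log only says the CLI value overrides both GUI settings).

```python
def resolve_batch_size(cli_batch, default_batch, max_batch):
    """Pick the effective batch size.

    A --batch value from the command line overrides both GUI settings.
    Without it, the default is used, capped by the maximum (assumed).
    """
    if cli_batch is not None:
        return cli_batch
    return min(default_batch, max_batch)


def pick_subscriptions(previously_ok, previously_failed, batch):
    """Take up to `batch` previously successful subscriptions, then up to
    `batch` previously failed ones, as the job was described doing."""
    return previously_ok[:batch] + previously_failed[:batch]
```

Under this model a run with a batch of 1 attempts up to two charges, one from each group, which would fit with seeing retries of earlier failures during a small test run.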