[16:31:56] dstrine: hi! There are some updates on the CN front. Generally good news... We should deploy today the shorter lag time on loader errors, and it seems there is an easy way to quickly purge any terribly horrible banners from the cache, if need be. I'll test that today
[16:32:24] Also the thingy that was supposed to send an impression error when there's an error may be working properly after all
[16:33:33] However, I have had some laptop issues this weekend... Specifically the hard disk on this loaner laptop seems to be having the same issue as the other one. I/O errors on the SSD. So I'm running from a live-boot USB stick and am trying to back stuff up before the thing fails completely.
[16:34:06] I'm not sure it's as severe as what I had on the other laptop, but I decided to back up ASAP just in case...
[16:34:31] I did recoup my SSH keys and all, so I haven't lost any production access. Just inconvenience of running like this...
[16:34:39] fr-tech ^ fyi
[16:34:42] Also, hi!!!
[16:38:05] AndyRussG: great about CN, bummer about laptop!
[16:47:59] cwd: yeah, contrary to most, I do like this model of laptop, except for the constantly failing disks...
[16:48:58] I'm just setting up a backup system for non-work-related stuff w/ DO now
[16:49:26] Fundraising-Backlog, FR-Ingenico, Spike: spike: investigate creating an ingenico form with no city and state - https://phabricator.wikimedia.org/T151769#2827507 (DStrine)
[16:49:41] I wonder what is an acceptable backup solution for work stuff (like notes and CSVs pulled from prod) that might possibly have donor data
[16:49:59] I guess a local encrypted disk is the only one
[16:51:07] maybe an encrypted file
[16:53:38] but even encrypted do we want to store it anywhere not under direct physical control?
[16:56:36] AndyRussG: cwd ejegg XenoRyet soo last week we did not want to release the "shorter lag time on loader errors" what changed? tomorrow is big english.
Is this safe
[16:57:16] dstrine: Last week we decided against releasing the core changes
[16:57:33] I think
[16:57:50] Weren't we going to check with ops about the reduced cache?
[16:57:57] ejegg: I thought we decided against all of it. I just want all of you to be ok with anything that changes today
[16:58:36] ejegg: dstrine: we did check w/ ops about the reduced cache. All good. So we should deploy that todayz
[16:58:46] I'm not worried about the code changes involved, as long as ops is cool with the cache implications
[16:58:59] yep! confirmed w/ bblack :)
[16:59:02] it was a pretty small and easy to understand bit of code
[16:59:17] unlike the core caching bits
[16:59:22] yup
[16:59:39] Re: core changes, we need to check that the revert patch was sent and gets merged before tomorrow's train
[16:59:39] ok cool. that's great. AndyRussG what is the expected lag time now?
[16:59:55] dstrine: after deploy, 2 min since last error, in case of error
[16:59:58] AndyRussG: sounds like you have a pretty bad case of gremlins gnawing at your hardware
[17:00:20] ejegg: aaaaaarg! ;p
[17:00:33] I wish I knew where the gremlin nursery lay
[17:01:02] AndyRussG: thanks and good luck with your computer. Please email OIT and Katie the second you think you need new hardware
[17:03:29] dstrine: yeah... At least I should try to pick up a new box at AllHands
[17:03:47] But I guess if this really fails now, working off the USB stick until then isn't an option....
[17:18:07] dstrine: should I go to the FR standup at 10 am PST?
[17:25:16] Fundraising Dash, Fundraising-Backlog: Use URL fragments to allow linking to specific boards or queries - https://phabricator.wikimedia.org/T151772#2827630 (Ejegg)
[17:25:48] Fundraising Dash, Fundraising-Backlog: Use URL fragments to allow linking to specific boards or queries - https://phabricator.wikimedia.org/T151772#2827654 (Ejegg) Open>Invalid
[17:39:42] fr-tech the procedure for purging a banner from Varnish is pretty easy.
I'm going to test it later today and will put instructions on our emergency procedures page. Anyone with deployer rights on general WMF production can do it
[17:42:35] fr-tech: dstrine: I'm thinking an extra "FR starting tomorrow, don't send anything dicey to production" reminder to wmfall might be useful
[17:43:20] I could suggest it to K4 at check-in today
[18:00:02] fr-tech: Gleemites, n.:
[18:00:02] Petrified deposits of toothpaste found in sinks.
[18:00:03] -- Rich Hall & Friends, "Sniglets"
[18:00:03] -- discuss.
[18:05:41] fr-tech anybody want to discuss thangs?
[18:05:54] I'm in the fr standup
[18:06:11] It ends in 30 min, just thought I'd sit in in case CN questions
[18:06:54] ooh, I should tune in
[18:08:36] huh, I can't see it anywhere
[18:08:46] back in a bit...
[18:08:49] ejegg: weird I thought you were on it
[18:08:55] I just sent it
[18:09:02] thanks!
[18:09:03] i have to finish some house work before 1on1 with k4 in 20
[18:09:16] today is different than the daily big english meeting
[18:24:47] (PS4) Ejegg: WIP rename 'zip' to 'postal_code' [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/320267
[18:24:58] (CR) jenkins-bot: [V: -1] WIP rename 'zip' to 'postal_code' [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/320267 (owner: Ejegg)
[18:28:09] mmm that was fun
[18:30:18] fr-tech I could do with some ideas on how to really find out if T149107 works or not. I guess some kind of artificial error generation on beta cluster might do it. And also maybe some brainstorming for T141918...
[18:30:19] T149107: CentralNotice: Relay banner loading issues in beacon/impression - https://phabricator.wikimedia.org/T149107
[18:30:19] T141918: Spike: Prioritized checklist of pre-December CentralNotice and related essentials - https://phabricator.wikimedia.org/T141918
[18:30:38] Those are the only things I'd have for tt, also IRC is fine
[18:31:38] AndyRussG: I'm up for a checklist brainstorm. joining...
[18:31:53] hangout, that is
[18:33:30] ejegg: cool thx! mm there in one sec
[18:37:17] ejegg: there now
[18:39:46] https://etherpad.wikimedia.org/p/fr-tech_2016_checklist_notes
[18:40:41] fr-tech: I'm videoless, but is there a meeting happening?
[18:41:38] awight: ejegg said he might join but mebbe he had issues.... Here is all I was thinking of chatting about
[18:41:50] could do with some ideas on how to really find out if T149107 works or not. I guess some kind of artificial error generation on beta cluster might do it. And also maybe some brainstorming for T141918...
[18:41:51] T149107: CentralNotice: Relay banner loading issues in beacon/impression - https://phabricator.wikimedia.org/T149107
[18:41:51] T141918: Spike: Prioritized checklist of pre-December CentralNotice and related essentials - https://phabricator.wikimedia.org/T141918
[18:45:08] AndyRussG: hey ah... I hear that we're reducing the varnish cache expiry on bannerloader from 10 to 2 minutes?
[18:46:37] Fundraising-Backlog: New thank you email for big english - https://phabricator.wikimedia.org/T151784#2828076 (DStrine)
[18:46:46] awight: yep. All cleared w/ opsen. Gotta book a SWAT deploy
[18:46:50] mmm
[18:46:55] I'd like to ask a few questions
[18:46:57] don't stop or anything
[18:47:03] ah pls go ahead!
[18:47:08] but, how are we evaluating whether this is safe?
[18:47:19] Is there a window defined in which we'll be making that decision?
[18:47:31] I guess rollback is simple and risk-free?
[18:48:15] awight: which possible area of unsafeness are u worried about?
[18:48:21] also curious whether our stakeholders are pushing hard for this or if it's just "because we can"
[18:48:41] I'm very wary of us finding unknown surprises in the Varnish layer
[18:48:58] e.g. there could be other time-based things that conflict with this caching
[18:48:59] WRT prod infrastructure, bblack thinks it's fine.
Also since it's controlled by config, it's easy just to put it up to 10 min almost immediately
[18:49:09] and also, this is a lot (5x) extra load on application servers
[18:49:10] hmmm what do u mean?
[18:49:27] like, if there's another expiry that we aren't aware of, that caches things for 5 minutes
[18:49:32] well, with any luck it's only one extra request
[18:50:02] I just want to make sure we have monitoring open
[18:50:02] if the 2 minutes is long enough to let the database catch up
[18:50:13] so like, watching ganglia graphs and so on
[18:50:31] and a specific time at which we're going to make the call that this is safe or not to leave up for BE
[18:50:39] ejegg: I think awight is thinking of some other unforeseen error that might not stop happening spontaneously?
[18:51:02] ah right
[18:52:23] Still likely not a big load. Basically instead of (1 request) x (possible permutations of banner loader URL) x (Varnish cache segments or whatever) / 10 minutes
[18:52:40] it'd be (5 requests) x (possible permutations of banner loader URL) x (Varnish cache segments or whatever) / 10 minutes
[18:52:43] maximum
[18:52:49] In the case of an ongoing error
[18:52:55] Still doesn't sound very risky to me
[18:53:14] I don't expect that to be a huge deal either, but want to make sure we have the window open that tells us exactly how many of these requests we're serving
[18:56:13] awight: right... Maybe let's peek at the CN debug log? Or the beacon/impression logs with error status?
[18:56:26] kk
[18:56:38] Didja see what I said the other day about my theory of why that may actually be working as advertised BTW?
[18:56:41] also wondering if this is just a nice to have, or if fr-online really wants it?
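The load estimate from 18:52 can be sketched as a small calculation. This is only an illustration of the arithmetic discussed above, assuming each cached object is refetched at most once per expiry during an ongoing error. The function name and the counts of URL permutations and cache segments are hypothetical placeholders, not measured values.

```python
def max_backend_requests(ttl_minutes, url_permutations, cache_segments,
                         window_minutes=10):
    """Upper bound on banner loader requests that reach the application
    servers in one window while an error is ongoing.

    Assumes one refetch per expiry for each (URL permutation, cache segment).
    """
    # Ceiling division: how many times an object can expire in the window.
    refreshes_per_window = -(-window_minutes // ttl_minutes)
    return refreshes_per_window * url_permutations * cache_segments


# Hypothetical counts, chosen only to show the ratio between the two settings.
before = max_backend_requests(10, url_permutations=200, cache_segments=8)
after = max_backend_requests(2, url_permutations=200, cache_segments=8)
print(before, after, after / before)
```

Whatever the real counts are, the ratio between the 2 minute and 10 minute settings stays at 5, which matches the "5x" figure quoted in the discussion.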
[18:56:45] Also, quoting bblack: "[20:37:29] I can give you a conceptual "+1 this seems like a fair bandaid for now until we fix everything related in a better way", but I have no business reviewing the actual PHP code :)"
[18:57:03] Fundraising-Backlog, MediaWiki-extensions-DonationInterface: Ingenico: stop calling SET_PAYMENT when GET_ORDERSTATUS returns 25 - https://phabricator.wikimedia.org/T151788#2828151 (Ejegg)
[18:57:11] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20161123.txt
[18:58:03] awight: I can't see how it wouldn't be an improvement
[18:59:09] but if u feel it's also better not to deploy, out of caution, I'm also willing to go that route, fer sures
[18:59:41] I'm... trying to inject a large dose of caution but without sabotaging whatever is already in the works
[19:00:12] it does feel like "asking for it"
[19:00:27] if the stakeholders aren't dying for this, IMO it just means that the scales are tipped in the direction of "do nothing"
[19:00:49] Sounds like you're prepared though and will own it if it breaks
[19:01:04] I'll help figure out the monitoring, if that's helpful
[19:01:42] heheh I thought the due diligence done was enuf but maybe I'm wrong
[19:04:19] I'm happy to not filibuster this. just * let's choose a time at which we decide whether to keep the change and * set up a few ways of verifying that the change isn't doing bad things
[19:07:18] awight: that sounds eminently reasonable... What specific time would u suggest? BW starts at 6 am PST morrow
[19:09:08] https://etherpad.wikimedia.org/p/fr-tech_2016_checklist_notes
[19:09:21] Anyone remember who owns GeoIP?
[19:09:35] AndyRussG: that's all done in Varnish now, right?
[19:09:41] So I guess ops?
[19:11:05] yah it's ops, though nobody is really owning it. I think we're the de facto owners.
[19:12:29] so how do we detect failure trends?
[19:13:06] missing banner impressions?
[19:13:06] in geoip?
[19:13:12] yeah
[19:13:51] good question.
I guess we have to rely on whatever we receive back in CN requests?
[19:13:57] yeah that's the other alert I was thinking of trying to speed-engineer... pageview vs impression correlations
[19:14:08] +1
[19:14:32] too bad about the lack of an operational database for pageview counts
[19:14:42] so, if GeoIP is on the fritz, the page view tagging would also be off, right?
[19:14:52] ah
[19:14:56] that sounds right
[19:15:07] but we should check the source code for kafka -> webrequests
[19:15:52] yeah dunno if the geoiping of webrequests or pageviews on Hive goes through the same GeoIP process as what we get on the client side
[19:16:10] Maybe sudden changes in pageviews for one country or another would be a signal to watch for?
[19:16:11] perhaps https://github.com/wikimedia/analytics-refinery/tree/master/oozie/webrequest/load
[19:16:59] awight: what is that?
[19:17:06] gotta afk a bit, back in a few!
[19:17:14] I think that's where the kafka->webrequest pipeline is defined
[19:18:49] calls a geocoded_data(ip) function in https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql
[19:19:13] so, looks like that's independent of Varnish's geolocation
[19:21:13] *darn
[19:21:41] maybe we can compare the geocoded_data column with country= param in that case
[19:23:33] Fundraising-Backlog, MediaWiki-extensions-DonationInterface: Ingenico: stop calling SET_PAYMENT when GET_ORDERSTATUS returns 25 - https://phabricator.wikimedia.org/T151788#2828297 (Ejegg)
[19:32:09] Fundraising-Backlog, MediaWiki-extensions-DonationInterface: Ingenico: stop calling SET_PAYMENT when GET_ORDERSTATUS returns 25 - https://phabricator.wikimedia.org/T151788#2828151 (awight) This is looking like more of a problem, now that I see the code. Doesn't this mean that we send a "failed" final st...
[19:41:43] cwd: How were the results of that mobile test from last week, btw?
[19:41:55] Did the new styles at least match existing stuff?
[19:43:31] right, let's see about that
[19:48:19] awight: good question
[19:48:44] i don't know where to find that data
[19:49:28] yah it currently sucks
[19:49:47] the-wub: ^ did you happen to pull numbers for last week's mobile styles test?
[19:49:50] google docs?
[19:50:32] Fundraising-Backlog, MediaWiki-extensions-DonationInterface: Update Adyen iframe css to match ingenico - https://phabricator.wikimedia.org/T151795#2828393 (Ejegg)
[19:51:34] awight: I did a quick spot check and results looked similar. need to get some more data though.
[19:52:28] thanks!
[19:54:59] meganhernandez: hi! Would you pls drop a link to a recent banner results stats spreadsheet?
[19:56:13] sure awight https://docs.google.com/spreadsheets/d/10sHKj7-aiVFm-fXHhOsOYk1pYl0HqMNqFlhiLnNZzBQ/edit#gid=1154341655
[19:58:15] Fundraising-Backlog, MediaWiki-extensions-DonationInterface: Ingenico: stop calling SET_PAYMENT when GET_ORDERSTATUS returns 25 - https://phabricator.wikimedia.org/T151788#2828427 (Ejegg) yep, and worse, it seems to retry the SET_PAYMENT call 3 times
[19:58:45] meganhernandez: ty!
[19:58:58] just asked for permission...
[20:03:25] cwd: sweet. Here's an example of the command fr-online runs to pull stats now: ./statler B1617_1128_en6C_dsk_p1_lg_txt -s 2016-11-28
[20:03:42] cwd: oh. cd to :/srv/br
[20:08:05] oh fancy
[20:12:46] awight: access denied...did you export mysql creds?
[20:19:08] meganhernandez: sorry to bug you, but could we also please see the corresponding sheet for mobile?
[20:21:37] sure, ejegg
[20:21:38] https://docs.google.com/spreadsheets/d/18c3NcChAmtBI14EhklOBxWqy0mMPUBXK3LscN5Ygubg/edit#gid=1447771042
[20:23:24] thanks!
[20:43:21] fr-tech: K4 requested holding off on the deploy to change the cache times until the BE lull, in a couple weeks
[20:43:42] ah, cool
[20:43:47] yeah
[20:44:02] sounds wise
[20:44:14] especially if we have a cache flush strategy
[20:44:23] * AndyRussG tries to find where he left lost sense of prudence
[20:44:35] yep...
Still would like to test that...
[20:53:01] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, MediaWiki-extensions-DonationInterface, Unplanned-Sprint-Work: Ingenico: stop calling SET_PAYMENT when GET_ORDERSTATUS returns 25 - https://phabricator.wikimedia.org/T151788#2828151 (DStrine)
[21:01:57] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, Unplanned-Sprint-Work: New thank you email for big english - https://phabricator.wikimedia.org/T151784#2828806 (DStrine) p:Triage>High
[21:13:35] Fundraising-Backlog, MediaWiki-extensions-DonationInterface: Determine impact of mobile CSS changes - https://phabricator.wikimedia.org/T151815#2828861 (Ejegg)
[21:19:43] Fundraising-Backlog: Generate statistics of errors vs donation attempts - https://phabricator.wikimedia.org/T151817#2828899 (Ejegg)
[21:55:45] (PS1) Ejegg: Allow weighting Minfraud response [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/323973 (https://phabricator.wikimedia.org/T151221)
[21:59:06] fr-tech: I'm grabbing https://phabricator.wikimedia.org/T151784 unless I'm late to the party?
[22:00:15] all you homie
[22:00:31] thx!
[22:02:31] fundraising-tech-ops: Yubikey for Thea Skaff, fundraising consultant - https://phabricator.wikimedia.org/T149839#2829105 (Jgreen) Open>Resolved
[22:02:35] Fundraising Sprint Waiting for Godot, Fundraising-Backlog, Unplanned-Sprint-Work: New thank you email for big english - https://phabricator.wikimedia.org/T151784#2829106 (awight) a:awight
[22:08:37] (CR) Cdentinger: [C: 2] Allow weighting Minfraud response [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/323973 (https://phabricator.wikimedia.org/T151221) (owner: Ejegg)
[22:12:16] (Merged) jenkins-bot: Allow weighting Minfraud response [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/323973 (https://phabricator.wikimedia.org/T151221) (owner: Ejegg)
[22:21:43] (CR) Awight: Allow weighting Minfraud response (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/323973 (https://phabricator.wikimedia.org/T151221) (owner: Ejegg)
[22:26:14] (CR) Ejegg: Allow weighting Minfraud response (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/323973 (https://phabricator.wikimedia.org/T151221) (owner: Ejegg)
[22:28:27] (CR) Awight: Allow weighting Minfraud response (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/323973 (https://phabricator.wikimedia.org/T151221) (owner: Ejegg)
[22:33:04] (PS1) Ejegg: Make minfraud weighting more readable [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/323978
[22:35:14] awight: ^^
[22:36:14] (CR) Awight: [C: 2] "Thanks!"
(1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/323978 (owner: Ejegg)
[22:39:24] (Merged) jenkins-bot: Make minfraud weighting more readable [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/323978 (owner: Ejegg)
[22:40:58] (PS1) Ejegg: Merge branch 'master' into deployment [extensions/DonationInterface] (deployment) - https://gerrit.wikimedia.org/r/323979
[22:41:13] (CR) Ejegg: [C: 2] Merge branch 'master' into deployment [extensions/DonationInterface] (deployment) - https://gerrit.wikimedia.org/r/323979 (owner: Ejegg)
[22:42:07] (Merged) jenkins-bot: Merge branch 'master' into deployment [extensions/DonationInterface] (deployment) - https://gerrit.wikimedia.org/r/323979 (owner: Ejegg)
[22:44:27] (PS1) Ejegg: Update DonationInterface submodule [core] (fundraising/REL1_27) - https://gerrit.wikimedia.org/r/323980
[22:45:39] (CR) Ejegg: [C: 2] Update DonationInterface submodule [core] (fundraising/REL1_27) - https://gerrit.wikimedia.org/r/323980 (owner: Ejegg)
[22:49:10] !log disable job Project Dedupe CiviCRM contacts (name-match)
[22:49:11] (this job was effectively finished for now as it had run a tighter conflict resolution set through our earlier DB entries for the full name matches)
[22:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:20] dstrine is this job obsolete now https://phabricator.wikimedia.org/T117462 - it seems out of date
[22:52:40] (was just searching for a ticket & thought I could close that)
[22:52:47] lol yeah
[22:53:01] Fundraising-Backlog: CiviCRM contact dedupe--get to a point where we can let people key in merge resolutions - https://phabricator.wikimedia.org/T117462#2829255 (DStrine) Open>Invalid
[22:53:04] I closed it
[22:56:27] (Merged) jenkins-bot: Update DonationInterface submodule [core] (fundraising/REL1_27) - https://gerrit.wikimedia.org/r/323980 (owner: Ejegg)
[23:08:11] !log updated payments-wiki from d7ed14407aa7be9a790778cae644c2b320bb7aa4 to
bd8012ce876db59142e28bf6f6e4a2bd549f4481
[23:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:02] ugh. thank-you letter update is kicking my butt.
[23:27:59] awight: dang, what's the complication?
[23:28:20] getting all kinds of static from the template generation steps...
[23:29:43] MBeat: ^ just a warning, we're trying to roll out a new thank-you letter today but there are glitches making it take longer than I hoped.
[23:29:55] We'll need to sanity check the letters once I do manage to deploy
[23:29:58] cool, ty for headsup
[23:30:01] want me to try locally?
[23:30:33] ejegg: If you wish. But I haven't pushed the intermediate content yet. Lemme go to a sandbox
[23:31:09] fr-tech I'm thinking that of the CN things that need to be worked on before tomorrow, the most urgent is ensuring the big MessageCache change doesn't go out on tomorrow's train, amirite?
[23:31:21] +1!
[23:31:33] IIRC, the conclusion of last week's meeting was, revert https://gerrit.wikimedia.org/r/#/c/318488/ before the branch is cut?
[23:32:08] That's what I remember
[23:32:11] oh
[23:32:15] (PS1) Eileen: Update thank you letter. [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/323988 (https://phabricator.wikimedia.org/T151814)
[23:32:19] actually, that the branch was already cut, so we have to revert there.
[23:32:40] awight: huh! really! why is that, I wonder?
[23:32:51] I guess I should talk to releng?
[23:33:02] just cos once the branch is cut, it isn't automatically merged from master any more
[23:33:08] * AndyRussG googles mw train
[23:33:15] Yeah, but why is it already cut?
[23:33:30] Don't people do things at the last minute around here anymore?
[23:34:08] ejegg: https://meta.wikimedia.org/wiki/User:Awight_%28WMF%29/Sandbox_TY_20161128
[23:34:11] lol
[23:35:05] thanks!
[23:35:08] Oh woops, Meta is in group 1, not group 0, so it'd go out Wednesday
[23:35:13] Why was I so sure it was group 0
[23:35:22] totally, I was just gonna say
[23:35:35] * AndyRussG douses coffee directly on brain via ear
[23:35:40] and it looks like wmf.4 isn't cut yet
[23:35:45] weeeird
[23:35:52] but sane
[23:35:54] hmm
[23:35:57] actually ignore me
[23:36:04] I have no idea if wmf.4 is hiding somewhere
[23:36:04] want some coffee?
[23:36:09] I need to start some time
[23:39:17] ejegg: sorry, that sandbox location is not acting nice.
[23:39:22] I copied to a local wiki
[23:39:23] oh?
[23:39:28] WD make-thank-you: en -- Composite title object Sandbox_TY_20161128/en was malformed [info]
[23:39:31] barf
[23:39:40] k. I need to make a non-work call, back shortly.