[00:00:42] YuviKTM: thanks - but I'll rename everything and push it tomorrow (that way I can also try out gerrit for the first time ...) [00:03:22] YuviKTM: It seems tools-static is not using utf-8 [00:03:25] http://cdnjs.cloudflare.com/ajax/libs/riot/2.0.14/riot.min.js [00:03:28] http://tools-static.wmflabs.org/cdnjs/ajax/libs/riot/2.0.14/riot.min.js [00:03:38] "￰" vs "ï¿°" [00:03:58] charset content-type headers [00:27:47] sitic: ah cool :) [00:27:50] Krinkle: uh oh [00:28:29] Krinkle: wait is it jut not specifying content-type headers or is ther problem deeper? [00:28:52] I think it's just missing "; charset=utf-8" in the Content-Type header [00:29:07] yeah [00:29:23] Adding charset utf-8; in the config should do it [00:29:31] it'll salt the content-type headers based on that [00:29:39] Krinkle: hmm [00:29:41] > charset utf-8; [00:29:44] exists [00:29:55] exists, or is in the config right now? [00:30:05] is in the config [00:30:07] already [00:30:37] static-server.conf.erb [00:37:23] Krinkle: not sure why that’s being ignored. [00:43:28] YuviKTM: something failed when I created the crosswatch tool … "become crosswatch" fails with "no such tool" but I did a ssh logout/login several times [00:43:41] https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup says it exists, but it's not listed on https://tools.wmflabs.org/ [00:43:44] sitic: I think new tool creation is broken... [00:43:45] moment [00:43:53] ah ok [00:44:08] sitic: try now [00:44:34] yep works [00:44:37] thanks [00:44:54] 10Tool-Labs: New tool creation broken - https://phabricator.wikimedia.org/T97740#1251197 (10yuvipanda) 3NEW [00:45:09] 10Tool-Labs: New tool creation broken - https://phabricator.wikimedia.org/T97740#1251204 (10yuvipanda) @scfc do you know where it was applied earlier? [00:45:29] * sitic somehow runs into a tools creaton bug on every second tool he creates [00:49:17] Krinkle: found it. [00:49:21] sitic: heh [00:50:50] Krinkle: fixec [00:50:52] *fixed [00:50:54] can you verify? 
[00:51:53] YuviKTM: Yeah, the header is now fixed. [00:51:57] YuviKTM: The file is still broken though [00:52:00] http://tools-static.wmflabs.org/cdnjs/ajax/libs/riot/2.0.14/riot.min.js?purge [00:52:04] I wonder if that’s the cache [00:52:05] ah [00:52:09] let me see on disk [00:52:34] It should have [00:52:35] n(1)).replace(n(/\\{/g),"ï¿°").replace(n(/\\}/g),"￱");t=f(e,a(e,n(/{/),n(/}/)));return new Function("d","retu [00:53:18] I see a bunch of [00:53:18] "").replace(n(/\\}/g)," [00:53:20] on the server [00:54:12] YuviKTM: Compare to a fetch of http://tools-static.wmflabs.org/cdnjs/ajax/libs/riot/2.0.14/riot.min.js?purge ? [00:54:17] http://cdnjs.cloudflare.com/ajax/libs/riot/2.0.14/riot.min.js * [00:54:47] hmm [00:54:49] I see no diffs [00:55:04] root@tools-static-01:/srv/cdnjs# diff riot.min.js /srv/cdnjs/ajax/libs/riot/2.0.14/riot.min.js [00:55:07] root@tools-static-01:/srv/cdnjs# [00:55:30] I’m not sure what’s happening [00:56:16] * YuviKTM FFF0 is replacement character, isn’t it? [00:57:33] Krinkle: so they arean’t actually valid unicode characters [00:58:28] Krinkle: and cdnjs actually doesn’t set a charset. [00:58:36] Hm.. not on that one [00:58:37] strange [00:58:41] http://i.imgur.com/xznz7HQ.png [00:58:44] There's the differencd [00:58:52] Krinkle: yeah I see that too [00:59:10] Krinkle: but curl https://cdnjs.cloudflare.com/ajax/libs/riot/2.0.14/riot.min.js | less and you’ll see the random chars [00:59:14] what is x-javascript? [00:59:18] good question. [00:59:21] Why are we using that [00:59:23] :S [00:59:38] ah [00:59:43] x-javascript became javascript [00:59:49] became? 
[00:59:59] text/javascript -> application/javascript [01:00:02] Krinkle: https://www.rfc-editor.org/rfc/rfc4329.txt [01:00:11] that's the main transition in the past 25 years in the non-theoretical world [01:00:15] text/javacsript -> application/x-javascript -> application/javascript [01:00:18] true [01:00:22] but nginx seems to send x-javascript by default [01:00:29] Yeah, but only a fool would use that and break half the internet :P [01:01:01] > application/x-javascript js; [01:01:10] But would that cause it? [01:01:11] so that’s the default nginx package... [01:01:15] I don’t think io [01:01:16] *so [01:01:17] JS is UTF-16 technically [01:01:22] hmm [01:01:26] so maybe browsers know that under JS (not x-js) they read it that way [01:01:49] I can’t just set all content to be utf-16 can I? :) [01:01:52] hmm [01:01:59] I don’t know if all valid utf-8 is also valid utf-16 [01:02:16] actually I am pretty sure it isn't [01:02:33] charset=utf-16 is not a thing [01:02:35] won’t you get totally out of whack answers if you are trying to read utf-8 (or utf-16) as the other? [01:02:37] yeah [01:02:43] Anyway, it's transfer encoding, not program encoding [01:02:49] Krinkle: do you know what codepoints those actually are? [01:02:52] Javascript uses UTF-8 internally that's a fact. [01:02:55] UTF16 * [01:02:58] Krinkle: curl tells me they’re FFF0 and FFF1 which aren’t valid codepoints [01:03:13] yeah but browsers shouldn’t be taking what the server tells is utf8 and interpret that as utf-16 [01:03:14] Whether the data is tarnsffered in binary, base64, or utf-8 or anythign else doesn't matter much as long as it matches what the server is sending [01:03:24] yeah [01:03:38] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [01:04:09] so basically curl doesn’t see a difference but browser does [01:05:28] Krinkle: is it just that file? 
[01:05:45] Dunno, it was the first one I looked at when Dan was working with it in -releng [01:05:48] Just passing it along [01:06:33] https://raw.githubusercontent.com/cdnjs/cdnjs/master/ajax/libs/riot/2.0.9/riot.min.js [01:07:19] "ï¿°|￱".charCodeAt(0) [01:07:20] 239 [01:07:20] "ï¿°|￱".charCodeAt(1) [01:07:22] 191 [01:07:23] "ï¿°|￱".charCodeAt(2) [01:07:26] 176 [01:07:27] "ï¿°|￱".charCodeAt(4) [01:07:29] 239 [01:07:31] "ï¿°|￱".charCodeAt(5) [01:07:33] 191 [01:07:35] "ï¿°|￱".charCodeAt(6) [01:07:37] 177 [01:07:39] (from cdnjs) [01:07:41] cloudfare that is [01:07:47] Krinkle: https://raw.githubusercontent.com/cdnjs/cdnjs/master/ajax/libs/riot/2.0.14/riot.min.js [01:08:44] GitHub: [01:08:44] 10Tool-Labs: New tool creation broken - https://phabricator.wikimedia.org/T97740#1251284 (10scfc) IIRC tools-login.eqiad.wmflabs (!= tools-login.wmflabs.org). [01:08:45] "￰".charCodeAt(0) [01:08:45] 65520 [01:08:52] yeah, invalid. [01:09:01] so cdnjs is doing something that nginx / github isn’t doing [01:09:19] "￰".charCodeAt(0) [01:09:19] 65520 [01:09:22] tool labs too [01:09:48] Krinkle: http://tools-static.wmflabs.org/cdnjs/test.js [01:09:56] so that’s the exact same pipeline [01:10:03] that one looks right [01:10:15] so I think the problem is basically source on github != source being served [01:10:17] by cdnjs [01:10:47] well, raw.github is not a good source [01:10:51] not meant for browser evaluation [01:11:25] sure, but the fact that basically that file through any system other than cdnjs puts out invalid unicode makes me believe the file is at fault... 
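The charCodeAt values pasted above (239, 191, 176 from cdnjs versus 65520 from GitHub and tool labs) are consistent with a single character, U+FFF0, whose three UTF-8 bytes are being read one byte per character when no charset is declared. A minimal sketch of that effect, assuming Node.js (`Buffer` and the `latin1` encoding name are Node's; the variable names are only illustrative). Browsers typically fall back to windows-1252 rather than Latin-1, but the two agree on these three byte values:

```javascript
// U+FFF0 is the character that riot.min.js contains as a literal.
const marker = "\uFFF0";

// Encoded as UTF-8 it becomes three bytes: EF BF B0.
const utf8Bytes = Buffer.from(marker, "utf8");

// Decoding those same bytes as Latin-1 (one byte = one character)
// produces the three-character mojibake seen in the browser.
const misread = utf8Bytes.toString("latin1");

// Code units of the misread string and of the intended character.
const misreadCodes = Array.from(misread, (ch) => ch.charCodeAt(0));
const markerCode = marker.charCodeAt(0);

console.log([...utf8Bytes]); // [ 239, 191, 176 ]
console.log(misreadCodes);   // [ 239, 191, 176 ]
console.log(markerCode);     // 65520
```

So the bytes on disk can be identical everywhere (which matches the empty diff) while the browser still shows different text depending on the charset it assumes.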
[01:11:37] Krinkle: so I’m going to make it application/javascript now [01:11:40] and let’s see what happens [01:11:58] https://tools-static.wmflabs.org/cdnjs/test.js renders fine however [01:12:12] and has application/x-javascript; charset=utf-8 [01:12:12] through nginx [01:12:27] Krinkle: http://tools-static.wmflabs.org/cdnjs/ajax/libs/riot/2.0.14/riot.min.js?purged [01:12:40] Krinkle: so that’s application/javascript [01:12:52] and nginx automagically stopped sending charset=utf-8 there... [01:12:54] (wtf) [01:12:56] but it renders right [01:12:58] (wtf) [01:13:02] :D [01:13:04] Yup [01:13:08] Ok. don't touch it now [01:13:09] actually [01:13:10] yeah [01:13:10] :P [01:13:16] Krinkle: I wonder [01:13:25] Krinkle: if those are actually the chars that are supposed to be there at all? [01:13:41] Krinkle: I don’t see much other non-latin unicode in there [01:13:43] also. [01:13:43] Haha [01:13:44] WTF [01:13:52] You're saying the invalid char is the one intended [01:13:57] Possible [01:14:00] yeah [01:14:06] like they are using it as a marker or something [01:14:12] which doesn’t sound like the best of ideas. [01:14:26] however, besides all that [01:14:33] wtf! 
[01:15:01] Krinkle: I wonder if this is a bug report for cdnjs [01:15:41] Krinkle: hahaa [01:15:42] i was right [01:15:46] > .replace(brackets(/\\{/g), '\uFFF0') [01:15:50] > .replace(brackets(/\\}/g), '\uFFF1') [01:15:54] > // temporarily convert \{ and \} to a non-character [01:15:57] https://github.com/muut/riotjs/blob/d9f82a19a0e527331106763e6ec06d189e139d24/lib/tmpl.js#L88-L89 [01:15:59] Yeah [01:16:10] Wait, those are not literals [01:16:18] I think the minification framework did that [01:16:21] Yeah [01:16:23] replacing literals with actual characters [01:16:30] wow, that was a fun hole :D [01:16:45] definitely a bug for cdnjs ;P [01:16:48] and maybe uglify [01:16:56] https://github.com/muut/riotjs/blob/d9f82a19a0e527331106763e6ec06d189e139d24/riot.min.js [01:16:58] hmm, just cdnjs - I think the JS will be fine when served by us [01:16:59] No [01:17:06] They distribute their own minified version [01:17:07] which is also broken [01:17:09] right [01:17:16] so a bug for riotjs then :) [01:17:19] well [01:17:22] not broken [01:17:27] if theirs has the character literals [01:17:35] Hmm. Maybe [01:17:36] then it works on almost all servers [01:17:39] except cdnjs [01:17:43] which tries to convert it to valid codepoints [01:17:50] Right [01:17:53] And now ours [01:17:58] yeah [01:18:04] so what the fuck is up with application/javascript [01:18:07] actually [01:18:08] hold on [01:18:11] re-add charset [01:18:12] this isn’t even cdnjs [01:18:13] it’s the browsers [01:18:24] but keep non-x js [01:18:29] Krinkle: I can’t - I didn’t remove them. I just switched x-javascript to application/javascript [01:18:34] and nginx automagically decided to remove charset [01:18:44] Hm.. [01:18:58] How did you do the switch? 
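The `'\uFFF0'` escape in tmpl.js and the literal character in riot.min.js are the same string to a JavaScript engine; they differ only as bytes on the wire, which is why the minified file depends on the charset being declared correctly. A rough sketch of turning non-ASCII characters back into escapes so a script stays pure ASCII. The `escapeNonAscii` helper is hypothetical, not part of riot or of any minifier, and blindly rewriting a whole source file like this is only safe where `\uXXXX` escapes are legal (string literals, for instance):

```javascript
// Replace every non-ASCII UTF-16 code unit with a \uXXXX escape.
function escapeNonAscii(source) {
  return source.replace(/[\u0080-\uFFFF]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// A fragment shaped like the minified output: the escapes in this
// literal evaluate to the raw U+FFF0 / U+FFF1 characters.
const minified = 'x.replace(a,"\uFFF0").replace(b,"\uFFF1")';

const asciiSafe = escapeNonAscii(minified);
console.log(asciiSafe); // x.replace(a,"\uFFF0").replace(b,"\uFFF1")
```

The printed line contains a real backslash followed by `uFFF0`, so it reads the same under any ASCII-compatible charset guess.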
[01:19:07] I guess app/js isn't registered in its mime system so it doesn't encode it properly [01:19:11] Krinkle: I just edited mimetype [01:19:20] It might be mapped elsewhere [01:19:24] hmm [01:19:33] charset utf-8; is the code we have in our config [01:19:38] so afaik that just forces it on everything [01:19:38] Right [01:19:44] Well, not images [01:19:56] I know it's obvious, but my point is something handles it. [01:20:01] oooh [01:20:02] yeah [01:20:26] there’s a map [01:21:32] Krinkle: https://github.com/perusio/piwik-nginx/blob/master/koi-utf is the map [01:21:55] Hm.. [01:21:57] Krinkle: http://nginx.org/en/docs/http/ngx_http_charset_module.html#charset_map explains the map [01:22:09] Why would it need its own map [01:22:11] that seems fishy [01:22:17] bah [01:22:19] that’s stupid [01:22:21] ignore that [01:22:28] that’s just for a particular charset map [01:22:29] https://en.wikipedia.org/wiki/KOI8-R [01:22:30] I’m an idiot [01:23:18] > Until version 1.5.4, “application/x-javascript” was used as the default MIME type instead of “application/javascript”. [01:23:54] Well, at least we did more research in 10 minutes than cloudfare ever did [01:24:27] RTFM and LMGTFY goes a long way [01:25:01] :D [01:25:18] Krinkle: now our behavior matches cloudflare tho [01:25:19] let me fix that [01:25:23] Yeah [01:25:27] Krinkle: booom, I think I know what’s happening [01:25:35] application/javascript for some reason doesn’t send the charset [01:25:40] and so your browser *guesses* [01:25:47] Right [01:26:24] I reverted the application/javascript hack [01:26:32] and the browser guessing stops [01:26:36] because it knows it’s utf-8 :D [01:28:01] Krinkle: so I guess ‘solution’ is to file a bug with riotjs? [01:28:07] and maybe cdnjs [01:28:11] and ask cdnjs to set a charset :) [01:28:13] so browsers don’t guess [01:28:52] Yeah [01:29:00] Well, riotjs is probably fine. 
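The documentation line quoted at 01:23:18 appears to come from the `charset_types` directive on the nginx charset module page linked above: `charset utf-8;` is only appended to responses whose MIME type is on that list, and before 1.5.4 the default list named application/x-javascript rather than application/javascript. That would explain why the charset disappeared after the mimetype edit. The following is a toy model of that rule, not nginx code; `contentTypeHeader` and the constant are made up for illustration, and the list assumes the pre-1.5.4 defaults:

```javascript
// Assumed default charset_types for nginx older than 1.5.4.
const CHARSET_TYPES_PRE_1_5_4 = [
  "text/html",
  "text/xml",
  "text/plain",
  "text/vnd.wap.wml",
  "application/x-javascript",
  "application/rss+xml",
];

// Append "; charset=..." only when the MIME type is on the list.
function contentTypeHeader(mimeType, charset, charsetTypes) {
  return charsetTypes.includes(mimeType)
    ? `${mimeType}; charset=${charset}`
    : mimeType;
}

const oldHeader = contentTypeHeader(
  "application/x-javascript", "utf-8", CHARSET_TYPES_PRE_1_5_4);
const newHeader = contentTypeHeader(
  "application/javascript", "utf-8", CHARSET_TYPES_PRE_1_5_4);

console.log(oldHeader); // application/x-javascript; charset=utf-8
console.log(newHeader); // application/javascript
```

If that is what happened here, listing application/javascript in `charset_types` alongside the `charset utf-8;` line would be the thing to try before switching the MIME type again.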
[01:29:18] I dunno, they shoudl probably not be including invalid UTF codepoints in their minified source [01:29:28] Their minifier is upstream for them and I know which the one, they're not changing. [01:29:29] I think ultimately the bug is in the minifier it used [01:29:38] Well, it depends. [01:29:47] if it's destructive then yes. [01:29:58] But if the character there is a valid "invalid" character then it's fine. [01:30:00] it’s not a valid UTF-8 file [01:30:06] Right [01:30:07] it’s not a valid invalid character [01:30:14] because it’s just not defined [01:30:20] Right, so it's not Special:Badtitle, but it's an actual bad title. [01:30:20] there’s the valid invalid character [01:30:22] but this isn’t it [01:30:23] yeah [01:30:26] ok :) [01:30:44] Yeah, that' a minifier bug [01:30:59] should be acceptable upstream [01:31:05] Krinkle: https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Unicode_chart [01:31:17] Krinkle: 0-8 are undefined and invalid [01:31:32] well, undefined as of now [01:31:38] so this isn’t even a UTF-8 issue :) [01:31:41] this is just invalid unicode :) [01:31:46] grr, no quips [01:32:12] Krinkle: I think the riotjs people using these codepoints should also be cluebatted, maybe. but maybe not. [01:32:34] but ‘let me temporarily change this UTF-16 encoded string to contain invalid unicode codepoints’ sounds like asking for trouble [01:33:36] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [01:37:39] Krinkle: can I leave the bug reporting to you? [01:40:26] * Krinkle makes note for later [01:40:27] suer [01:40:28] sure [02:24:52] 10Tool-Labs: New tool creation broken - https://phabricator.wikimedia.org/T97740#1251321 (10yuvipanda) I've added it to tools-submit for now. 
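For reference, the Specials block linked above (U+FFF0 to U+FFFF) splits into a few ranges, and U+FFF0/U+FFF1 land in the unassigned one rather than being the replacement character (U+FFFD) or a noncharacter (U+FFFE, U+FFFF). A small sketch of that classification; `classifySpecial` is a hypothetical helper written for this log, and the Node.js `Buffer` call at the end is only there to show that the bytes EF BF B0 still decode cleanly as UTF-8:

```javascript
// Classify a code point within the Unicode Specials block.
function classifySpecial(codePoint) {
  if (codePoint < 0xfff0 || codePoint > 0xffff) return "outside Specials block";
  if (codePoint <= 0xfff8) return "unassigned";
  if (codePoint <= 0xfffb) return "interlinear annotation";
  if (codePoint === 0xfffc) return "object replacement character";
  if (codePoint === 0xfffd) return "replacement character";
  return "noncharacter"; // U+FFFE and U+FFFF
}

console.log(classifySpecial(0xfff0)); // unassigned
console.log(classifySpecial(0xfff1)); // unassigned
console.log(classifySpecial(0xfffd)); // replacement character

// Node substitutes U+FFFD for malformed UTF-8, so getting U+FFF0 back
// means the byte sequence itself is well-formed.
const decoded = Buffer.from([0xef, 0xbf, 0xb0]).toString("utf8");
console.log(decoded.charCodeAt(0).toString(16)); // fff0
```

In other words the file is well-formed UTF-8 that happens to use code points with no assigned meaning, which fits the "this is just invalid unicode" conclusion better than the "not a valid UTF-8 file" one.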
[03:54:48] !log tools killed final job in tools-exec-20 (9911317), decommissioning node [03:54:54] Logged the message, Master [03:55:48] !log tools depooled and deleted tools-exec-20 [03:55:53] Logged the message, Master [03:57:21] 6Labs, 10Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1251350 (10yuvipanda) Depooled and deleted tools-exec-20 :) So trusty is fully on the new hosts now. I wonder how long we should give the currently running tasks before moving them. [03:59:20] PROBLEM - Host tools-exec-20 is DOWN: CRITICAL - Host Unreachable (10.68.17.251) [04:30:21] 10Tool-Labs, 3ToolLabs-Goals-Q4: Setup a tools checker service that can check all internal services for availability - https://phabricator.wikimedia.org/T97748#1251358 (10yuvipanda) 3NEW [04:49:36] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [04:59:34] RECOVERY - Puppet failure on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [05:31:47] 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Setup a tools checker service that can check all internal services for availability - https://phabricator.wikimedia.org/T97748#1251413 (10Krinkle) [05:34:33] 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Setup a tools checker service that can check all internal services for availability - https://phabricator.wikimedia.org/T97748#1251415 (10yuvipanda) Metrics will be made available as I add them on http://p.catchpoint.com/ui/Entry/PD/V/A.RNP-Ov-jSUbDu8Jdg/Er... 
[06:36:35] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [06:53:20] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [07:01:35] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [07:14:30] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251438 (10Dfko) Still figuring out the significance of this, but there is a failure in a line that is dealing with TTLs in the RQ library registry.py that is resulting in churn from failure queue back onto the failure queue wh... [07:16:10] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251447 (10yuvipanda) Possibly unhelpful suggestion (feel free to ignore) - have you considered using celery for queuing instead? [07:17:30] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251448 (10Dfko) Considered early on and this is making me consider considering it again, though there might not remain enough time in the project for a big move like that. [07:18:20] RECOVERY - Puppet failure on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [07:27:41] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251452 (10Dfko) I was not able to find an explicit TTL format that would not trigger this bug, but not specifying it at all seems not to trigger it. I am going to try to start things up again and see if it works that way. [07:35:11] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251457 (10valhallasw) How are you defining the TTL currently? The error suggests that result_ttl passed by worker.py:572 is a str instead of an int. As for not defining a TTL: that should use https://github.com/nvie/rq/blob/5... 
[07:47:57] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251464 (10Dfko) It happens whenever I pass anything via the command line parameter of worker, whether it can be parsed as an int or not, so I am guessing they just forgot to ever call int() on that parameter. Haven't yet gone... [07:50:16] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251465 (10Dfko) I've run for about 20 minutes, seems like queue has plateaued at about 2 gigs and is stable. Will keep on keeping an eye on it for a bit longer. [07:51:17] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251466 (10yuvipanda) The problem is that without TTL you fill up the redis queue causing it to be unusable for other tools as well... [07:53:20] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251467 (10valhallasw) >>! In T91979#1251464, @Dfko wrote: > It happens whenever I pass anything via the command line parameter of worker, whether it can be parsed as an int or not, so I am guessing they just forgot to ever cal... [07:54:46] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251472 (10valhallasw) >>! In T91979#1251466, @yuvipanda wrote: > The problem is that without TTL you fill up the redis queue causing it to > be unusable for other tools as well... If you don't specify a TTL, rq should use the... [07:56:29] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251474 (10Dfko) Do you think the default of 15 minutes will be too long? >20 events per second is rare, so a very conservative estimate of queue load would be 18,000 events in flight. [07:57:56] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251475 (10Dfko) We have workers on the failed queue already. 
We were getting things going from failed -> failed forever due to this but that should be resolved. [08:01:44] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1251476 (10valhallasw) That should be fine; the entries are ~2.5kB each, so 18k events is < 50MB. [08:26:26] (03PS1) 10Krinkle: Implement wildcard features and misc spring cleaning [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/208079 [08:36:47] (03PS2) 10Krinkle: Implement wildcard features and misc spring cleaning [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/208079 [08:38:15] PROBLEM - Puppet staleness on tools-shadow is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:39:19] (03CR) 10Krinkle: [C: 032] Implement wildcard features and misc spring cleaning [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/208079 (owner: 10Krinkle) [08:39:29] (03CR) 10Krinkle: [V: 032] Implement wildcard features and misc spring cleaning [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/208079 (owner: 10Krinkle) [08:40:20] 10Tool-Labs-tools-Global-user-contributions: GUC: Russian wikis have broken url (http:/// instead of http://) - https://phabricator.wikimedia.org/T94351#1251502 (10Krinkle) 5Open>3Resolved https://gerrit.wikimedia.org/r/208079 https://github.com/wikimedia/labs-tools-guc/commit/beb5d5bf [08:40:26] 10Tool-Labs-tools-Global-user-contributions: Global user contributions: Support wildcard in username - https://phabricator.wikimedia.org/T66499#1251504 (10Krinkle) 5Open>3Resolved https://gerrit.wikimedia.org/r/208079 https://github.com/wikimedia/labs-tools-guc/commit/beb5d5bf [10:20:31] PROBLEM - Puppet staleness on tools-exec-catscan is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:24:14] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:31:52] PROBLEM - Puppet staleness on tools-exec-15 is CRITICAL: CRITICAL: 100.00% 
of data above the critical threshold [43200.0] [10:32:16] PROBLEM - Puppet staleness on tools-exec-gift is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:33:12] PROBLEM - Puppet staleness on tools-exec-wmt is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:34:34] PROBLEM - Puppet staleness on tools-exec-13 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:36:08] PROBLEM - Puppet staleness on tools-exec-08 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:36:48] PROBLEM - Puppet staleness on tools-exec-14 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:37:04] PROBLEM - Puppet staleness on tools-exec-07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:37:06] 10Tool-Labs-tools-Global-user-contributions: Global user contributions: Support wildcard in username - https://phabricator.wikimedia.org/T66499#1251553 (10Krinkle) [10:37:07] 10Tool-Labs-tools-Global-user-contributions: GUC: Russian wikis have broken url (http:/// instead of http://) - https://phabricator.wikimedia.org/T94351#1251554 (10Krinkle) [10:40:25] PROBLEM - Puppet staleness on tools-exec-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [11:04:56] 10Tool-Labs, 10Wikimedia-General-or-Unknown: Missing information template links in templatelinks database - https://phabricator.wikimedia.org/T89441#1251575 (10Aschroet) @Jarekt, i checked again this issue and did not see this inconsistencies anymore. Could you please double check? If it really disappeared we... 
[11:33:07] 10Tool-Labs-tools-Global-user-contributions: GUC: incorrectly time edit and no username or IP-address in address bar - https://phabricator.wikimedia.org/T97756#1251592 (10Oleg3280) 3NEW [11:46:16] Sob, the webservice for https://tools.wmflabs.org/mediawiki-mirror/html/ had been down for over a month (no access.log at all) [11:56:37] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1251623 (10d33tah) I managed to set the project up and I will soon need a public IP. By the way, I wanted to ask in advance whether it would be a problem - I decided to correlate entries with rDNS on my own computer and get rid of the rDNS... [12:30:09] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/AS was created, changed by AS link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/AS edit summary: Created page with "{{Tools Access Request |Justification=Analysis of anonymous edits in Ukrainian Wikipedia |Completed=false |User Name=AS }}" [13:06:59] 6Labs, 10Labs-Infrastructure: Add tools-bastion-02 as administrative hosts for sge - https://phabricator.wikimedia.org/T97767#1251767 (10Merl) 3NEW a:3coren [13:37:58] Oh well, all my plans for today are clearly shod. I think I'll do a day of papercuts. 
[13:46:37] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Hanyou23 was created, changed by Hanyou23 link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Hanyou23 edit summary: Created page with "{{Tools Access Request |Justification=Manually adding articles to projects (articles that are not being picked up). |Completed=false |User Name=Hanyou23 }}" [13:50:07] 6Labs: Install OpenStack Horizon for production labs - https://phabricator.wikimedia.org/T87279#1251828 (10Andrew) [13:50:10] 6Labs, 7Design: Fix horizon logo - https://phabricator.wikimedia.org/T91780#1251826 (10Andrew) 5Open>3Resolved I never heard anything from May so I made my own graphics. They're... ok. [14:13:00] (03Abandoned) 10MarkTraceur: Move config into a default file and WMF files [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/98141 (owner: 10MarkTraceur) [14:44:43] 6Labs, 6Analytics-Kanban: LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. {mole} - https://phabricator.wikimedia.org/T76075#1251890 (10mforns) @coren, @milimetric What I observed back in February is that the revision tables in the wiki DBs (analytics-store) are rec... [15:00:23] 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4: Switchover Labs NFS server to labstore1002 - https://phabricator.wikimedia.org/T97219#1251934 (10Cmjohnson) [15:07:32] 6Labs, 10Labs-Infrastructure, 6operations, 10ops-eqiad: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1251945 (10Cmjohnson) Post Error Inlet Ambient Temperature: 17C/62F 207-Memory initialization error on Processor 1 Socket 4. The operating system may not have access to all of the... [15:31:42] andrewbogott: "HP SmartMemory authenticated in all populated DIMM slots". [15:32:07] andrewbogott: Really? DRM on DIMMs now? Inkjet flashback! 
[15:32:39] Oh, it didn’t occur to me that ‘smartmemory’ was DRM, I figured it was just the name of the test software [15:32:51] I guess that’s another vote against HP, huh? [15:33:52] "HP SmartMemory is unique technology introduced for HP ProLiant Gen8 servers that unlocks certain features available only on HP Qualified Server memory." [15:34:13] I.e.: DRM that will cripple your server if you install reasonably-priced ram from a real manufacturer. [15:34:46] damn [15:35:17] I'm pretty sure RobH didn't know about that "feature". [15:35:41] ‘certain features’ on memory? What could a ‘feature’ even be? [15:35:55] Heh. "Full speed to spec?" [15:36:57] "The ability to not insert extraneous clock cycles of delay on DDR bus access?" [15:37:24] Although, right now, that certain feature seems to be "happy fun memory fail" [15:48:13] 6Labs, 10Labs-Infrastructure: Add tools-bastion-02 as administrative hosts for sge - https://phabricator.wikimedia.org/T97767#1252093 (10coren) 5Open>3Resolved [15:48:44] Hm. There's a huge bump in labnet1001 conntracking stats. [16:11:57] 6Labs, 10Labs-Infrastructure, 6operations, 10ops-eqiad: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1252148 (10Cmjohnson) Moved bad DIMM module to processor 2 socket 4 to see if the error will follow the DIMM. After rebooting the error returned to the same socket. Post message b... [16:16:04] Coren: hey ! any update on https://phabricator.wikimedia.org/T97523 ? [16:17:09] tonythomas: No, but now's actually a good time for me to look at it. [16:17:34] Coren: thanks :) [16:20:25] 6Labs, 7Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1252179 (10coren) [16:20:27] 6Labs, 10MediaWiki-extensions-Newsletter, 7Tracking: Create Labs project for Newsletter Extension - https://phabricator.wikimedia.org/T97523#1252177 (10coren) 5Open>3Resolved Project 'newsletter' was created with Tinaj1234 as member and admin. 
[16:20:42] 6Labs, 10MediaWiki-extensions-Newsletter, 7Tracking: Create Labs project for Newsletter Extension - https://phabricator.wikimedia.org/T97523#1252180 (10coren) a:5Tinaj1234>3coren [16:21:03] Coren: thanks :) tinajohnson ^ [16:22:10] Coren: how will I add myself to that project ? should the admin do it ? [16:22:29] Yep; that's how it works. [16:22:49] okey. [16:49:11] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/AS was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=156927 edit summary: [16:49:19] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Hanyou23 was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=156930 edit summary: [17:22:30] 6Labs, 10MediaWiki-extensions-Newsletter, 7Tracking: Create Labs project for Newsletter Extension - https://phabricator.wikimedia.org/T97523#1252420 (10Tinaj1234) Thanks ! [17:30:43] 6Labs, 10Labs-Infrastructure, 6operations, 10ops-eqiad: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1252476 (10Cmjohnson) The cpu changed did nothing 207-Memory initialization error on Processor 1 Socket 4. The operating system may not have access to all of the memory installed i... [17:36:35] (03PS1) 10Florianschmidtwelzow: Remove definitions for #wikimedia-mobile [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/208134 (https://phabricator.wikimedia.org/T97798) [17:36:47] (03PS1) 10Florianschmidtwelzow: Remove definitions for #wikimedia-mobile [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/208135 (https://phabricator.wikimedia.org/T97798) [17:58:12] 10Tool-Labs-tools-Global-user-contributions: GUC: incorrectly time edit and no username or IP-address in address bar - https://phabricator.wikimedia.org/T97756#1252640 (10Krinkle) a:3Krinkle Sorry about that. Escaped the test. Addressing it now. 
[18:02:06] (03PS1) 10Krinkle: Fix regression: Timestamps wrongly use current time [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/208153 (https://phabricator.wikimedia.org/T97756) [18:02:33] (03CR) 10Krinkle: [C: 032 V: 032] Fix regression: Timestamps wrongly use current time [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/208153 (https://phabricator.wikimedia.org/T97756) (owner: 10Krinkle) [18:05:04] anyone else having trouble reaching tools-login via PUTTY? [18:05:56] russblau: what kind of error? [18:06:24] yuvipanda: What's the update interval for cdnjs? [18:06:32] (03CR) 10Merlijn van Deen: [C: 032] Remove definitions for #wikimedia-mobile [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/208135 (https://phabricator.wikimedia.org/T97798) (owner: 10Florianschmidtwelzow) [18:06:37] Looks like jQuery 1.11.3 isn't on there yet [18:06:44] (03Merged) 10jenkins-bot: Remove definitions for #wikimedia-mobile [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/208135 (https://phabricator.wikimedia.org/T97798) (owner: 10Florianschmidtwelzow) [18:07:26] labs-tools-wikibugs2-autopull FAILURE in 0s [18:07:27] ^ [18:07:35] "Couldn't agree on a client-to-server cipher" (followed by long list of abbreviations) [18:08:22] (03CR) 10Merlijn van Deen: [C: 032] Remove definitions for #wikimedia-mobile [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/208134 (https://phabricator.wikimedia.org/T97798) (owner: 10Florianschmidtwelzow) [18:09:37] russblau: ok, so yes, there was a recent change related to this, not allowing old ciphers for security. could you first verify your putty version is current? 
[18:10:09] probably not (0.58) [18:10:41] (Merged) jenkins-bot: Remove definitions for #wikimedia-mobile [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/208134 (https://phabricator.wikimedia.org/T97798) (owner: Florianschmidtwelzow) [18:11:25] russblau: 2015-02-28 PuTTY 0.64 released, fixing a SECURITY HOLE [18:11:39] russblau: wanna try with a newer one? [18:12:20] yuvipanda: Meh, it's an upstream issue. They're not pulling jQuery upstream. https://github.com/cdnjs/cdnjs/issues/4601 [18:13:04] Krinkle: 20 mins [18:13:10] Runs on every puppet run [18:14:39] russblau: and somewhere in settings there is "cipher selection". the ones the server now offers are: chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes19 [18:15:01] and for older distro: Ciphers aes256-ctr,aes192-ctr,aes128-ctr [18:15:44] what the flying pig. "channel 0: open failed: administratively prohibited: open failed" [18:15:46] so if it lets you select one of those that would work [18:15:52] but only if I try to ssh from within my vm [18:17:18] oh, I see [18:17:47] * valhallasw`cloud is doing something stupid [18:19:19] mutante: that did the trick; thanks! [18:20:26] !log tools.lolrrit-wm valhallasw: Deployed 5b7fddd5ce468ced7f3d3ed5061a5a4ae62891a5 Remove definitions for #wikimedia-mobile [18:20:32] Logged the message, Master [18:23:16] Hi [18:23:21] * yuvipanda makes way to office [18:23:45] Coren: should we try again on Monday? [18:23:52] Labstore switch that is [18:43:23] yuvipanda: I don't see why not. "Card not inserted right" isn't the kind of thing that's likely to recur. [18:43:59] It's also extraordinarily insane bad luck, but meh. :-) [18:45:06] Coren: did Dobby put his ears in between the card and the slot? [18:47:12] Coren: yeah :| [18:47:41] Coren: Currently encountering some slow ssh interaction (ssh itself is fast, but tools-bastion taking a long time for basic commands). [18:48:14] it's back up now but weird..
[18:48:23] I'm seeing very high NFS usage atm. [18:48:53] Nothing broken, just high load. Ima check that nobody is going cray-cray. [18:52:42] Coren: I guess it's because my PS1 does a few minor checks on disk unconditionally. [18:53:01] So whenever NFS is not responding, since tool labs is almost entirely NFS, I'm nowhere. [18:53:27] Krinkle: That'd make your shell more sensitive to NFS load, yeah, though I doubt you enter commands fast enough for it to be a factor in the load. :-) [18:53:50] No, but something else taking all the bandwidth means I can't even get git branch or ls to return [18:54:09] Yeah, I see one or two big outliers atm that I am looking into. [18:54:15] cool :) [18:54:22] I hope it's not cvn this time :D [18:54:38] That said, while I find the thing a bit sluggish I'm definitely not seeing anything major. What box are you feeling the hurt on? [18:54:49] tools-bastion-01 [18:55:34] Hm. It's a tool being asocial. [18:55:41] (PS1) Krinkle: Make browser location bar reflect current query [labs/tools/guc] - https://gerrit.wikimedia.org/r/208173 (https://phabricator.wikimedia.org/T97756) [18:56:26] (PS2) Krinkle: Make browser location bar reflect current query [labs/tools/guc] - https://gerrit.wikimedia.org/r/208173 (https://phabricator.wikimedia.org/T97756) [18:56:32] (CR) Krinkle: [C: 2 V: 2] Make browser location bar reflect current query [labs/tools/guc] - https://gerrit.wikimedia.org/r/208173 (https://phabricator.wikimedia.org/T97756) (owner: Krinkle) [18:56:45] Oh, FFS.
sqlite on NFS [18:56:55] Tool-Labs-tools-Global-user-contributions, Patch-For-Review: GUC: incorrectly time edit and no username or IP-address in address bar - https://phabricator.wikimedia.org/T97756#1253058 (Krinkle) Open>Resolved [18:57:09] Tool-Labs-tools-Global-user-contributions, Regression: GUC: incorrectly time edit and no username or IP-address in address bar - https://phabricator.wikimedia.org/T97756#1253060 (Krinkle) p:Triage>High [18:59:30] Ah, but the biggest offender is maps-tiles3 [19:00:39] Woah. That poor thing is getting /hammered/ on. [19:01:55] OruxMaps v.6.0.7 [19:02:51] FFS. There's a mobile app that hammers on a labs service? [19:03:49] Coren: interesting.. this http://www.oruxmaps.com/ ? [19:03:59] mutante: Yeah, that's the one. [19:04:10] do they want to use our OSM tile servers or something? [19:04:29] mutante: The issue isn't that they want to - it's that they apparently already are. [19:05:08] I think I need to loop in legal now. [19:05:35] If that's not an issue, then we need to loop in Damon because that means the tile server will start needing actual metal. [19:06:55] are the labs servers to test for production use later, or for actual use [19:07:21] let me guess, one of those "semi-prod" things? [19:07:41] mutante: That's never been made very clear to me - I think it's meant for "actual" use, but for mostly internal use by the projects and tools and wasn't meant to be scaled like that. [19:08:19] ok, yea, that's the "semi-prod" :p [19:08:24] mutante: you mean the tiles servers? [19:08:27] yea [19:08:54] Yeah afaik those are not on a trajectory to prod. They're set up in labs (not in tool labs btw) because it's community run. [19:08:57] or used to be. [19:09:00] or maybe they just need to throttle it? [19:09:24] mutante: Perhaps. At any rate, I'll let the C-level decide what they want to do about it.
[19:09:25] I don't know if Coren / yuvipanda took it on to maintain it as part of providing Tool Labs the way Toolserver had. [19:09:42] Coren: Do "we" maintain the tiles servers? [19:09:45] Krinkle: We don't. There's been occasional ops support, but it's not "ours" [19:09:48] it's a separate project iirc [19:09:50] Also, there's more than one, right? [19:09:59] I'm still confused why we have two wiki atlas systems [19:10:01] Krinkle: For different aspects, yes. [19:10:16] but separately, there have _also_ been tickets about having "official" tileservers in prod [19:10:17] French Wiki is the worst, they display two buttons in the page [19:10:25] https://fr.wikipedia.org/wiki/Amsterdam [19:10:33] Click the globe vs "carte" [19:10:57] One is "wikiwosm" and one is "WikiMiniAtlas" [19:11:06] Krinkle: heh. I'm going to guess no-one noticed the globe does anything [19:11:24] or, alternatively, internal wiki struggle ending up in a weird compromise [19:11:57] so it just seems to show there is a demand for these servers [19:12:12] Labs: Upgrade Labs Compute nodes to Trusty - https://phabricator.wikimedia.org/T90822#1253133 (Andrew) This is going to happen via migration to newly-imaged boxes. labvirt1001-1006 are already running trusty. Virt1011 is empty and ready to be re-imaged as Trusty. Next I'll cold-migrate everything from virt... [19:12:14] would vote for the "make it real" solution [19:12:19] One loads from https://wma.wmflabs.org/tiles/mapnik/9/322/tile_322_39.png .. the other from https://tiles.wmflabs.org/osm-multilingual/fr,_/12/ .. [19:12:39] :-/ [19:12:50] They should be able to re-use at least the tiles, this is crazy. [19:12:58] Didn’t we have a long Ops project to set up OSM on real hardware? I think Alex worked on it but the puppet manifests never really got in proper shape so the project expired… [19:13:00] different themes are supported afaik? [19:13:06] Unless I’m confusing this with some other Maps project [19:13:20] Yeah, I heard that too.
[19:13:24] andrewbogott: yes, that, thanks! [19:13:43] Though it is complicated by the community not having settled on which tile server and font-style to use. It seems someone forked it. [19:14:06] It’s definitely something we would/should support once we know what ‘it’ is. [19:14:24] valhallasw`cloud: Yeah, and the coordinates lead to something else even further. [19:14:24] Officially, the WMF likes reusers - the issue is mostly technical. The tile server put the tiles on NFS because of storage requirements, but that only scales so far. [19:14:31] That links to geohack. [19:14:52] errr [19:14:55] When can I have my storage space? [19:15:00] Krinkle: actually, those subdomains are from the same project [19:15:03] you will say it's unrealistic, i know, but every once in a while i remember how when we started labs we expected everything in there is only for testing and if it ever gets really used it moves to production, and wasn't really "because it's community-run it needs to be in labs", that was supposed to be solved by puppetizing [19:15:04] :-P [19:15:06] Krinkle: they're both https://wikitech.wikimedia.org/wiki/Nova_Resource:Maps [19:15:19] valhallasw`cloud: Hm.. ok. [19:15:28] just different virtual servers [19:15:51] I guess good (memory and local disk) caching should go a long way towards not loading every request off of NFS [19:16:06] sharding in some way [19:16:08] mutante: That's not what Labs ended up being in practice. :-) [19:16:13] yay! i just finally uploaded the enwiki database to wikispy! [19:16:34] mutante: We can put it with other broken hopes, like an SVG editor (2005) [19:16:59] Looks like someone experimented with varnish [19:17:01] Krinkle: There are technical solutions - a proper object store would fix 99% of the issues.
The question is "do we want to do it in the first place" which the C-level now has to make (I've just emailed) [19:17:02] mutante: andrewbogott so the postgres database is on real hardware, with OSM replication but the tile servers themselves are on labs [19:17:25] mutante: andrewbogott Coren there was a maps team that existed for about two weeks (MaxSem and yurik) and disappeared during the re-org [19:17:30] I think it disappeared, at least. [19:17:35] Coren: You mean whether we'll tolerate third parties hotlinking it for commercial purposes? [19:17:45] yuvipanda: that’s not so bad, seems like it must be a short step to prod then. If anyone works on it. [19:17:53] yuvipanda & Coren - we're not gone [19:17:58] andrewbogott: for some definition of ‘short' [19:18:07] we still work on it. full time [19:18:08] Krinkle: The license allows it. The question is "throttle mechanically, limit, or allow and allocate resources" [19:18:12] I don't think it's an option to not host maps at all. I mean, unless we'd somehow code it into the policy; community will maintain and set it up somehow. [19:18:16] Hot linking and I'm getting shit about running my own server [19:19:10] Coren: Hm.. is our tile content custom in some way or a mirror of OSM's public servers? [19:19:14] if it’s affecting QoS on NFS for rest of users I think we can just block the UA. [19:19:20] I assume the reason we don't use theirs is because of privacy policy right? [19:19:26] Dispenser: You were told, at length, what you could and could not do yet chose to do as you wished. You were also told, in detail, what you needed to do to get a proposal in if you needed resources. You chose not to despite plenty of people willing to help you. [19:19:31] So should be fine for that third party to just use OSM directly.. [19:19:41] Krinkle: That, and also I think we have some custom layers? [19:20:04] Coren: Yeah, though I don't think "they" would use those layers.
[19:20:18] yuvipanda, ehh - does the tileserver store the tiles on NFS? [19:20:28] MaxSem: it does. [19:20:29] it does. [19:20:42] wtf? [19:20:46] and that doesn't scale [19:20:47] Krinkle: I think the point is moot for now - I just poked legal and Damon and they'll set the beat. :-) [19:20:48] that's sooo wrong [19:21:07] nfs scales for size, not for load [19:21:09] MaxSem: I would have thought some object store would be better than the filesystem for sure. [19:21:15] it needs better caching if we're to allow it. [19:21:18] or we could just ask the guys "hey, you need to do this slower"? [19:21:35] Coren, mod_tile supports Rados, in theory [19:21:45] pretty sure nobody used it in prod [19:21:53] mutante: AFAICT, the load isn't coming from them; it's coming from individual app users. [19:21:55] yuvipanda: i think i finished adding the alpha/beta of my website to my instance. i added a firewall exception and now i'd like to make the website accessible to the world. should i request a public ip via "manage addresses"? [19:22:11] Coren: ah, ok [19:22:14] d33tah: nope, just use ‘manage proxies’ on the sidebar on wikitech [19:22:21] Coren: yeah, oruxmaps has a 'download this area' feature, and I'm guessing it has a list of tile servers to try [19:22:28] Coren: so we should block / throttle them at the proxy level, I think [19:22:29] yuvipanda, how much space do the tiles occupy in nfs? [19:22:31] *opens oruxmaps* [19:22:43] Coren: I wrote a proposal with betacommand. Funny how whenever I write something proper I get ignored and I have to pull stunts to get attention. [19:22:52] yuvipanda: Thankfully, there is a clear and unambiguous UA [19:23:17] yuvipanda: "New proxy backend [19:23:17] Hm.. if it's only about the privacy policy and we can phase out our custom layers, we could change it to be a caching proxy for http://*.tile.openstreetmap.org - somewhat like apt/ubuntu.wm.o [19:23:20] ? [19:23:38] with some local on disk caching, distributed to multiple instances.
[19:23:41] oh well, we'll see [19:24:26] d33tah: you can just create it yourself :) [19:24:28] d33tah: no need to request [19:24:33] Krinkle: i thought that was the plan for the production tile servers [19:24:41] Krinkle, caching OSM is a tricky thing. also, you'd need prior permission from their sysadmins to avoid being shot immediately [19:24:48] Krinkle: I hope MaxSem and yurik end up with something nice. [19:25:00] Coren: it’s going through labs proxy, right? tileserver? [19:25:12] aaarg, need to reset keyboard firmware, brb [19:25:24] yuvipanda: I don't know; lemme check. [19:25:48] yuvipanda: Yep. [19:25:51] MaxSem: Sure. Though usage would be less than ours, since it is caching of course, but then again, the world is big, and there's many layers. So there's not gonna be a super high cache hit ratio for a long time. [19:26:06] MaxSem: I suppose it might even be an option to support OSM more officially, like with freenode. [19:26:13] If they're short on resources that is. [19:26:31] Krinkle, we're about to build our own tile cluster [19:26:36] yessss, keyboard reset fixed it. [19:26:44] Coren: give me a UA? I’ll just block them there. [19:26:45] Coren: It's also pissing me off with all the writing (that I have trouble with due to my dyslexia and Asperger's) to get anything done. I have proofs of concept. I have projection calculation. I have alternative uses for data collected. I just can't describe shit longer than a paragraph. [19:26:45] MaxSem: what does that mean? [19:26:49] we can also throttle if needed but we can block. [19:26:58] MaxSem: I mean, mirror OSM yeah? [19:27:10] But without fetching each tile individually, compiling them ourselves? [19:27:13] yuvipanda: "OruxMaps v.6.0.7" [19:27:23] Krinkle, a better mirror, rather [19:27:32] is that the full UA? [19:27:35] Though I expect "OruxMaps.*" is a better thing to check for [19:27:37] Yes. [19:28:01] we gonna have vector tiles, horizontal scalability and lower latency [19:28:34] Hm.
There seems to be more than one app too. [19:29:03] Hah. [19:29:08] http://mobac.sourceforge.net/ [19:29:09] do you guys need our help with patching up this stuff for now? [19:29:14] yuvipanda: i was asking because some wiki pages seemed to suggest that i should ask for resources like IP addresses through official channels. i think i read that somewhere. [19:29:37] d33tah: yeah, you need a public IP for anything other than http, but for http the proxy should do. [19:29:42] d33tah: can you point to which page so I can fix it? [19:29:56] MaxSem: so right now I’m just going to block the UA [19:30:09] MaxSem: I'm keeping an eye on things and Yuvi is about to throttle them, so I think we're good in the short term. [19:30:14] Labs, OpenStreetMap: Block OruxMaps app from hitting labs proxy - https://phabricator.wikimedia.org/T97841#1253184 (yuvipanda) NEW [19:30:18] MaxSem: https://phabricator.wikimedia.org/T97841?workflow=create [19:30:27] I’m actually going to block them :D [19:30:48] yuvipanda, I recommend considering moving tiles to /var from NFS [19:31:09] MaxSem: There isn't enough space on instance disks by an order of magnitude or three. :-) [19:31:10] MaxSem: Someone(TM) should do it. I’ll be happy if you can help, but remember labs instances have at most 160G of storage. [19:31:44] eh, how many terabytes is it taking? [19:31:59] MaxSem: IIRC, about 60 million files using some 3T [19:32:14] holy motherfuck [19:32:33] i think we should add MaxSem to the project [19:32:35] that's an indication of serious, widespread usage [19:32:46] yuvipanda: https://wikitech.wikimedia.org/wiki/Help:Addresses#Request_a_Public_IP_address - that's what i meant [19:33:02] anyway, i managed to set up https://wikispy.wmflabs.org/ [19:33:13] Nice endcap to a truly shitty week.
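The block discussed above keys on the User-Agent header. The actual labs proxy configuration is not shown in this log, so the following is only a rough Python sketch of the idea of matching on the "OruxMaps" prefix rather than the exact version string. `BLOCKED_UA` and `status_for` are hypothetical names, and answering with 503 and passing everything else through as 200 are assumptions of the sketch.

```python
import re

# Hypothetical pattern: match any client whose User-Agent starts with
# "OruxMaps", instead of the exact string "OruxMaps v.6.0.7".
BLOCKED_UA = re.compile(r"^OruxMaps\b")


def status_for(user_agent):
    """Return the HTTP status a proxy-level check might answer with."""
    if BLOCKED_UA.search(user_agent or ""):
        # Assumption: blocked clients are refused before reaching the backend.
        return 503
    # Simplification: every other request is treated as passed through.
    return 200
```

A prefix match like this would keep working when the app ships a new version, which seems to be the reason "OruxMaps.*" was suggested over the full string.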
[19:33:50] d33tah: added an entry [19:34:24] Coren: on the plus side, there’s been a lot less end user disruption :) [19:34:27] yuvipanda: thanks, it's clearer now [19:34:35] we fiddled around with almost all the instances, and nobody complained... [19:34:41] (for the most part) [19:35:02] Still. Now it turns out labvirt1005 is actually DOA... Yay. [19:35:38] heh [19:35:49] anyway, i've got the website now. two things - can i park a domain for it, like wikispy.net? and second - is there anybody i could ask to verify if my rules/privacy policy is okay and whether i need any other special pages to be ok with wikilabs rules? so far these documents are only there to make the Polish government happy. [19:37:12] d33tah: hmm, I think we’d prefer not to have it be parked on a non *.wmflabs.org domain, but I don’t think anyone has asked yet. [19:37:20] Coren, disk space can be very seriously conserved by nuking old metatiles on deeper zoom levels (say, past zoom 13). just kill everything older than a month (if access time is saved, even better) [19:37:57] osm.org garbage collects their metatiles [19:37:58] MaxSem: I've no doubt the whole thing can be improved a lot. :-) [19:38:14] yuvipanda: i understand. so, if i wanted to try that, it would need separate discussion? i'm okay with adding "powered by wikimedia labs" and so on to my website. [19:38:22] it's just about making the address easier to remember. [19:38:40] d33tah: so I think having a *redirect* from wikispy.net to wikispy.wmflabs.org is ok. [19:38:53] d33tah: as for hosting it on the domain itself, I think it would require discussion, yes. [19:39:03] d33tah: open a phab ticket and poke us? we’ll poke legal / ops... [19:39:04] that'd be good, but people are going to share the address they are on anyway [19:39:10] d33tah: yeah, true. [19:39:14] yuvipanda: two separate threads? [19:39:22] d33tah: yeah, one for Privacy policy and one for the domain.
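The metatile cleanup suggested above (drop everything past zoom 13 that is older than a month) can be sketched in Python as a pure selection step. Nothing here reflects the real tile server layout: the `(path, zoom, last_used)` tuples, the names `MIN_ZOOM`, `MAX_AGE_SECONDS` and `expired_metatiles`, and the reading of "a month" as 30 days are all assumptions made for illustration, and the sketch only lists candidates without deleting anything.

```python
import time

MAX_AGE_SECONDS = 30 * 24 * 3600  # assumption: "a month" taken as 30 days
MIN_ZOOM = 13                     # only zoom levels past this are candidates


def expired_metatiles(entries, now=None):
    """Return paths of metatiles that are deep-zoom and stale.

    entries: iterable of (path, zoom, last_used_epoch_seconds) tuples,
    where last_used would ideally be the access time if it is recorded.
    """
    if now is None:
        now = time.time()
    return [
        path
        for path, zoom, last_used in entries
        if zoom > MIN_ZOOM and now - last_used > MAX_AGE_SECONDS
    ]
```

Keeping selection separate from deletion would make it possible to review how much space a given cutoff frees before anything is removed.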
[19:39:26] Coren, and then move it off NFS;) [19:40:16] yuvipanda: sorry for so many questions but - subtask "create wikispy project" where i got the instance? or is it not relevant what i subtask here? [19:40:17] MaxSem: it running on labs != labs ops have to fix it :P [19:40:26] heh [19:40:39] d33tah: doesn’t matter, I think. just create an empty ticket, just add the ‘Labs’ project [19:40:39] well, you want your nfs to work, right? :P [19:40:47] * yuvipanda MaxSem: we are making it, by killing the requestor :P [19:41:08] kekeke [19:41:16] Coren: I’m returning 503 for these guys now. [19:41:36] at the proxy. so load should lower. [19:41:50] Krinkle: How's interactive performance? [19:42:25] I'm not on ssh at the moment, coding atm. [19:42:32] I think we should report all labs prod hosts perf metrics on to graphite.wmflabs.org too [19:42:36] my labs maintenance is done for today [19:43:54] Coren: andrewbogott on the plus side, have you guys seen http://p.catchpoint.com/ui/Entry/PD/V/A.RNP-Ov-jSUbDu8Jdg/ErLK [19:44:20] Labs: Request to review privacy policy and rules - https://phabricator.wikimedia.org/T97844#1253240 (d33tah) NEW a:yuvipanda [19:44:29] yuvipanda: I hadn't, but this is nice. [19:44:29] yuvipanda: :D we should tune that 3.5s response time though [19:44:42] valhallasw`cloud: for home? yeah. [19:44:45] yuvipanda: can nginx/lighttpd do caching for us if we set an Expires: header? [19:45:02] valhallasw`cloud: I think we should fix home in other ways too :D have you looked at the code? :D [19:45:06] yes [19:45:10] Honestly, the landing page by now should be an explanation and pointers to docs and not just straight up be the tools list [19:45:12] that's exactly why I'm looking at nginx ;D [19:45:19] valhallasw`cloud: home (admin project) should also be independent of NFS [19:45:27] what Coren said too. [19:45:55] valhallasw`cloud: I’m also curious why precise lighttpd response time is about 100ms faster.
[19:46:07] we need to recreate webgrid nodes anyway. [19:46:31] Coren: I’m going to recreate the webgrid nodes, but name them tools-webgrid-lighttpd-12xx and -14xx and also tools-webgrid-generic-14xx. [19:46:38] Coren: and make them all larges... [19:47:04] yuvipanda: it's not? it's maybe 15 ms for longer time scales [19:47:15] yuvipanda: but what are the error bars...? [19:47:52] valhallasw`cloud: http://p.catchpoint.com/ui/Entry/PC/V/ARQf-C-D-OvGeBjSUO.H8WMAA [19:47:54] another way of looking at it [19:48:48] yuvipanda: aka it's a boatload of noise [19:48:59] valhallasw`cloud: yeah, needs to run for longer, I guess. [19:49:06] we should let it run for like a week before we try to get anything out of it [19:49:33] yuvipanda: not gonna help [19:49:41] Labs: Discussion: can I park WikiSpy under a separate, simpler domain? - https://phabricator.wikimedia.org/T97846#1253259 (d33tah) NEW a:yuvipanda [19:49:44] we’ll see [19:49:45] yuvipanda: just look at the graphs, they're all over the place [19:49:57] those are 30 minute aggregates there fyi [19:50:01] oh [19:50:03] I didn’t know that [19:50:11] if you make it 5 min it'll get more jumpy :D [19:50:20] yuvipanda: oh, apparently i assigned those tickets to you by mistake [19:50:23] From 04/30/2015 13:35 To 05/01/2015 13:35 ET By 30 Min [19:50:29] i mean this was the default, i hope it's ok [19:50:29] valhallasw`cloud: I think having availability numbers is important, though :) [19:50:34] d33tah: yeah, that’s ok :) [19:50:48] and even this 30 min averaged data is like 100ms plus or minus 3000ms [19:51:13] what are you guys looking for? [19:51:15] can we just add SD content load (ms) to the table? [19:51:49] chasemp: hmm, can we add SD? [19:51:54] sd is?
[19:51:58] standard deviation [19:52:10] yes [19:52:13] should be able to [19:52:23] the footer says they have columns for SD, avg, geometric mean and median [19:52:44] they have more than that even [19:52:50] but unsure how it translates to the status page exactly [19:53:12] the public status page is a bit weird [19:53:21] chasemp: I guess we can’t give people restricted access? [19:53:26] to? [19:53:28] we could [19:53:35] we can do an "observer" type thing [19:53:40] aaaah [19:53:42] that might be nice :D [19:53:48] to just give it to the other toollabs admins [19:53:57] (so valhallasw`cloud and scfc) [19:54:07] chasemp: is that free? [19:54:27] so we can break things up and do that more or less depending on how many people [19:54:37] it's not free per se but we are way under any user limit [19:54:46] ah cool [19:54:58] we may want to split tools into a division which is like [19:55:02] managed alone [19:55:10] yeah, that sounds nice.. [20:00:00] I don't know what's more reasonable really, we could also make the data available to labs admins for perf in general [20:00:04] and cache a copy via api [20:00:17] there is a myriad of things [20:01:21] chasemp: hmm, so ideally tools (NOT LABS!) admins should be able to see the general catchpoint interface for tools and do anything that doesn’t cost [20:01:54] ok let me think on it a bit as in theory that's possible but involves features we don't currently use [20:02:41] chasemp: ok! [20:03:24] you can also make "widgets" in the analytics interface and add them to the dashboard for now [20:03:30] for more "stuffs" [20:13:36] OK random question: Can I set up git to pull from GitHub or is that not allowed? [20:14:18] Matthew_: for your own tool? [20:14:20] feel free to :) [20:14:26] yuvipanda: Yes. [20:14:27] Thank you. [20:38:41] … Are PHP shorttags enabled by default? [20:39:50] should be disabled by default now [20:39:59] the default webservice start is giving you php 5.5 [20:42:33] Matthew_: Excellent. Thank you.
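On the earlier point about error bars and adding SD next to the average: a small Python sketch, using only the standard library `statistics` module, of how the four columns mentioned (SD, avg, geometric mean, median) react to a single slow response. The function name `summarize` is hypothetical, and any sample values fed to it are made up for illustration; they are not measurements from the catchpoint data.

```python
import statistics


def summarize(samples_ms):
    """Summarize response times (in ms) with the columns named in the chat."""
    return {
        "avg": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "sd": statistics.stdev(samples_ms),  # sample standard deviation
        "geometric_mean": statistics.geometric_mean(samples_ms),
    }
```

With a sample set such as three fast responses and one multi-second spike, the average and SD are dominated by the spike while the median stays at the typical value, which is the reason a difference of tens of milliseconds between two backends cannot be read off averages alone.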
[20:48:12] yuvipanda: but then we don't need the respawn at all (switching channels) [20:49:28] valhallasw`cloud: yeah but puppet does it only once every 20 minutes. [20:49:36] mm. [20:50:16] valhallasw`cloud: so basically no, if we put a respawn limit, it’s not going to be respected after 20 mins, but if we don’t have respawn we could have 19 minutes of preventable downtime. [20:50:36] yuvipanda: not really [20:50:49] well, it depends on why it crashes [20:50:52] true [20:50:59] but adding a respawn limit doesn’t give us anything [20:51:02] it dos [20:51:04] it does [20:51:13] it prevents the job from restarting 100s of times per second [20:51:18] hmm [20:51:23] the example limit in the docs is respawn limit 10 5 [20:51:24] well, alright. adding it doesn’t hurt [20:51:26] 10 times in 5 secs [20:51:46] oh but that's default apparently [20:51:54] valhallasw`cloud: anyway, explicitly set that too [20:52:00] ok! [20:52:27] your uwsgi command is still a mess :P [20:52:32] ordering, yo. [20:52:44] callable should be next to wsgi-file [20:52:45] I like putting --plugin first [20:52:51] bah fair [20:53:04] plugin first is fine [20:53:13] but then plugin - wsgi-file - callable - the rest [20:53:38] yuvipanda: and the Uwsg::app['...']? [20:53:47] no, that should definitely go [20:54:28] valhallasw`cloud: whelp. yes, gone now. [20:54:43] yuvipanda: and if you really want to be fancy, pass the socket from puppet [20:54:52] to the nginx erb and the upstart file [20:55:00] valhallasw`cloud: nah, I don’t want to be that fancy :P [20:55:08] this much duplication is ok, IMO [20:55:12] sure [20:56:43] yuvipanda: what does --master do? [20:58:00] valhallasw`cloud: spawns subprocesses and ‘manages’ them from a parent process, IIRC [20:58:14] yuvipanda: mkay. we might need to configure upstart for that as well [20:58:29] valhallasw`cloud: I think that’s automagically done...
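To make the "respawn limit 10 5" point concrete (10 restarts within 5 seconds, after which the job is left stopped), here is a simplified Python model of such a limiter. It is only an illustration of the idea that a limit stops a crash loop from restarting hundreds of times per second; it is not how upstart implements the stanza, and the class name `RespawnLimiter` and its sliding-window bookkeeping are assumptions of this sketch.

```python
from collections import deque


class RespawnLimiter:
    """Toy model of 'respawn limit COUNT INTERVAL' (not upstart's own code)."""

    def __init__(self, count=10, interval=5.0):
        self.count = count          # restarts tolerated per window
        self.interval = interval    # window length in seconds
        self._restarts = deque()    # timestamps of recent restarts

    def allow_restart(self, now):
        """Return True if a restart at time `now` stays within the limit."""
        # Forget restarts that have fallen out of the window.
        while self._restarts and now - self._restarts[0] >= self.interval:
            self._restarts.popleft()
        if len(self._restarts) >= self.count:
            return False  # crash loop: give up instead of restarting again
        self._restarts.append(now)
        return True
```

In this model a job that dies every 10 ms is cut off after its tenth restart, while a job that crashes once every few minutes is always restarted, which matches the trade-off discussed against the 20-minute puppet run.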
[20:59:03] valhallasw`cloud: basically I’ve no idea why --master isn’t the default [20:59:05] > uWSGI’s built-in prefork+threading multi-worker management mode, activated by flicking the master switch on. For all practical serving deployments it’s not really a good idea not to use master mode. [20:59:12] (from https://uwsgi-docs.readthedocs.org/en/latest/Glossary.html) [20:59:16] mmm, it doesn't daemonize, right? then it's fine [20:59:19] no it doesn't [20:59:42] ok [21:00:42] valhallasw`cloud: anything else? [21:00:54] yuvipanda: lgtm otherwise [21:01:10] valhallasw`cloud: \o/. +1? [21:01:23] yuvipanda: fix the uwsgi::thing first :P [21:01:31] valhallasw`cloud: I did [21:01:32] and pushed [21:01:35] oh wait [21:02:00] yuvipanda: mmm. [21:02:16] will nginx break if that sock isn't there? [21:02:28] it won’t ‘break’ but will return 503 [21:02:37] err [21:02:43] it’ll start but requests get 503 [21:02:46] yeah, that's fine. I was wondering whether that require made sense [21:03:01] it's not harmful either, so w/e [21:03:04] ah, yeah - it’ll serve 503 otherwise. [21:04:08] valhallasw`cloud: :D yay [21:04:10] valhallasw`cloud: thanks a lot :) [21:04:24] valhallasw`cloud: also I was wondering if the lighttpd response variations are actually real. uwsgi is flat, because it doesn’t hit NFS [21:04:31] while lighttpd has to hit NFS for every request [21:04:48] yuvipanda: from the data we have, there is no difference between the two [21:04:54] valhallasw`cloud: uwsgi? [21:04:57] uwsgi is flat.
[21:05:05] (you can see it if you scroll allll the way to the right) [21:05:06] again, the variations are on the order of seconds while the difference we see is tens of ms [21:05:26] hmm [21:05:28] eh [21:05:28] I’ll check again in a while [21:05:32] I'm not sure what you mean then [21:05:38] trusty vs precise is not something we can see [21:05:45] the spikes are real, I think, but could be nfs related, yes [21:06:00] valhallasw`cloud: look at http://p.catchpoint.com/ui/Entry/PC/V/ARQf-C-D-OvGeBjSUO.H8WMAA [21:06:11] valhallasw`cloud: trusty / precise lighttpd are somewhat noisy. [21:06:21] valhallasw`cloud: and then you have uwsgi responses which are a *lot* more consistent [21:06:22] yeah [21:06:37] uwsgi is 50 \pm 50 ms or so [21:06:38] well [21:06:47] maybe 75 \pm 50 :P [21:06:50] soon we’ll have NFS measurements of some sort [21:06:54] and we can compare. [21:06:55] let’s see [21:07:34] yay, correlations [21:08:21] yeah :) [21:10:34] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [21:25:35] RECOVERY - Puppet failure on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:34:24] yuvipanda: can we set shinken to only complain after 2 or 3 failed puppet runs? [21:34:53] yuvipanda: that warning was for a single puppet run failing as far as I can see [21:35:25] valhallasw`cloud: yeah, we probably can... [21:35:28] but they usually don't fix themselves after another run, except on very first runs of fresh machines? [21:35:49] i'd drop the "55.56% of data" stuff though [21:36:18] mutante: not so sure. This one definitely did [21:36:38] we can also just completely turn them off, because I haven't seen an admin actually respond to them ;-) [21:37:05] yuvipanda: also, we can has logstash or something like that?
[21:37:25] yuvipanda: it would be So Awesome(TM) if a shinken alert came with a link to a log file browser [21:38:53] valhallasw`cloud: yup, we need logstash for tools at some point, I think. [21:38:57] so much work to do, so little time... [21:39:22] Tool-Labs: Tool Labs: Install php5-mcrypt on Trusty - https://phabricator.wikimedia.org/T97857#1253572 (Danmichaelo) NEW [21:39:42] a puppet run that fixes itself without a human changing anything? [21:39:55] that would be very special [21:40:36] that happens all the time. lots of our manifests are ‘fails on first run, succeeds on second' [21:40:50] that was the exception i listed above [21:40:55] "unless on the first run" [21:41:09] ah, I didn’t see that. [21:41:16] we also have the daily storm whenever the puppetmaster gets logrotated [21:41:20] mutante: I'm not even sure why this one had a failure to begin with; the puppet log is entirely clear of errors [21:41:34] valhallasw`cloud: \o/ http://tools-checker.wmflabs.org/nfs/home [21:41:43] valhallasw`cloud: tools-checker? it’s because I’m futzing with it actively :D [21:42:00] yuvipanda: oh, it's a human run that failed? :P [21:42:03] yes [21:42:11] valhallasw`cloud: maybe the bug is in the "use graphite" part of detecting it then [21:42:17] mutante: yeah, probably. [21:42:27] well, in this case it was a proper failure :) [21:42:38] (https://gerrit.wikimedia.org/r/#/c/208269/ was the fix) [21:43:12] so you got a fix out of it, reason not to just delete everything [21:43:30] valhallasw`cloud: there's an admin reacting to it, yuvi :) [21:43:33] well, I manually ran puppet so I saw the errors there :P [21:43:34] or we should just teach ourselves to set hosts to 'maintenance' when we fiddle with it :P [21:43:52] somehow because I have no clue how, really. [21:43:53] valhallasw`cloud: yeah, for that we need to enable ‘maintenance’ on shinken... [21:44:00] and for that we need to enable user accounts...
[21:44:22] it's really hard for me not to say something about having icinga here ... [21:44:24] for which we need to write a SUL (or) wikitech provider (can’t use LDAP) [21:44:28] which has all that [21:45:07] mutante: feel free to clean up the puppet manifests and set them up. I can show you where the source is if you’d like. [21:45:18] yuvipanda: or local passwords? :P it's for four people anyway [21:45:36] and those four people have or should have a password manager [21:45:37] valhallasw`cloud: shinken is technically for everyone, but we can do that in a pinch, yeah :). [21:47:29] yuvipanda: "the source"? [21:51:50] valhallasw`cloud: we can’t just use ldap for shinken, sadly - because ops don’t want ldap auth to go through labs hosts... [21:52:12] so a good solution might be to keep shinken on a prod host but then we’ll have to open up a fuckton of holes in that labs/prod firewall and mark vetoed that when I started... [21:53:43] yuvipanda: we don't need it [21:53:54] yuvipanda: again, we just need four authenticated accounts [21:53:59] valhallasw`cloud: for the 4-5 of us yeah. [21:54:04] feels a bit dirty tho :) [21:54:12] valhallasw`cloud: file a bug? I’ll do that Monday... [21:55:30] !log tools.wikibugs Updated channels.yaml to: b6c7fa03a61f5b27061be11900b6e432d500b765 Remove definitions for #wikimedia-mobile [21:55:35] Logged the message, Master [21:56:32] Tool-Labs: Tool Labs: Install php5-mcrypt on Trusty - https://phabricator.wikimedia.org/T97857#1253615 (yuvipanda) Hmm, I see that it is marked to be installed for both precise and trusty - and I just checked, and it is installed on trusty too. Do you have sample code that shows it not being usable on trusty? [21:59:39] Tool-Labs: Provide centralized logging (logstash) - https://phabricator.wikimedia.org/T97861#1253625 (valhallasw) NEW [22:00:38] yuvipanda: btw, do you know if the procurement mail queue stuff is working?
I'd like that for our root mails [22:01:08] tagged tool-labs + wmf-nda [22:01:32] valhallasw`cloud: procurement mail queue? [22:01:40] valhallasw`cloud: you mean the email -> phabricator task thingy? [22:01:41] hmmm [22:01:44] yeah [22:01:53] valhallasw`cloud: chasemp would know [22:01:54] email bot? [22:01:57] also, wait, does shinken have a web interface? [22:02:09] ah, yes, it does [22:02:16] must have missed that [22:02:50] valhallasw`cloud: we are also running an ancient version. should upgrade [22:03:09] valhallasw`cloud: debian jessie has a much more recent version. I need to find some time to sit and upgrade.. [22:03:23] Tool-Labs: Add shinken admin accounts for tools ops - https://phabricator.wikimedia.org/T97862#1253636 (valhallasw) NEW [22:03:31] I'm pretty staunchly against anything auto creating issues per se based on errors that's much more handleable in an otrs type thing [22:03:44] but in essence phab doesn't have good mechanisms to handle the eventual spam / overzealousness [22:06:04] (PS1) Jforrester: Put a few things into -editing. [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/208281 [22:06:26] true. otoh, it also doesn't help to send mail to four people and *not* have a central location for communication [22:06:36] but I guess something like otrs could help with that as well [22:07:54] (CR) Merlijn van Deen: [C: 2] Put a few things into -editing. [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/208281 (owner: Jforrester) [22:08:06] (Merged) jenkins-bot: Put a few things into -editing. [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/208281 (owner: Jforrester) [22:08:22] Thanks.
:-) [22:09:00] valhallasw`cloud: there is an upstream thing on the horizon called nuance that is like an end-user ticket tracker thing [22:09:05] and also useful for auto reporting bugs [22:09:08] but who knows when [22:10:43] !log tools.wikibugs Updated channels.yaml to: 831099cc50dbc6828c2ef5ff8f2e6aa41cd97310 Put a few things into -editing. [22:10:47] Logged the message, Master [22:14:39] chasemp: when someone invests a M€ in phabricator, probably ;-) [22:15:06] true [22:17:19] anyway, time for bed. [22:17:37] Labs, Beta-Cluster, Monitoring: Setup (simple) catchpoint monitoring for betacluster - https://phabricator.wikimedia.org/T97865#1253670 (yuvipanda) NEW [22:18:32] valhallasw`cloud: night! \o/ [22:21:18] Tool-Labs, Patch-For-Review, ToolLabs-Goals-Q4: Setup a tools checker service that can check all internal services for availability - https://phabricator.wikimedia.org/T97748#1253677 (yuvipanda) NFS, redis, lighttpd - precise, lighttpd - trusty, lighttpd uwsg-python tests done :D [22:22:11] Labs, Beta-Cluster, Monitoring: Setup (simple) catchpoint monitoring for betacluster - https://phabricator.wikimedia.org/T97865#1253678 (greg) [22:52:00] Labs, Beta-Cluster, Monitoring: Setup (simple) catchpoint monitoring for betacluster - https://phabricator.wikimedia.org/T97865#1253746 (greg) See also {T88705}