[00:24:21] 3Wikimedia Labs / 3deployment-prep (beta): populateBacklinkNamespace script causing massive slave lag on beta - 10https://bugzilla.wikimedia.org/68349#c16 (10Bawolff (Brian Wolff)) 5PATC>3RESO/FIX Slave lag is back down to 0. Guess this is fixed. [00:56:36] 3Wikimedia Labs / 3deployment-prep (beta): populateBacklinkNamespace script causing massive slave lag on beta - 10https://bugzilla.wikimedia.org/68349#c17 (10Greg Grossmeier) 5RESO/FIX>3REOP I want to leave this open until we've figured out if we can prevent this from happening again. [01:20:31] do I need labs shell access in order to run queries like this SELECT DISTINCT query on enwiki? https://www.mediawiki.org/wiki/Manual:Logging_table#log_action [01:29:06] leucosticte_: You also need to be a member of the Tools project. Do you have an account on wikitech.wikimedia.org/Gerrit? [01:34:41] scfc_de: I have a gerrit account, "leucosticte" [01:34:57] scfc_de: commit name "tisane" [01:35:25] scfc: also a labs account, "Leucosticte" [01:40:50] 3Wikimedia Labs / 3deployment-prep (beta): populateBacklinkNamespace script causing massive slave lag on beta - 10https://bugzilla.wikimedia.org/68349#c18 (10Bawolff (Brian Wolff)) (In reply to Greg Grossmeier from comment #17) > I want to leave this open until we've figured out if we can prevent this > from... [01:50:54] legoktm: bah, can you change the callback URL of an OAuth application after it's been approved? [02:07:01] leucosticte_: Added you to Tools. Welcome! https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help for help. [02:18:39] scfc_de: Thank you! [02:56:08] Anyone still up? [02:56:20] I wrote a python script that runs fine under a tool account [02:56:32] But doesn't appear to be doing much when I submit it as a job [02:56:39] Any error messages? [02:56:54] and any virtualenvs involved? [02:57:14] (that's probably in ~/python.err or something like that...) [02:57:30] No [02:57:58] is the process running? (qstat) [02:58:40] And no, nothing like virtualenvs involved [02:59:26] I tried spamming qstat, it usually gets given a queue (e.g. task@tools-exec-13.eqiad.wmfla) and then disappears very quickly [02:59:51] qacct -j python ? [03:00:07] maybe it's an issue with python getting killed OOM before the task eve nstarts or something like that [03:02:03] give it a lot of memory? [03:02:14] There's an awful lot of output from that command valhallasw`cloud [03:02:34] Also I don't see why I should give it a lot of memory [03:03:34] It does some DB connections and then opens a socket connection. I can run it fine directly from tools-login... [03:03:54] because on the grid the entire vmem is calculated, and that's typically huge for interpreted languages [03:05:00] So shall I just try "-mem 1g" or something? [03:05:46] valhallasw`cloud: also, do you know if it is possible to change the callback URL for an app after it's been approved? [03:06:00] YuviPanda|Sleepy: nope, not possible [03:06:03] sigh [03:06:03] Krenair: yes. [03:06:06] will submit a new one soon then [03:06:11] YuviPanda|Sleepy: BUT you can test without it being approved [03:06:23] valhallasw`cloud: indeed, that's what I'll do now [03:06:33] valhallasw`cloud: the previous one was all fine, except I forgot // in the beginning of the URL [03:06:48] ouch. Yep, did that once too. 
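A minimal sketch of the job-debugging sequence discussed above, assuming Tool Labs' jsub/gridengine defaults (the job and script names are placeholders):

    jsub -N myjob -mem 1g python myscript.py            # raise the vmem limit; interpreted languages count their whole VM
    qstat                                               # is the job still queued or running?
    qacct -j myjob | egrep 'failed|exit_status|maxvmem' # accounting record once the job vanishes from qstat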
[03:06:52] actually, I saw one job was sitting around on qstat just now [03:08:30] qdel'd it, trying again with -mem 1g [03:08:48] valhallasw`cloud: my callback URL is: [03:08:50] http://quarry.wmflabs.org/oauth-callback?oauth_verifier=b5a0b31a19063a23f0ee48986da45c18&oauth_token=095c137cb1a3d81c0e8447dec8fb9589 [03:08:51] and yet [03:09:00] KeyError: 'oauth_token' [03:09:02] WHY [03:09:04] this one is also sitting on qstat, not seeming to do anything [03:09:09] * YuviPanda|Sleepy has not had a good day with code [03:09:44] YuviPanda|Sleepy: hrm. what's oauth-callback? that's not the standard URL, right? [03:09:53] valhallasw`cloud: that is? [03:09:54] it's been a while since I worked on oauth stuff [03:09:59] valhallasw`cloud: it's the default one from flask-mwoauth [03:09:59] ok, then I'm just ocnfused [03:10:04] I am too [03:10:11] oh. [03:10:16] you can only use the URL once [03:10:35] but I'm not sure if that gives a keyerror... [03:11:58] indeed, I am too [03:12:05] since this was working not less than 5 minutes ago [03:12:14] I go fix another error and come back, and BAM [03:12:27] just go through the oauth go-around again [03:12:50] I've tried it several times [03:13:04] even created a new consumer without the // but with http:// (since I don't have https yet) [03:13:05] still [03:14:21] valhallasw`cloud: heh, works in incognito mode [03:14:29] I suppose cookies messed up? [03:14:34] sounds like it [03:14:45] still shouldn't give crappy error messages :-p [03:14:50] yep :D [03:15:03] yeah, I called /logout, and now things are fine [03:15:10] well, you knooooooow where the pull request button is O-) [03:15:17] heh :D [03:15:25] valhallasw`cloud: btw, I can't pip install the new release you made [03:15:38] crashes at trying to read README.rst, saying no such file exists [03:15:43] oh god [03:15:46] yeah [03:15:47] manifest issues [03:15:52] heh :D [03:15:56] make another release when you've the time? [03:15:58] fuck python packaging [03:16:00] seriously [03:16:03] yeah, I have none [03:16:06] heh [03:16:10] I can kill the readme-reading part [03:16:12] just a sec [03:16:18] yeah, I spent about an hour trying to fucking get relative imports to wokr [03:16:19] work [03:16:22] no fucking way, apparently [03:16:29] oh god [03:16:36] don't get me started [03:16:44] I like python, but this python 3 thing :-p [03:16:49] :) [03:17:24] okay I *would* try fixing it [03:17:28] but my / is full [03:17:29] FFS [03:17:35] man, today's a bad day [03:18:15] probably filled with kernel updates again [03:18:47] heh [03:20:25] Remove the following packages: [03:20:25] 1) linux-headers-server [03:20:25] 2) linux-server [03:20:26] ... [03:20:27] yes [03:20:32] that's a great idea, aptitude [03:22:07] yeah [03:23:38] I also don't get why aptitude doesn't remove the gazillion old kernels [03:23:50] seriously, 600M in old kernels is not funny on an 8G VM [03:31:08] YuviPanda|Sleepy: should be fixed now [03:31:13] w00t [03:31:14] tyy [03:31:16] everything is MANIFESTed now [03:31:18] I think [03:34:23] 3Wikimedia Labs / 3wikitech-interface: Enable HSTS (HTTP Strict Transport Security) on Wikitech - 10https://bugzilla.wikimedia.org/67303#c3 (10fn84b) 5REOP>3RESO/FIX https://gerrit.wikimedia.org/r/148290 fixed this issue. [04:14:00] ok BED [04:14:05] später. [06:12:42] Hi, I got an account recently with shell acess, but I can't manage to login to tools-login.wmflabs.org ("Permission Denied"). I did set my ssh key in preferences. What might be wrong ? [06:26:03] are you using the command-line "ssh" client? 
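The packaging crash valhallasw`cloud fixes above (pip failing because setup.py tries to read a README.rst that never made it into the sdist) is typically cured by declaring the file in MANIFEST.in; a minimal sketch, assuming a conventional setup.py that reads the README:

    printf 'include README.rst\n' >> MANIFEST.in   # ship the file inside the source distribution
    python setup.py sdist                          # rebuild the tarball; pip install should now find README.rst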
could you try running it with '-vvv' for make its output verbose? [06:26:11] *to make [06:42:05] 3Wikimedia Labs / 3deployment-prep (beta): populateBacklinkNamespace script causing massive slave lag on beta - 10https://bugzilla.wikimedia.org/68349#c19 (10Antoine "hashar" Musso) I am also wondering how we are going to handle that update in production. Might end up taking a long time as well. [06:52:31] https://www.mediawiki.org/wiki/User:KartikMistry/ImportToBetaWikis - which page is better to put this information? [06:58:38] there isn't a beta cluster page on wikitech? [06:58:40] * ori boggles. [06:58:52] the one on mediawiki.org is more of a project page [06:59:04] honestly, i'd create 'beta cluster' on wikitech and throw that in there [06:59:57] on wikitech it's called "deployment-prep" [07:00:21] https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep has a bunch of links [07:00:41] ori: sorry was busy for a while. I tried with -vvv and I see it's trying to offer my public key [07:00:46] nothing says "beta cluster" like "Nova_Resource:Deployment-prep" [07:02:21] 3Wikimedia Labs / 3deployment-prep (beta): populateBacklinkNamespace script causing massive slave lag on beta - 10https://bugzilla.wikimedia.org/68349#c20 (10Bawolff (Brian Wolff)) (In reply to Antoine "hashar" Musso from comment #19) > I am also wondering how we are going to handle that update in production... [09:07:59] ori: legoktm Updated https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Import_to_betawiki [09:08:16] will improve it over time. [09:39:20] andrewbogott_afk: can I haz public ip for project bots? I need it to set up identd for wm-bot [09:39:37] freenode needs to be able to contact the identd running on instance wm-bot [09:44:09] 3Wikimedia Labs / 3tools: Update lighttpd default settings - 10https://bugzilla.wikimedia.org/68431 (10metatron) 3UNCO p:3Unprio s:3normal a:3Marc A. Pelletier According to: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help/Performance#Webservice_stalling.2C_OOM_or_otherwise_unresponsive... [10:51:52] Change on 12mediawiki a page OAuth (obsolete info)/en-gb was modified, changed by 90.25.31.134 link https://www.mediawiki.org/w/index.php?diff=1074326 edit summary: [10:52:05] Change on 12mediawiki a page OAuth (obsolete info)/en was modified, changed by 90.25.31.134 link https://www.mediawiki.org/w/index.php?diff=1074328 edit summary: [10:55:56] Change on 12mediawiki a page OAuth (obsolete info)/en-gb was modified, changed by Shirayuki link https://www.mediawiki.org/w/index.php?diff=1074334 edit summary: Reverted edits by [[Special:Contributions/90.25.31.134|90.25.31.134]] ([[User talk:90.25.31.134|talk]]) to last revision by [[User:FuzzyBot|FuzzyBot]] [10:56:39] Change on 12mediawiki a page OAuth (obsolete info)/en was modified, changed by Shirayuki link https://www.mediawiki.org/w/index.php?diff=1074336 edit summary: Reverted edits by [[Special:Contributions/90.25.31.134|90.25.31.134]] ([[User talk:90.25.31.134|talk]]) to last revision by [[User:FuzzyBot|FuzzyBot]] [12:41:47] @notify andrewbogott_afk [12:41:47] This user is now online in #wikimedia-labs. I'll let you know when they show some activity (talk, etc.) [12:53:36] 3Wikimedia Labs / 3tools: Update lighttpd default settings - 10https://bugzilla.wikimedia.org/68431#c2 (10Tim Landscheidt) (In reply to metatron from comment #0) > [...] > Additional: > Is /usr/bin/webservice script puppetized? It's at Coren: hi! 
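Returning to the ssh "Permission Denied" thread from earlier in the log, the verbose client invocation suggested there looks like this (the username is a placeholder); -vvv makes OpenSSH print which keys it offers and how the server responds, which usually pinpoints the failure:

    ssh -vvv yourusername@tools-login.wmflabs.org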
do we have any occurrence of labs instance being allowed to connect to production servers on 10.x network? [12:55:14] Labs /instances/? Normally not, by design. What's your use case? [12:56:06] Coren: I am refactoring Zuul to scale a process that craft merge commits that are used for testing. Will be crafted on both gallium and lanthanum the later having a 10.x address [12:56:28] I have a bunch of slaves in labs that would need to be able to talk to lanthanum (10.x prod server) over git protocol [12:56:42] maybe I should write that down and reach out to ops for more input / idea [12:59:05] 3Wikimedia Labs / 3tools: Can't send email from tools-exec-11, -12, or -13 - 10https://bugzilla.wikimedia.org/67912 (10Tim Landscheidt) 5NEW>3RESO/FIX [13:00:39] hashar: You should indeed. It may be technically possible to do but it's something we'd very much want to avoid (breaching any holes in the labs-not-labs wall) [13:01:12] hashar: Normally, we serve things like that with a server in the labs infrastructure network and let /that/ connect back (c.f. the db replicas) [13:02:39] that would be doable (i.e. a server in labs infra) [13:03:51] hashar: From a prod point of view, labs = the wide Internet. :-) [13:04:03] yeah that is understandable [13:04:10] another way would be to add a public IP on lanthanum [13:04:16] but that is a waste of a rare resource [13:04:40] I guess I finally have to write that long overdue architecture document [13:04:59] hashar: Not that rare; but I expect the worries will be security not address space. Best thing is definitely to throw your scenario onto the mailing list and we'll figure something out. [13:05:28] I am face palming currently. So obvious that I should have thought about it [13:21:50] 3Wikimedia Labs / 3deployment-prep (beta): populateBacklinkNamespace script causing massive slave lag on beta - 10https://bugzilla.wikimedia.org/68349#c21 (10Tim Landscheidt) (In reply to Bawolff (Brian Wolff) from comment #20) > (In reply to Antoine "hashar" Musso from comment #19) > > I am also wondering h... [14:50:04] andrewbogott_afk: can I haz public ip for project bots? I need it to set up identd for wm-bot [14:50:19] poor wm-bot4 can't connect :/ [14:50:28] too many connections it say [14:56:07] 3Wikimedia Labs / 3tools: Update lighttpd default settings - 10https://bugzilla.wikimedia.org/68431#c6 (10metatron) Thanks. [15:00:37] petan: yep, done. [15:00:48] thanks! [15:12:13] Coren: took me two hours but I got a schema :-] https://upload.wikimedia.org/wikipedia/commons/e/e9/Integrationwikimediaci-zuul_git_flows.svg :D [15:12:16] posting to ops [15:12:48] hashar: There was some chatter yesterday about ssl having regressed on beta during the datacenter move. Do you know if that's right? [15:13:17] andrewbogott: yeah nobody can't figure out whether it happened during the eqiad move or only yesterday [15:13:41] andrewbogott: we have nginx proxies on the text varnish box. nginx terminates the ssl connection [15:13:49] Hm.. the data currently pushed to labs graphite for vm metrics (cpu, disk, memory etc.) is that real? [15:13:49] ok… I think that spage was complaining about it a month or so ago. Although possibly that was a different issue. [15:13:55] andrewbogott: nginx refuses to start because the ssl star.wmflabs keys do not match [15:14:02] Krinkle: should be [15:14:03] hashar: anyway, just wanted to make sure you're on top of it. Let me know if I can do anything... 
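A direct way to test Krinkle's suspicion that the values are repeating: graphite's render API returns raw datapoints as JSON when given format=json (a standard render parameter), so a flat-lined series shows the same number over and over where a dead collector would show nulls. The target below is the one examined in the following lines:

    curl 'http://graphite.wmflabs.org/render/?target=cvn.*.cpu.total.user.value&from=-3weeks&format=json'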
[15:14:11] It seems the value is constant since the first value on July 7 [15:14:19] Krinkle: they are collected using a python daemon named "diamond" which push metrics to graphite [15:14:35] for any metric I pick, it's always a flat line [15:14:39] andrewbogott: I think we will bike shed about it a bit more. That is a long standing issue :-/ [15:14:46] ok [15:15:41] Krinkle: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1406128526.477&target=deployment-prep.deployment-jobrunner01.cpu.total.user.value [15:15:50] Krinkle: seems to work for me. Maybe the diamond daemon is broken on the instance? [15:15:59] hashar: thx [15:16:04] hashar: try the 'cvn' project [15:16:19] even time since puppet run, that value should alternate overtime [15:16:21] it can't be fixed [15:16:27] s/fixed/constant [15:16:56] Krinkle: make sure puppet pass and is up to date [15:17:03] Krinkle: then maybe restart the diamond process [15:17:10] http://graphite.wmflabs.org/render/?width=578&height=289&_salt=1406128626.713&from=00%3A00_20140723&until=23%3A45_20140723&hideLegend=false&target=cvn.*.cpu.total.user.value [15:17:13] Krinkle: it has some log in /var/log/diamond/ or something [15:17:25] it is not a gap though [15:17:47] it has data for hour of every day [15:17:54] 3 weeks window: http://graphite.wmflabs.org/render/?width=578&height=289&_salt=1406128626.713&from=-3weeks&hideLegend=false&target=cvn.*.cpu.total.user.value [15:17:57] apparently never worked [15:18:46] so even if something is broken on the cvn side, it should look like a gap instead of a constant line? Maybe the log aggregator is forgetting to check the data is from recent and not looking at the same value each time? [15:19:01] hi. i get OperationalError: (OperationalError) (2006, 'MySQL server has gone away') ERROR while using my flask app.. what can i do?? [15:19:49] Krinkle: maybe it only worked once [15:19:59] Krinkle: and keep repeating the last known value [15:20:58] hashar: but then it should have a gap in the line on graphite. Looks like there is a process (presumably the aggrator from outside the local inside) that is extracting the value from the instance without ensuring it is a new value. Maybe it should delete the data after it is read so that if it wasn't recomputed, there is no data, and not hte same data. [15:21:11] ganglia does that [15:21:24] !log deployment-prep Upgraded hhvm to 3.1+20140630; seeing problems with luasandbox extension [15:21:27] Logged the message, Master [15:21:46] puppet runs fine on cvn-dev.eqiad.wmflabs [15:22:39] The /var/log/diamond dir contains files but all last touched < July 7 [15:22:54] -rw-r--r-- 1 diamond nogroup 6.6M Jul 7 15:42 archive.log [15:22:54] -rw-r--r-- 1 diamond nogroup 11M Jul 2 23:59 archive.log.2014-07-02 [15:22:54] -rw-r--r-- 1 diamond nogroup 11M Jul 3 23:59 archive.log.2014-07-03 [15:22:55] -rw-r--r-- 1 diamond nogroup 11M Jul 4 23:59 archive.log.2014-07-04 [15:22:57] -rw-r--r-- 1 diamond nogroup 10M Jul 5 23:59 archive.log.2014-07-05 [15:22:59] -rw-r--r-- 1 diamond nogroup 11M Jul 6 23:59 archive.log.2014-07-06 [15:23:01] -rw-r--r-- 1 diamond nogroup 1.4M Jul 8 17:03 diamond.log [15:23:03] -rw-r--r-- 1 diamond nogroup 19M Jul 7 14:03 diamond.log.2014-07-06 [15:23:52] Krinkle: maybe /var is full ? [15:24:09] verify diamond is running [15:24:15] and redstart it [15:24:17] that might fix it [15:24:20] Coren Sry for that abandoned change ... [15:24:27] Shouldn't puppet or cron (via puppet) ensure diamond is running? 
If this will be our low-level way of monitoring that replaces icinga/ganglia for labs, I need to be able to rely on it and not have to make sure it is running, especially if the fallback when it is down is presented as a repeating of the last data. [15:24:39] ps -u diamond f [15:24:42] should give something [15:25:04] puppet probably attempts to maintain it, something like service { diamond: ensure => running } [15:25:05] ps -aux | grep diamond [15:25:06] empty [15:25:21] start it! :-] [15:25:29] it probably has some problem starting up though [15:25:57] /dev/vda2 1.9G 524M 1.3G 29% /var [15:26:01] good [15:26:05] /dev/vda1 7.6G 1.3G 5.9G 19% / [15:26:09] they all look good [15:26:19] diamond is started via upstart, so there might be some clue in /var/log/upstart/diamond.log [15:26:54] I do see a mega ton of these in the syslog: [15:26:56] Jul 23 15:24:03 cvn-dev nslcd[1063]: [df1fbd] error writing to client: Broken pipe [15:26:56] Jul 23 15:24:03 cvn-dev nslcd[1063]: [d35109] error writing to client: Broken pipe [15:26:57] Jul 23 15:24:03 cvn-dev nslcd[1063]: [caec43] error writing to client: Broken pipe [15:26:58] Jul 23 15:24:03 cvn-dev nslcd[1063]: [837a65] error writing to client: Broken pipe [15:27:05] I noticed that on almost all instances [15:27:19] Coren: btw. something is pestering xtools with up to 35 req/sec [15:28:10] Coren: but I won't complain atm. It's a kind of endurance test for the webservice and it's spider algorithm [15:28:22] Krinkle: that is nslcd daemon, unrelated :) [15:28:29] *anti-spider [15:29:02] I know it's unrelated, but errors that repeat themselves for months on end almost every 20 seconds in the logs are either a serious bug we're ignoring, or a false positive that shouldn't end up as an error in the log. [15:29:38] I think it's the latter, perhaps there's a way to fix its config (maybe create a uid?) [15:30:22] 3Wikimedia Labs / 3deployment-prep (beta): populateBacklinkNamespace script causing massive slave lag on beta - 10https://bugzilla.wikimedia.org/68349#c22 (10Greg Grossmeier) (In reply to Antoine "hashar" Musso from comment #19) > I am also wondering how we are going to handle that update in production. > M... [15:33:06] 3Wikimedia Labs / 3deployment-prep (beta): populateBacklinkNamespace script causing massive slave lag on beta - 10https://bugzilla.wikimedia.org/68349#c23 (10Antoine "hashar" Musso) 5REOP>3RESO/FIX Excellent! So there is nothing to talk about anymore =) Beta is happy, slave lag is back to 0 seconds.... [15:34:20] !log deployment-prep Reverted hhvm to 3.1+20140630-1+wm1 on deployment-mediawiki02 [15:34:22] Logged the message, Master [15:35:41] $ service diamond start [15:35:41] start: Rejected send message, 1 matched rules; type="method_call", sender=":1.49" (uid=2008 pid=6910 comm="start diamond ") interface="com.ubuntu.Upstart0_6.Job" member="Start" error name="(unset)" requested_reply="0" destination="com.ubuntu.Upstart" (uid=0 pid=1 comm="/sbin/init") [15:35:50] hashar: Hm.. not sure what that means [15:36:11] $ service diamond status [15:36:11] diamond stop/waiting [15:37:25] 3Wikimedia Labs / 3Infrastructure: WMFLabs: "service" incorrectly says init.d is invoked and suggests itself as alternative - 10https://bugzilla.wikimedia.org/68442 (10Krinkle) 3NEW p:3Unprio s:3trivia a:3None $ /etc/init.d/diamond status Rather than invoking init scripts through /etc/init.d, use the...
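The "Rejected send message ... uid=2008" error above is upstart refusing job control over D-Bus from an unprivileged user (note the non-root uid of the sender), so the likely fix is simply running the command as root:

    sudo service diamond start    # upstart start/stop requests must come from uid 0
    sudo service diamond status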
[15:40:34] Krinkle: no clue :-/ [15:40:52] Krinkle: chasemp / YuviPanda would be able to help with diamond [15:43:38] 3Wikimedia Labs / 3Infrastructure: WMFLabs: Diamond not running / won't start - 10https://bugzilla.wikimedia.org/68444 (10Krinkle) 3NEW p:3Unprio s:3critic a:3None Looking at graphite, the values for cvn instances appear all constant (cpu, memory, time since puppet run, everything). For example: htt... [15:43:39] chasemp: YuviPanda: ^ [15:57:45] Krinkle: so the I haven't dug in disclaimer :) [15:57:48] but I will say [15:57:58] we whitelisted projects that diamond is enabled for in puppet [15:58:07] so if it's not in the array in manifests/role/diamond.pp [15:58:12] it's expected to be stopped [15:58:29] and also one of the undesirable features of lots of the statsd implementations [15:58:41] is this continuity idea where they keep flushing stats even without a real source [15:58:52] to keep whisper files from having invalid xfactor ratios [15:58:55] which is...insane [15:59:09] but this sounds exactly like that and is a big reason I did my own thing statsd wise in the past [16:00:08] 3Wikimedia Labs / 3Infrastructure: WMFLabs: Diamond not running / won't start - 10https://bugzilla.wikimedia.org/68444#c1 (10Chase) from irc: so the I haven't dug in disclaimer :) chasemp but I will say chasemp we whitelisted projects that diamond is enabled for in puppet chasemp so if it's not in the array... [16:00:51] Krinkle: what is the project you are trying to run diamond for? [16:00:59] cvn [16:01:32] these are the only ones it's enabled for $labs_enabled_projects = ['tools', 'deployment-prep', 'graphite'] [16:01:51] YuviPanda would be the better person to ask as i haven't had much exposure to the limits of graphite in labs [16:02:01] but all projects/instances do have data reported in graphite [16:02:09] yes at one point they were all reporting [16:02:10] or most [16:02:12] other projects [16:02:20] but it was quickly discovered there wasn't disk space to handle it [16:02:25] and maybe it wasn't cleaned up [16:02:26] I guess it was started globally and than after than stopped via puppet? [16:02:31] yep [16:02:33] exactly that [16:02:41] that* [16:02:45] how is the clean up handled for those 3 problems. [16:02:47] projects [16:02:57] which cleanup do you mean? [16:03:08] the clean up needed to avoid /var/ from filling up [16:03:10] as you said [16:03:12] ah [16:03:49] it really isn't but the logging that's filling disk should be disabled, so if it's causing disk issues can all be removed [16:04:19] the lack of mass orchestration in labs limited how nicely this could all be handled [16:04:27] I guess there is a salt master but it's not 100% [16:04:50] anyway, if the project isn't in that list, files can be removed, won't be recreated on the host itself [16:04:59] my guess is if three is "data" in graphite and it's not sending from host [16:05:02] 1) do you take requests to add projects? 2) it being opt-in for scalability, is that temporary or is the idea to enable this for all instances eventally? 3) the way it is handled right now for those 3 projects, is that automated and I can use it, or will I have to maintain it myself if I opt-in now? [16:05:03] it's a crappy statsd implementation [16:06:00] I'm having trouble getting a script to run on the grid [16:06:00] 1) I assume yes based on resources? YuviPanda? 
2) yep there is a plan in the works for more horsepower to run more widely 3) should be automated as far as local logs, and graphite [16:06:07] I can run it directly from tools-login [16:06:24] However when I submit the job, I check qstat and almost as soon as it gets a queue, it disappears from the list [16:06:31] the .err and .out files are empty [16:07:32] Krinkle: submit a diff w/ your project and add YuviPanda to review? [16:07:50] Krenair: check the job id records after wards [16:07:58] to see why it was killed [16:08:04] probably for memory reasons [16:08:11] I think its like $ qacct -j [16:08:14] or something like that [16:08:26] it also takes a name [16:08:34] to get all past instances of that job [16:08:42] it's like qstat's archive [16:08:45] but more detailed [16:09:13] chasemp: so that array item will ensure the directory structure and the process running etc.? [16:09:23] start_time Wed Jul 23 16:05:03 2014 [16:09:23] end_time Wed Jul 23 16:05:03 2014 [16:09:49] and other lines like: failed 0 [16:09:57] Krinkle: what do you mean by directory structure? but essentially yes on the second [16:10:14] I just rm -rf'ed /var/log/diamond on the 4 cvn instances [16:11:07] chasemp: On other instances in projects also not in that array, will rm-rf'ing that directory keep the data from repeating in graphite? Or is that caused elsewhere? [16:11:23] Krenair: What's the exit code (or the job number)? [16:11:25] that array just says, run diamond otherwise ensure stopped [16:11:38] scfc_de, 2589495 [16:11:42] exit_status 0 [16:11:58] the data in graphite, assuming not from diamond is another matter [16:13:16] chasemp: k [16:13:25] chasemp: the puppet manifest says keep logs for 0 for all of labs [16:13:36] 0 days _post_ today [16:13:41] yeah [16:13:48] there were log files for 7 days on the cvn instances though [16:13:53] yes but how old? [16:14:01] I guess this is only enforced during runs? [16:14:08] not by logrotate [16:14:17] yep not by logrotate [16:14:20] k [16:20:14] !log deployment-prep hhvm upgraded to 3.1+20140723-1+wmf1 on deployment-mediawiki0[12] [16:20:16] Logged the message, Master [16:21:24] scfc_de, any idea? [16:30:42] Krenair: I see from the command line history that you started the job with quotes (") around python reporter/soething. When I removed them, the job ran with output in ve-needcheck-reporter.out. [16:31:05] ah [16:31:06] thanks scfc_de [16:34:56] bd808: is that hhvm package in apt.wikimedia.org ? [16:37:13] !log integration upgraded hhvm / elasticsearch on jenkins slaves [16:37:15] Logged the message, Master [16:41:11] Krinkle: chasemp there's a machine being provisioned for graphite [16:41:12] hashar: yes [16:41:21] Krinkle: chasemp should be up in a few days, and then it should be fine [16:41:31] Krinkle: chasemp in the meantime, we can add cvn to the project list [16:41:43] bd808: it is also missing hhvm-dev [16:42:01] bd808: ori filled a bug to get Jenkins job to compile our extensions with hhvm / hphize something ( https://bugzilla.wikimedia.org/show_bug.cgi?id=63120#c3 ) :D [16:43:16] hashar: deployment-mediawiki01 has both hhvm and hhvm-dev at 3.1+20140723-1+wmf1. _joe_ did the update the second time but it seems like he would have just pulled from apt [16:43:35] hmm [16:43:41] I am on a contint jenkins slave in labs [16:43:45] integration-slave1003 [16:43:56] hashar: Try `apt-get install hhvm hhvm-dev hhvm-fss hhvm-luasandbox hhvm-wikidiff2` ? [16:43:57] hhvm: Installed: 3.1.0~precise [16:44:11] ahh [16:44:13] Trusty!!!!!!!!!!! [16:44:18] yes! 
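A quick way to inspect the version skew just discovered (hhvm 3.1.0~precise on the Jenkins slaves, versus the newer 3.1+20140723 builds from apt.wikimedia.org), using only standard apt tooling:

    apt-cache policy hhvm hhvm-dev   # installed vs. candidate versions, and which repository provides them
    dpkg -l | grep hhvm              # everything hhvm-related currently on the box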
[16:44:20] the Jenkins slaves are using Precise hehe [16:44:47] HAT! (hhvm, apache, trusty) [16:45:11] good one [16:45:30] ori dreamed that up and is trying to "make it a thing" :) [16:45:44] or just hack [16:46:58] !log integration apt get upgrade on integration-slave1004-trusty (it is not pooled yet) [16:47:00] Logged the message, Master [16:47:37] and I am out of the coworking place. be back tomorrow [16:47:46] good night hashar [16:53:40] chasemp: the graphite box is almost done, should be up in a day or two I guess [17:25:48] !ping [17:25:48] !pong [17:49:52] 3Wikimedia Labs / 3deployment-prep (beta): HHVM: crashes with "boost::program_options::invalid_option_value" exception - 10https://bugzilla.wikimedia.org/68413#c1 (10Bryan Davis) Still seeing these in beta with new builds: hhvm 3.1+20140723-1+wmf1 hhvm-dev 3.1+20140723-1+wmf1 hhv... [17:52:23] 3Wikimedia Labs / 3deployment-prep (beta): HHVM: crashes with "boost::program_options::invalid_option_value" exception - 10https://bugzilla.wikimedia.org/68413#c2 (10Bryan Davis) Host: deployment-mediawiki02 ProcessID: 3498 ThreadID: 7f0de33ff700 ThreadPID: 4312 Name: unknown program Type: Segmentation fault... [17:56:50] 3Wikimedia Labs / 3tools: Failed to set group members for local-oclc-reference - 10https://bugzilla.wikimedia.org/65534#c2 (10Ocaasi) (In reply to Andrew Bogott from comment #1) > Sorry for the delay in responding. If this is still happening, can you tell > me what member you are adding? Hi Andrew, Usern... [18:02:41] 3Wikimedia Labs / 3tools: "Bad gateway" - 10https://bugzilla.wikimedia.org/68457 (10Magnus Manske) 3NEW p:3Unprio s:3major a:3Marc A. Pelletier I am running a webservice on instance wikidata-wdq-mm port 80. This works reasonably well, but large results sets cause a "502 Bad Gateway" error. Example:... [18:12:21] 3Wikimedia Labs / 3deployment-prep (beta): beta labs getting 503 Service unavailable or slow - 10https://bugzilla.wikimedia.org/68407#c2 (10Bryan Davis) Most of the crashes I'm seeing right now are for bug 68413. [18:12:25] 3Wikimedia Labs / 3deployment-prep (beta): HHVM: crashes with "boost::program_options::invalid_option_value" exception - 10https://bugzilla.wikimedia.org/68413 (10Bryan Davis) [18:20:22] 3Wikimedia Labs / 3deployment-prep (beta): HHVM crash logs need to go somewhere more visible than /tmp on the apache hosts - 10https://bugzilla.wikimedia.org/68459 (10Bryan Davis) 3NEW p:3Unprio s:3normal a:3None Having HHVM's stack trace logs is awesome, but having to ssh to each apache server to fi... [18:20:51] 3Wikimedia Labs / 3deployment-prep (beta): beta labs getting 503 Service unavailable or slow - 10https://bugzilla.wikimedia.org/68407 (10Bryan Davis) [18:21:05] 3Wikimedia Labs / 3deployment-prep (beta): HHVM crash logs need to go somewhere more visible than /tmp on the apache hosts - 10https://bugzilla.wikimedia.org/68459 (10Bryan Davis) [18:47:05] !ping [18:47:05] !pong [18:48:37] 3Wikimedia Labs / 3Infrastructure: "Bad gateway" - 10https://bugzilla.wikimedia.org/68457#c1 (10Tim Landscheidt) a:5Marc A. Pelletier>3None From the outside, "wget http://wdq.wmflabs.org/" retrieves something immediately, while "wget 'http://wdq.wmflabs.org/api?q=claim[31]&props=31'" returns a 502, so th... [19:16:06] 3Wikimedia Labs / 3tools: Failed to set group members for local-oclc-reference - 10https://bugzilla.wikimedia.org/65534#c3 (10Andrew Bogott) It seems to be working for me. Much of that code has been rewritten recently, so maybe this was fixed as a side effect. 
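The diagnosis quoted in bug 68457 above can be reproduced from anywhere: small responses succeed while large result sets fail, which would point at a timeout or buffer limit in the proxy in front of the webservice rather than at the backend itself (a rough check, not a definitive test):

    time wget -O /dev/null 'http://wdq.wmflabs.org/'                          # small response, returns fine
    time wget -O /dev/null 'http://wdq.wmflabs.org/api?q=claim[31]&props=31'  # large response, 502 after a delay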
Please verify that this is working for you as... [19:19:35] Coren, are you interruptible? Want to talk about labs+swift [19:20:24] andrewbogott: Yeah; what be up? [19:21:02] andrewbogott: I got the basic openstack test install up so I'm golden. Ish. [19:21:08] woo! [19:21:17] Off and on folks have talked about prividing a swift service for labs with keystone. Which sounds good to me until I think about it a bit, at which point I'm not so sure what the 'with keystone' part would mean. [19:21:44] If swift uses the same auth as labs, and the service is there for labs instances to use... [19:22:05] I really hate to bug you but I wrote a series of commands and it works okay when I run it directly and when I run it jusb it stops without any notice or error (like memory error, etc.) and the commands are very simple, git, git review, similar [19:22:32] Coren: that requires us to store keystone auth credentials on labs boxes, right? Which seems… bad. [19:23:26] Amir1: don't worry about bugging us, that's what this channel is here for :) [19:23:48] That said, I don't have an easy answer -- you aren't getting a log or stderr output when you run it? [19:24:04] andrewbogott: Well, no worse nor better than any other credentials really. But the bigger issue IMO is that *users* have creds in Lab's keystone but not service groups or projects. [19:24:12] andrewbogott: thank you [19:24:32] The log is in /data/project/maintenance-bot/maintenance.err and log [19:24:38] Amir1: We'll need more information than that to help, I fear. Either a log, or more detail on when the sequence fails and why. [19:24:41] *and out [19:25:03] Coren: Right, that's exactly the question -- I don't know who/what keystone would be authorizing in this scenario. [19:25:04] Amir1: But also, how have you determined you have not simply run over your vmem limit? [19:25:32] do you think it is secure to upload gpg keys to your home directory on labs? [19:25:44] /data/project/maintenance-bot/command is commands [19:25:58] andrewbogott: There is, a priori, no reason why service groups couldn't have keystone creds - in fact, that's probably going to be a requirement of the "make the api available to users" milestone. [19:26:28] Coren: when my codes exceeds the memory limit, an error raised named "Memory limit" and the code crashes [19:26:36] physikerwelt: you should probably assume that all project members have access to any data you put on a labs box. [19:26:37] physikerwelt: I would very much recommend against it. In theory, everyone who has root has signed an NDA but key material that is that personal should never be entrusted to anyone. [19:26:39] Since most everyone has sudo [19:26:49] and obviously it doesn't need more than 256M [19:27:24] thank you I'll create a new key to sign the package [19:27:40] the code is for doing this: https://gerrit.wikimedia.org/r/#/q/owner:Maintenance-bot [19:27:48] Coren: OK, this is making sense. If service groups have credentials, then swift containers can be ACL'd by service group which makes a lot more sense than by per user. [19:28:05] Coren: I was thinking that it would be per-project rather than either -user or -servicegroup. But projects don't exactly have auth either. [19:28:29] andrewbogott: And it's trivial for a project to create a service group to auth with if they need it. [19:28:35] yep. [19:29:21] Ok, so it seems like teaching keystone about service groups is step one. I will think about that… if I can coax keystone to support two ldap schema at once then it should be pretty easy. 
[19:29:38] schemas? Is 'schema' singular or plural? [19:29:42] https://gerrit.wikimedia.org/r/#/q/owner:Maintenance-bot+status:merged,n,z [19:30:01] the only bad thing is that I can run it for test like everyday [19:30:06] andrewbogott: I'm pretty sure Ryan could help. Isn't he the one who wrote the LDAP backend for keystone to begin with? [19:30:18] (It's programmed to be done twice a week, in crontab) [19:30:49] Amir1: It's going to take a bit before I have the time to take a deep look at your logs since you don't have more details; please stand by. [19:31:03] Coren: I don't think he wrote it originally but he's certainly worked on it a fair bit. [19:31:15] okay Coren [19:31:17] :) [19:32:17] andrewbogott: plural 'shema' is 'shemata' [19:32:40] shema would be plural if the singular was shemum. :-) [19:32:51] yikes, ok. [19:33:06] Pretty sure I've never heard anyone actually use the word 'schemata' :) [19:33:42] I have, but I'm pedant about using the proper plurals of words with greek and latin origin. :-) [19:34:04] (BTW, same greek ending as 'stigma -> stigmata') [19:34:08] The internet says that 'schemas' is also acceptable. [19:34:28] Hm, when I hear 'stigmas' vs 'stigmata' I think of very different things :) [19:35:01] andrewbogott: Sure, if you like sounding like a 'murican. :-) [19:35:38] * Coren takes a peek at Amir1's logs. [19:36:30] Amir1: Your log had the error very visible: [19:36:35] git: 'review' is not a git command. See 'git --help'. [19:36:59] You need git-review installed on the exec nodes (it's only on bastions atm as it is nominally a dev tool) [19:38:13] Amir1: Open a bugzilla for it; if I don't get to it today I'm sure someone else will since it's a simple patch. [19:38:27] Amir1: Or submit a patch for it. :-) [19:40:57] Or deconstruct git review? It's just a rebase and a push, right? [19:41:12] Hm, maybe there's more to it than that. IDs and such [19:43:30] Coren: I think I did it already [19:48:23] YuviPanda: I went through the process of requesting access for my account months ago, and jeremyb created a project for me on Tool Labs called "nara". [19:48:40] For U.S. National Archives projects. [19:48:41] ah, right, so he created a tool [19:48:56] Dmcdevit: have you read https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help [19:48:56] valhallasw`cloud: How do you handle your review for https://tools.wmflabs.org/gerrit-patch-uploader/ ? (scroll up for Amir1's problem) [19:48:57] But I'v never actually done anything beyond that yet. [19:49:44] multichill: thank you, I am sure it worked in jsub several times before [19:49:59] maybe some servers have it and some don't [19:50:16] multichill: err [19:50:43] I generate an uuid in python [19:50:53] YuviPanda: I've looked at it, but it's written for someone with more basic knowledge than me. I am stumbling simply at the part about creating a SSH key and logging in with it. [19:51:16] no, wait [19:51:23] the commit message hook is seperate from git-review [19:51:32] nscp -p gerrit:hooks/commit-msg .git/hooks/ [19:51:35] scp -p gerrit:hooks/commit-msg .git/hooks/ [19:51:42] installs the commit hook in the repository [19:51:56] then you can just git commit && git push origin HEAD:refs/for/master [19:52:05] Amir1: ^ [19:52:07] Dmcdevit: hmm, does this section: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Generating_and_uploading_an_SSH_key not help? 
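Collecting valhallasw`cloud's git-review-free workaround from the scrollback into one sequence (this assumes "gerrit" is an ssh alias for the Gerrit server, as in the original lines):

    scp -p gerrit:hooks/commit-msg .git/hooks/   # hook that appends Change-Id lines to commits
    git commit
    git push origin HEAD:refs/for/master         # push for review without needing git-review installed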
[19:52:24] * valhallasw`cloud is gone again [19:53:45] thank you [19:55:19] YuviPanda: This is what I'm getting when I try following that: http://pastebin.com/5vdwY2VK [19:55:48] Dmcdevit: can you remove the file at /Users/Dominic/.ssh/known_hosts:4 and try again? [19:59:42] YuviPanda: Heh, well now I am getting a connection refused error, which I think means progress, because that looks like the firewall on this wifi network. [19:59:51] Dmcdevit: ah :) [20:00:05] can you run 'ssh -v tools-login.wmflabs.org' and paste the output? [20:00:53] ssh -vvv :P [20:01:39] 3Wikimedia Labs / 3deployment-prep (beta): beta labs getting 503 Service unavailable or slow - 10https://bugzilla.wikimedia.org/68407#c3 (10Ori Livneh) 5NEW>3RESO/FIX Resolved by https://gerrit.wikimedia.org/r/#/c/148743/ and https://gerrit.wikimedia.org/r/#/c/148754/ . (That isn't to say that we've reso... [20:02:30] YuviPanda: http://pastebin.com/yPRJ3nTa [20:06:24] Coren: scfc_de did the hostkey change? [20:06:26] I'm getting a warning [20:06:47] What host? [20:08:50] Coren: tools-login [20:09:16] No change. [20:09:19] YuviPanda: If that's tools-login.wmflabs.org, sure, during the move to eqiad, eh, four months ago?! :-) [20:09:20] Check vs [20:09:22] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/tools-login.wmflabs.org [20:10:37] scfc_de: heh, I must've been using jus tmosh [20:11:07] Coren: scfc_de doesn't match, I'm getting 6f:7d:bd:08:d1:02:b6:0e:94:d0:4e:20:d0:18:f8:7a. [20:11:21] Then you [20:11:40] hmm? [20:11:45] you're not talking to the box you think you are talking to. Are you going through bastion? [20:12:29] Coren: I'm an idiot [20:12:36] scfc_de: sshing to tools.wmflabs.org [20:12:57] Dmcdevit: yeah, that does look like your firewall [20:13:41] I think it's blocking SSH. [20:14:05] Well, clearly it's intercepting it. Which is evil. [20:14:12] Yeay fingerprint checking! [20:15:00] Well, it also blocks ports for things like IMAP and IRC (that's why I'm on webchat), so it doesn't surprise me. [20:15:22] Might have to head back home before going deeper. :-) [20:18:25] Hoi Coren ... did the installation of the new hardward proceed well ? [20:18:48] Dmcdevit: :) [20:19:51] GerardM-: At the "wiping disks" stage. Lemme ask Chris how that's going. [20:20:36] He's not online; but I think he was in the DC today. We should get news soon. [20:23:26] :) [20:25:38] 3Wikimedia Labs / 3tools: Some issues: tools-webgrid-03/04, tools-login - 10https://bugzilla.wikimedia.org/67329 (10metatron) 5UNCO>3RESO/FIX [20:29:23] Coren: How about my first gerrit patch? scfc_de has kindy done some polish. thx [20:32:51] 3Wikimedia Labs / 3tools: tools-webproxy's Puppet status is "stale" - 10https://bugzilla.wikimedia.org/63436 (10Tim Landscheidt) 5NEW>3RESO/FIX [20:35:26] Coren: I find it https://bugzilla.wikimedia.org/show_bug.cgi?id=62871 [20:36:31] Amir1: That installed git-review to dev_environ, not exec_environ. [20:38:23] Coren: Yes, I realized so but I'm thinking how we can install it in exec_enviroment [20:38:48] Amir1: Like I said, open a bugzilla requesting it. Either I'll get to it soon or someone else will - it's a simple change. 
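Returning to the host-key thread above: rather than deleting the whole known_hosts file, a single stale entry can be dropped and the freshly offered fingerprint compared against the wiki page linked at 20:09:22:

    ssh-keygen -R tools-login.wmflabs.org   # forget the cached key for this one host
    ssh tools-login.wmflabs.org             # the client now prompts with the new fingerprint to verify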
[20:39:04] specially since the patch is in operations/puppet which doesn't have very much commits: http://git.wikimedia.org/summary/operations%2Fpuppet.git [20:39:19] Coren: Sure, I reopen it [20:41:50] 3Wikimedia Labs / 3tools: Install git review on tools - 10https://bugzilla.wikimedia.org/62871#c3 (10Amir Ladsgroup) 5RESO/FIX>3REOP I need gerrit review in exec_environment (in order to be able to use it while using jsub) [20:42:16] hedonil: If you amend the commit message with scfc_de's suggestions, I'll +2 and merge it. [20:42:29] (I.e.: have your commit message be more explicit about what the change does) [20:42:38] You can simply 'git review --amend' [20:43:02] * hedonil loves git ! and tries [20:49:13] !log deployment-prep Changed config to run lua via external executable to avoid hhvm crashing bug [20:49:15] Logged the message, Master [20:54:23] 3Wikimedia Labs / 3tools: tools-shadow's Puppet status is "failed" - 10https://bugzilla.wikimedia.org/63437 (10Tim Landscheidt) 5NEW>3RESO/FIX [20:56:23] 3Wikimedia Labs / 3deployment-prep (beta): HHVM: crashes with "boost::program_options::invalid_option_value" exception - 10https://bugzilla.wikimedia.org/68413#c5 (10Bryan Davis) 5PATC>3NEW Patch was just a hack to switch to luastandalone mode until this crash can be examined/patched in hhvm/luasandbox. [21:11:00] Coren: Wow. the gui is my friend. [21:24:29] hedonil: Jenkins is lagging behind (booo!); but one it +2V I'll merge. [21:24:47] yeah [21:36:22] 3Wikimedia Labs / 3deployment-prep (beta): beta labs mysteriously goes read-only overnight - 10https://bugzilla.wikimedia.org/65486#c10 (10Chris McMahon) Adding Sean Pringle. This seems to be getting worse. I'd like to either update the db less often or else make it less disruptive. [21:36:50] 3Wikimedia Labs / 3deployment-prep (beta): beta labs mysteriously goes read-only overnight - 10https://bugzilla.wikimedia.org/65486#c11 (10Chris McMahon) also see https://github.com/wikimedia/operations-mediawiki-config/commit/38990c671fd3b8d15f31a7c819e7bdd52ecef3ef [21:58:31] Woo. \o/\o/ [21:59:21] hedonil: congrats [21:59:31] hedonil: New web services should use your new config. [22:00:24] YuviPanda: Coren: Yeah, let's do some webservice start [22:00:30] :D [22:18:49] Coren: Do you have any objections raising the webgrid job limit to 7 GB in webservice script? [22:20:57] mainly to allow users to configure more fcgi-worker, if needed [22:42:28] hedonil: I'd rather make the script accept an option; I definitely don't want it above 4G by default -- most people who run into the limit are just being careless with memory (like snarfing huge SQL results blindly) [22:43:01] Coren: option would also be fine [22:43:05] even better [22:54:02] Coren: considering a certain amount of control and to keep thing simple with bigbrother I'd suggest to refurbish variables from /data/project/.system/config [22:54:18] Coren: this time GB instead of workers [23:07:52] hedonil: This removes the flexibility of tool maintainers picking a memory requirement themselves. [23:14:54] scfc_de: if you are ok on admin-side with an additional parameter to webservice script, I'd be the last to protest. Just ensure that on restart bigbrother will take care of this parameter [23:30:51] !log deployment-prep Running `find . -type d -exec chmod 777 {} +` in /data/project/upload7 to finx shared image dir permisisons [23:30:54] Logged the message, Master [23:31:03] finx. 
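One detail from the exchange above: amending a Gerrit change is normally a two-step sequence, and the single "git review --amend" quoted there appears to be shorthand for:

    git commit --amend   # rewrite the current commit (message and/or content)
    git review           # upload the amended commit as a new patch set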
nice [23:33:57] scfc_de: To be fair, I'd rather tool maintainers have to ask (and show cause) to get a bump in the footprint of webservers. [23:34:17] scfc_de: Remember that the webgrid is overcommitted, unlike the general nodes. [23:35:15] And, well, much of the code that has real issues with the 4G vmem limit is very poorly written. [23:36:00] Yes, I have seen "SELECT * from revision where something_that_returns_thousands_of_rows" with the entire result set being snarfed in ram; on every web request. [23:36:41] If someone comes to me with "Look, I'm doing image manipulation on xxx and need foo G of ram; and I rate-limit XYZ" then I'd be glad to bump the limit up. [23:58:55] Coren: +1 to get rid of bad code :-). But at the moment, hedonil & Co. have to run their own webservice scripts, and it would be nice if we could unify that for redundancy & consistency. Re limits, I'd like to do them in SGE, i.e. for example have a restriction group "webservices-7G" whose owners can slurp up 7G in the webgrid queue, while others get only 4G.
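scfc_de's closing idea could be sketched as a gridengine resource quota set; every name below is hypothetical (the RQS name, the user list, the queue name), and this illustrates the mechanism rather than a tested configuration:

    # Members of @webservices-7G may request up to 7G of h_vmem on the
    # webgrid queue; everyone else stays at the 4G default.
    qconf -Arqs /dev/stdin <<'EOF'
    {
       name     webgrid-vmem
       enabled  TRUE
       limit    users {@webservices-7G} queues webgrid-lighttpd to h_vmem=7G
       limit    users {*} queues webgrid-lighttpd to h_vmem=4G
    }
    EOF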