[02:41:09] andrewbogott: woot! http://tools-proxy-test.wmflabs.org/
[02:41:16] andrewbogott: works without custom package on trusty!
[02:41:24] andrewbogott: well, I think at least it isn't the custom package. let me verify
[02:41:53] andrewbogott: woo! Indeed, it works
[02:49:30] YuviPanda|zzz: BTW, is SPDY enabled ATM? (Cf. https://bugzilla.wikimedia.org/65134.)
[02:49:46] scfc_de: yeah, but not spdy/3. And yes, that should be re-opened because of the downgrade
[02:50:09] scfc_de: can you re-open it? it's 4 AM and I should crash soon
[02:50:57] YuviPanda|zzz: Sure. (I thought you were already up again :-).)
[02:51:10] scfc_de: hehe, not yet.
[02:51:20] scfc_de: we can get rid of the hand-rolled nginx server soon!
[02:51:38] scfc_de: am wondering if we should bring up two proxy boxes when this comes up and use DNS load balancing
[02:52:58] YuviPanda|zzz: Well, I still think KISS rules -- I'd prefer to keep it simple if there's not a compelling reason :-).
[02:53:08] scfc_de: hmm, that is valid as well
[02:53:15] scfc_de: I would like to have way more monitoring on that one though
[02:53:52] scfc_de: we also need a way to load test these things that doesn't depend on other things' performance
[02:54:32] Wikimedia Labs / tools: Support SPDY/3 on the proxy - https://bugzilla.wikimedia.org/65134#c3 (Tim Landscheidt) RESO/FIX>REOP a:Marc A. Pelletier>Yuvi Panda (As confirmed by Yuvi on IRC.)
[02:56:54] YuviPanda|zzz: +1 for monitoring, of course. But the problem so far has been whether to a) make operations/puppet-style monitoring work on Labs (security issues) or b) set up a separate system that feeds Icinga with information on what to monitor. I haven't looked into this deeper. What do you mean by load test?
[02:57:19] scfc_de: connections, avg. resp time, etc
[02:57:24] scfc_de: ok, I have to go now :( will talk more later
[02:58:41] YuviPanda|zzz: Good night!
[07:09:30] hello
[10:11:28] Hello Coren and andrewbogott_afk! I would like to have another IRC office hour before "the big date". What about Wednesday June 11th at 5 p.m. UTC?
[10:27:41] Silke_WMDE the big date is Wikimania ?
[10:28:38] hehe GerardM- no - the toolserver shutdown at the end of this month.
[10:28:51] ah THAT makes sense
[10:37:51] Silke_WMDE: i have two bugs that i'd like to be fixed before ts shutdown: https://bugzilla.wikimedia.org/show_bug.cgi?id=62387 and https://bugzilla.wikimedia.org/show_bug.cgi?id=56995
[10:38:30] i would package it myself but i have absolutely no idea how
[11:20:44] gifti: OK, I see.
[11:28:31] Wikimedia Labs / tools: build and install tcl fcgi - https://bugzilla.wikimedia.org/56995#c15 (Silke Meyer (WMDE)) p:Normal>High s:enhanc>major This needs to be done before TS shutdown. What are the steps that need to be done / tested exactly?
[11:36:47] Wikimedia Labs / tools: Missing Toolserver features in Tools (tracking) - https://bugzilla.wikimedia.org/58791 (Silke Meyer (WMDE))
[11:36:47] Wikimedia Labs / tools: build and install tcl fcgi - https://bugzilla.wikimedia.org/56995 (Silke Meyer (WMDE))
[11:38:16] Wikimedia Labs / tools: Missing Toolserver features in Tools (tracking) - https://bugzilla.wikimedia.org/58791 (Silke Meyer (WMDE))
[11:38:16] Wikimedia Labs / tools: Update tcl-trf to version 2.1.4-dfsg-3 - https://bugzilla.wikimedia.org/62387#c1 (Silke Meyer (WMDE)) p:Unprio>High s:normal>major This is needed before TS is shut down. Is the requirement clear? Any comments?
[15:20:42] Silke_WMDE: is https://bugzilla.wikimedia.org/show_bug.cgi?id=56995#c15 directed at me?
[15:21:27] Well... To everyone who can explain details.
[15:21:39] i think the details are already there
[15:21:50] at least which packages are needed
[15:21:54] hey Silke_WMDE - if you have a moment, come into #mediawiki so I can introduce you to my intern?
[15:23:38] gifti OK. So the question is "just" how to find someone who can build that source deb for Coren
[15:24:30] or someone who can explain to me what I have to do for packaging here
[15:24:31] sumanah: I'm in the middle of something
[15:24:39] oh ok
[15:24:42] sorry
[15:24:46] Silke_WMDE: it's ok
[15:24:48] short intro:
[15:24:55] Frances Hocutt, fhocutt, is my intern this season, working on https://www.mediawiki.org/wiki/Evaluating_and_Improving_MediaWiki_web_API_client_libraries . She's going to be evaluating, documenting, and improving several MediaWiki web API client libraries in 5 languages! So tool-makers and bot-runners will be benefiting from her work if they use those libraries
[15:25:01] Silke_WMDE: just wanted you to know that
[15:25:17] cool!
[15:25:29] thanks
[15:26:26] cool. See you :)
[15:41:10] Silke_WMDE: It's mostly just a matter of time. I may well be able to scrounge up the time to do so, but it's hard to estimate exactly when.
[15:41:35] Coren: Before the end of the month! ;)
[15:42:56] That's when it'll be needed, not necessarily when I'll have time enough to do it. That said, all I'm doing for the next two weeks is working on bugs like this one except for one deploy this week, so I'll have more time to dedicate to this than I usually do.
[15:47:49] bd808: If I change the l10nupdate gid/uid in labs, can you do the janitorial stuff in beta to get file ownership fixed?
[15:48:25] andrewbogott: Probably. It should be a pretty simple salt command no?
[15:48:49] um, yes, if the beta VMs are set up properly with salt
[15:49:12] andrewbogott: They are! We have our own salt master so we can run trebuchet
[15:49:20] yeah, should work then
[15:49:35] hashar, any objection to me changing that right now? It might cause some beta hiccups.
[15:53:35] Coren: Sounds good!
[15:59:46] bd808: ok, sent an email to you and hashar about coordination.
[15:59:59] Changing the ID is trivial, I just don't want to do it until y'all are ready.
[16:00:14] bd808: Can we sit on that gerrit patch in the meantime?
[16:00:45] Yeah. Feel free to -1 and give the reason
[16:01:30] andrewbogott: as bd808 said, we can fix them with salt I guess :)
[16:01:52] hashar: Are you alert enough that we can just do this right now?
[16:01:55] andrewbogott: we might have l10nupdate mwdeploy files on NFS server as well ( /data/project )
[16:02:04] we are both in a conf call, then I am out for dinner
[16:02:13] but blindly trust Bryan :]
[16:02:18] hashar: Yes, a system-wide find will catch those as well...
[16:02:37] ok, so how about tomorrow 10AM PST?
[16:03:03] andrewbogott: That should work for me.
[16:05:25] andrewbogott: I'm guessing/hoping you still have the find command needed somewhere from the other uid/gid changes you have been doing?
[16:05:41] sure, although I haven't done gids I don't think...
[16:05:48] I'm sure I can write one if needed, but debugged versions are helpful :)
[16:06:43] Switch from s/-uid/-gid/ and s/chown/chgrp/ is probably the fix for that I'd guess
[16:07:26] bd808: I pasted my command into the bug
[16:07:31] Wikimedia Labs / Infrastructure: l10nupdate gid should be 10002 to match production/Puppet - https://bugzilla.wikimedia.org/65588#c1 (Andrew Bogott) I'm going to change this tomorrow. After that some file ownership will need to change... via running something like this on the beta salt master: $ salt...
[16:07:32] thanks
[16:15:29] hey andrewbogott!
[16:15:38] andrewbogott: reworked the patch to use ubuntu mongo, https://gerrit.wikimedia.org/r/#/c/135442/
[16:15:46] YuviPanda: ok, I'll look...
[16:15:50] Also, you got a Trusty proxy working?
[16:16:34] andrewbogott: YES!
[16:16:41] andrewbogott: http://tools-proxy-test.wmflabs.org/ is running stock trusty packages
[16:16:57] YuviPanda: were you ever able to set up stress tests that replicated our 403 problems with the official 1.6 package?
[16:17:07] andrewbogott: no, haven't had the time to do those yet, no.
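A note on the gid cleanup discussed at 16:05-16:07: the actual find/salt command was pasted into the bug and is truncated in this log, so it is not reproduced here. What follows is only a minimal Python sketch of the same idea (find everything owned by the old group, then change its group). The function names are hypothetical, and the old gid is not stated in the log, so it stays a parameter. Only the target gid 10002 comes from the bug title.

```python
import os
from typing import Iterator, List


def paths_owned_by_gid(root: str, gid: int) -> Iterator[str]:
    """Yield root and every path below it whose group id equals gid."""
    if os.lstat(root).st_gid == gid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects a symlink itself rather than its target.
            if os.lstat(path).st_gid == gid:
                yield path


def regroup(root: str, old_gid: int, new_gid: int, dry_run: bool = True) -> List[str]:
    """Change group old_gid -> new_gid under root; return the matched paths.

    With dry_run=True (the default) nothing is modified.
    """
    matches = list(paths_owned_by_gid(root, old_gid))
    if not dry_run:
        for path in matches:
            # uid -1 leaves the owner untouched, i.e. chgrp rather than chown.
            os.chown(path, -1, new_gid, follow_symlinks=False)
    return matches
```

A dry run such as `regroup("/data/project", old_gid, 10002)` would list the affected paths first; passing `dry_run=False` needs sufficient privileges to change group ownership.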
[16:17:19] andrewbogott: this does use a *lower* version than what we are using tho (1.4.6 vs us using 1.5.0)
[16:17:30] hm, ok.
[16:17:51] I think we shouldn't rock the boat until we have a better understanding of what went wrong before.
[16:18:01] But, switching over to an all-upstream Trusty proxy will be great.
[16:18:28] andrewbogott: yeah, agreed. not sure when we'd want to do that either, though
[16:18:54] andrewbogott: we'd also need to schedule downtime for a few mins (5-10) while we switch the proxy over to a new machine
[16:19:27] yeah
[16:19:54] YuviPanda: the mongo class is meant to be applied on its own node? Not on all exec nodes?
[16:20:06] andrewbogott: yes, own single machine
[16:20:16] andrewbogott: it's on tools-mongo-test right now, actually. self-hosted puppet
[16:20:47] YuviPanda: is mongo something that will ultimately need to be on a bare-metal host? Or is performance OK on a labs box?
[16:21:01] I worry about the durability of that database, if it's just in local instance storage...
[16:21:58] andrewbogott: performance is ok on a labs box. Usage would have to skyrocket *massively* before it needs its own box
[17:00:03] 'k
[17:00:03] andrewbogott: I spoke to people who have done biggish mongo installations on EC2, and their perf was ok. I guess our storage shouldn't be much slower than that
[17:00:04] Coren: any maintenance going on?
[17:00:04] things are pretty down
[17:00:04] hedonil: Yeah, nothing that _should_ have affected anything so of course it did. I'm about to rollback.
[17:00:04] Hmm. Labs isn't loading. I'm in the middle of ACC work.
[17:00:04] Coren: well it's website as well as tools-login
[17:00:05] hedonil, everything is down.
[17:00:05] Cyberpower678: hi!, yeah, we're doomed!
[17:00:05] Imagine if OTRS was on labs. :p
[17:00:05] it would be more reliable? ;)
[17:00:05] Based on now? Nope.
[17:00:05] !newlabs
[17:00:05] !newtoollabs
[17:00:05] WTF?
[17:00:05] Cyberpower678: At worst, this'll be a few minutes' outage.
[17:00:06] Network issue; paravoid is on it. Hang tight people.
[17:00:06] Heh, my local network went down right when labs did
[17:00:06] andrewbogott: I'm pretty sure that's not Labs' fault. :-)
[17:00:06] Yeah, I'm going to call it coincidence until it happens a second time.
[17:00:06] quantum tunneling, that's what it is
[17:00:06] andrewbogott: thanks for the merge!
[17:00:06] Is Labs happy again? (I still can't access an instance but that could easily be on my end)
[17:00:06] andrewbogott no.
[17:00:06] :(
[17:00:06] * andrewbogott stands back and shuts up
[17:00:06] andrewbogott: No. Paravoid is on it atm, will roll back shortly if he can't figure it out soon.
[17:00:06] Coren: just curious - what was the upgrade like?
[17:00:06] hedonil: Just turning on a bound port; but the switch seems to not have wanted it.
[17:00:06] Coren: bound port = link aggregation?
[17:00:06] hedonil: Yes, though all the copper was already in place so this shouldn't have been disruptive.
[17:00:06] * Cyberpower678 makes it happen a second time.
[17:00:06] FLOOD
[17:00:06] I'm going to lunch and taking my scooter to the shop… back later.
[17:00:07] * hedonil sings LACP and Spanning tree, wheee, whee, whee ...
[17:00:19] Network to server is back up; things should return to normal.
[17:01:03] Sorry about this people; this _was_ supposed to be a brief config switch with no visible effect. And then entropy decided that was too easy.
[17:09:00] Wikimedia Labs / (other): Database dewiki_p on dewiki.labsdb : views hashs and links broken - https://bugzilla.wikimedia.org/55708#c5 (Marc A. Pelletier) Hm, maintain-replicas doesn't actually /remove/ views it no longer cares about. I'm whipping up a script to clean up now.
[17:36:59] Wikimedia Labs / (other): Database dewiki_p on dewiki.labsdb : views hashs and links broken - https://bugzilla.wikimedia.org/55708#c6 (Marc A. Pelletier) Both tables have been cleaned up.
[17:37:13] Wikimedia Labs / (other): Database dewiki_p on dewiki.labsdb : views hashs and links broken - https://bugzilla.wikimedia.org/55708 (Marc A. Pelletier) ASSI>RESO/FIX
[17:45:31] Coren: what happened that caused the outage?
[17:46:57] Betacommand: It's not clearly determined yet; we aborted the investigation to rollback and restore service. As far as we can tell, there is something subtly wrong with the switch configuration of LACP.
[17:50:41] !log tools Brief network outage. source: It's not clearly determined yet; we aborted the investigation to rollback and restore service. As far as we can tell, there is something subtly wrong with the switch configuration of LACP.
[17:50:43] Logged the message, Master
[17:51:12] Maybe the links don't feel like being aggregated
[17:58:56] Coren: ping when you have a bit of time, need to PM (not-critical)
[17:59:18] YuviPanda: Go ahead, worst that can happen is lag in my responses. :-)
[17:59:27] Coren: :D
[18:12:12] andrewbogott_afk: thoughts on moving the general labs proxy to trusty first?
[18:16:29] jeremyb_: +1 :)
[18:22:53] Coren: I'm going to create tools-mongo soon, and then have a patch that creates a mongo.creds.json file for every tool.
[18:23:05] kk
[18:27:51] jeremyb_: subscribe to that bug? you realize it was me who created it?
[18:28:10] so clearly I do support idea of ipv6 in labs
[18:42:14] Wikimedia Labs / deployment-prep (beta): beta labs mysteriously goes read-only overnight - https://bugzilla.wikimedia.org/65486#c7 (ryasmeen) Created attachment 15558 --> https://bugzilla.wikimedia.org/attachment.cgi?id=15558&action=edit Screenshot I have reproduced this issue today on Betalabs, att...
[18:43:00] Coren: https://wikitech.wikimedia.org/wiki/Special:NovaInstance mentions puppet status as 'failed' for all the nodes?
[18:45:12] YuviPanda: the logic to decide that is somewhat dumb; there's currently a circular dependency in some packages that causes a (harmless) error, but the script doesn't know that
[18:45:24] Coren: heh, right. I remember reading that bug
[18:52:39] Coren: andrewbogott_afk mounting NFS via the role seems to fail on trusty
[18:52:43] err: /Stage[main]/Role::Labs::Instance/Mount[/home]: Could not evaluate: Execution of '/bin/mount -o rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc /home' returned 32: mount.nfs: mounting labstore.svc.eqiad.wmnet:/project/tools/home failed, reason given by server: No such file or directory
[18:53:31] YuviPanda: You might simply have beaten the NFS server to the punch
[18:53:46] Coren: hmm, this happened with my proxy test trusty image as well
[18:54:11] Hm. I can think of no reason why Trusty would behave differently, but there is a delay before a new instance gets added to the ACLs.
[18:54:24] Coren: hmm, let me see if it still works
[18:54:43] Once puppet ran, you can simply try with 'mount /home'
[18:55:10] yeah, puppet is still running
[18:55:21] err: /Stage[main]/Role::Labs::Instance/Mount[/data/project]: Could not evaluate: Execution of '/bin/mount -o rw,vers=4,bg,hard,intr,sec=sys,proto=tcp,port=0,noatime,nofsc /data/project' returned 32: mount.nfs: mounting labstore.svc.eqiad.wmnet:/project/tools/project failed, reason given by server: No such file or directory again
[18:55:57] Yeah, that's normal - if one has failed the other almost certainly will. Either you've been ACL'ed or you haven't. What is the instance IP?
[18:56:19] Coren: nope, mount.nfs: mounting labstore.svc.eqiad.wmnet:/project/tools/home failed, reason given by server: No such file or directory
[18:56:31] Coren: 10.68.17.143
[18:56:38] Coren: previous one was 10.68.17.141
[18:56:53] Huh. Neither are in the ACLs.
[18:57:16] Coren: might be a trusty issue
[18:57:29] Ah, manage-nfs-volumes-daemon died on an LDAP outage.
[18:57:38] oh
[18:57:43] so it's all new instances? :)
[18:57:54] * Coren restarts it.
[18:57:57] It'd be.
[18:58:05] Chances are, it should already work now.
[18:58:50] hmm, mount /home still fails
[18:58:53] should I run puppet again?
[18:59:37] YuviPanda: No, that's not going to help -- but there may be some negative caching involved.
[19:00:01] Coren: hmm, right. Does it have a defined TTL I can just wait out?
[19:00:05] I see both your instances in the ACLs.
[19:00:13] Coren: let me reboot
[19:00:24] Coren: rebooting
[19:10:09] Coren, got a minute? I have a new work laptop and new ssh keys. Right now I can do "ssh -A cmcmahon@bastion.wmflabs.org" OK, but when I try to ssh from there to e.g. deployment-bastion host I'm getting rejected because of ssh key.
[19:10:50] chrismcmahon: You have to either set a ProxyCommand or forward your key for that to work.
[19:11:19] Coren "ssh -A" forwards the key, yes?
[19:12:04] chrismcmahon: Iff key forwarding is allowed by the bastion (which, IIRC, it is).
[19:12:30] chrismcmahon: But -A will only forward your key if your /agent/ has it, not if it was just presented to the login host.
[19:12:35] (I.e.: you have to ssh-add it first)
[19:12:46] Coren: aha, thanks for the clue, I think I know what I did wrong
[19:17:34] Coren: nfs seems to work now. ty
[19:27:43] Wikimedia Labs / deployment-prep (beta): beta labs mysteriously goes read-only overnight - https://bugzilla.wikimedia.org/65486#c8 (Chris McMahon) Antoine, would these messages be relevant? They do not seem to happen at any particular interval but they might be correlated to the time at which Rummana...
[19:37:22] coren, petan: cron not working on Tools? I tried 'crontab -l' and nothing happened for minutes, until I interrupted it
[19:37:54] russblau: Lemme look into it.
[19:38:25] russblau: Looks like the submit host is ill.
[19:38:27] * Coren beats it up.
[19:44:13] don't hurt it.... :-)
[19:45:17] russblau: It's baaa-ack.
[19:45:41] thank you!
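A note on the agent-forwarding exchange at 19:10-19:12: as Coren says, `ssh -A` can only forward keys that the local agent actually holds. The sketch below is one way to check that before connecting, assuming the OpenSSH `ssh-add` client is installed locally. The function names are hypothetical. The interpretation relies on the documented exit status of `ssh-add -l` (0 when keys are listed, 1 when the agent has none, 2 when no agent can be contacted).

```python
import subprocess


def explain_ssh_add_status(returncode: int) -> str:
    """Translate the exit status of `ssh-add -l` into a short hint."""
    if returncode == 0:
        return "agent holds at least one key; ssh -A can forward it"
    if returncode == 1:
        return "agent is running but holds no keys; run ssh-add first"
    if returncode == 2:
        return "no agent reachable; start ssh-agent, then run ssh-add"
    return "unexpected exit status %d from ssh-add" % returncode


def check_agent() -> str:
    """Run `ssh-add -l` on the local machine and explain the outcome."""
    try:
        result = subprocess.run(["ssh-add", "-l"], capture_output=True, text=True)
    except FileNotFoundError:
        return "ssh-add is not installed on this machine"
    return explain_ssh_add_status(result.returncode)


if __name__ == "__main__":
    print(check_agent())
```

The other route Coren mentions, a ProxyCommand, avoids forwarding altogether because authentication to the inner host then happens from the laptop itself.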
[19:46:31] But also, never fear: IT is the only domain where the cure for being ill is often a good beatup.
[19:46:35] :-)
[19:55:18] Coren: ping? the script that generates mysql usernames and passwords for tools. does it run on tools-db?
[19:55:53] No, on labstore1001 (which is pretty much the only box that has unfettered access to storage and the DB simultaneously for all projects)
[19:56:48] Coren: ah, hmm. I guess I could get my similar script for mongodb to run on tools-mongo, which should be ok?
[19:57:00] Coren: since it'll have access to mongodb and also to /data/project?
[19:57:12] Coren: and it's ok since this is tools only anyway
[19:57:14] YuviPanda: Right, and it doesn't need to manage credentials for non-tools projects.
[19:57:19] Coren: yup
[20:19:14] Wikimedia Labs / deployment-prep (beta): beta labs mysteriously goes read-only overnight - https://bugzilla.wikimedia.org/65486#c9 (Chris McMahon) I saw this just now also.
[21:11:25] petan: so what? i was telling e.g. Jan and whoever else might be lurking that there's a bug not just a thread
[21:12:42] * jeremyb_ runs away
[21:41:51] hi, is there any page listing the tools (not projects) in Tool Labs like https://meta.wikimedia.org/wiki/Toolserver/Projects ?
[21:43:43] danilo: http://tools.wmflabs.org/
[21:45:24] this is the projects list, i'm asking about list of web tools, some projects have more than one, some have no one
[21:46:38] oh, then I don't know.
[21:48:03] danilo: I fear there is no such list
[21:49:49] danilo: you can take a look here (and sort by eqiad webservice column) http://tools.wmflabs.org/tools-info/migration-status.php
[22:00:17] ok, I will create a page in meta, I have many tools in one project, I think it is important to have a list with the description of each web tool function, I have already created a tool not knowing it already has a similar one
[22:03:42] there's also a dewiki list
[22:06:37] thanks! https://de.wikipedia.org/wiki/Wikipedia:Technik/Labs/Tools I'll use this as base to metawiki page
[22:07:00] :)
[22:30:06] !log deployment-prep Deleted unused /data/project/apache/common-local on NFS share.
[22:30:08] Logged the message, Master
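A note on the per-tool credentials idea raised at 18:22 and 19:56: the log only says that a script on tools-mongo would create a mongo.creds.json file for every tool under /data/project. The sketch below illustrates that shape and nothing more. The JSON keys, the one-directory-per-tool layout, and the function name are all assumptions, and creating the matching MongoDB user or chowning the file to the tool's account is left out.

```python
import json
import os
import secrets
from typing import List


def write_mongo_creds(project_root: str, filename: str = "mongo.creds.json") -> List[str]:
    """Create a credentials file in each tool directory that lacks one.

    Returns the names of the tools for which a file was created.
    """
    created = []
    for tool in sorted(os.listdir(project_root)):
        tool_dir = os.path.join(project_root, tool)
        if not os.path.isdir(tool_dir):
            continue
        creds_path = os.path.join(tool_dir, filename)
        if os.path.exists(creds_path):
            continue  # never overwrite credentials that already exist
        # Hypothetical schema: the real file format is not given in the log.
        creds = {"user": tool, "password": secrets.token_urlsafe(24)}
        # O_EXCL refuses to clobber a file created in the meantime;
        # mode 0o400 makes the result readable by its owner only.
        fd = os.open(creds_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
        with os.fdopen(fd, "w") as handle:
            json.dump(creds, handle)
        created.append(tool)
    return created
```

Because existing files are skipped, running the function repeatedly (for example from cron) would only pick up newly created tools.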