[01:41:16] [bz] (NEW - created by: Liangent, priority: Normal - major) [Bug 54934] Wikimedia Labs database replication has seemingly stopped (s1 and s2?) - https://bugzilla.wikimedia.org/show_bug.cgi?id=54934
[02:40:58] @replag
[02:40:59] Replication lag is approximately 1.18:46:56.9735520
[03:20:56] @replag
[03:20:57] Replication lag is approximately 1.08:52:07.7722680
[03:21:09] @replag
[03:21:10] Replication lag is approximately 1.08:49:12.9079780
[05:07:55] @replag
[05:07:55] Replication lag is approximately 02:12:34.2429770
[05:28:01] @replag
[05:28:01] Replication lag is approximately 00:00:01.4605640
[10:05:10] !log deployment-prep applied iptables NAT rules on deployment-bastion {{bug|45868}}
[10:05:16] Logged the message, Master
[10:12:10] (PS1) Hashar: draft exposed as 'draft-1' instead of 'PD1' [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/88037
[10:18:54] !log bastion labstore3 shows up a system CPU plateau ( http://ganglia.wikimedia.org/latest/?r=hour&s=by+name&c=Labs%2520NFS%2520cluster%2520pmtpa ) seems NFS is dead now.
[10:19:06] :/
[10:19:14] *goes to ping ops people...
[10:19:19] gah, again?
[10:19:29] I can login to tools-dev and tools-login but cannot access the shell, what happened?
[10:19:29] my bot!!!!
[10:20:04] http://tools.wmflabs.org/audetools/coords is dead
[10:20:36] hashar buy the funeral^^
[10:21:35] labstore3 seems to have gone mad again :)
[10:22:00] it always happens when Coren is sleeping
[10:22:11] maybe he is awake soon
[10:22:13] o_O
[10:22:13] yup, has it been antoehr 2 weeks already?
[10:22:20] most probably
[10:22:22] ah, someone please press the reset button! :P
[10:22:31] hashar: Im sure I added it to my calander :P
[10:22:46] did we document the steps to fix it?
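The @replag bot above reports lag in a `days.hours:minutes:seconds` notation (e.g. `1.18:46:56.9735520` is one day and almost nineteen hours). As a rough illustration, a small parser for those strings; the format is inferred only from the samples above:

```python
import re

def parse_replag(lag: str) -> float:
    """Parse a '[d.]HH:MM:SS.fffffff' lag string into total seconds.

    Format assumed from the bot output in the log; the optional
    leading 'd.' is the number of whole days.
    """
    m = re.fullmatch(r'(?:(\d+)\.)?(\d+):(\d+):(\d+(?:\.\d+)?)', lag)
    if not m:
        raise ValueError(f"unrecognized lag format: {lag!r}")
    days = int(m.group(1) or 0)
    hours, minutes = int(m.group(2)), int(m.group(3))
    seconds = float(m.group(4))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds
```

With this, `1.18:46:56.9735520` comes out to roughly 154017 seconds, which matches the "lag of over a day" the bug report complains about.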
[10:22:51] didn't we
[10:23:18] we cant fix, only ops people can, I did tell whoever did it last time to document it as they did it, not sure if they did or not
[10:23:29] also I cant remember exactly who i poked last time
[10:23:30] yeah
[10:26:23] !log bastion Faidon rebooting labstore3 due to NFS lock up.
[10:26:44] the fix is to reboot it basically
[10:26:51] morebots: dead again aren't you ?
[10:27:15] and something needs to be started again
[10:30:02] is there any alternative server to login?
[10:31:09] ebraminio: no
[10:31:23] this would affect all login servers
[10:34:17] aude: will it fixed soon?
[10:34:37] I mean just interested to know an estimation about when it will be fixed
[10:34:50] Hi all. I have a problem wih my account. In pas i could connected via ssh. For an unknown reason i cant connect now. I tyoe ssh -A ...@tools-login.wmflabs.org but is not responding
[10:35:25] Seems someone must put an informing message on channel topic
[10:36:33] omg
[10:36:34] sorry
[10:36:47] ebraminio: i think ops are working on it
[10:43:45] ebraminio: panos_ aude nfs just came back up
[10:43:57] give it a min or two and everything should be roughly back to normal
[10:44:00] w:)
[10:44:09] addshore: great, thank you!
[10:44:30] great
[10:46:31] labs-morebots: ping
[10:46:31] I am a logbot running on tools-exec-06.
[10:46:31] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[10:46:31] To log a message, type !log .
[10:46:54] !log bastion labstore3 showed up a system CPU plateau starting at roughly 10:08am ( http://ganglia.wikimedia.org/latest/?r=hour&s=by+name&c=Labs%2520NFS%2520cluster%2520pmtpa ) seems NFS is dead now.
[10:46:56] Logged the message, Master
[10:47:09] !log bastion labstore3 Faidon rebooted labstore3 and restarted NFS. Issue solved.
[10:47:10] Logged the message, Master
[10:47:47] "solved"
[10:48:12] !log deployment-prep applied iptables rules for {{bug|45868}} on deployment-apache{32,33} and jobrunner08
[10:48:17] Logged the message, Master
[10:48:22] [bz] (RESOLVED - created by: Maarten Dammers, priority: Immediate - critical) [Bug 54847] Data leakage user table "new" databases like wikidatawiki_p and the wikivoyage databases - https://bugzilla.wikimedia.org/show_bug.cgi?id=54847
[11:44:22] [bz] (PATCH_TO_REVIEW - created by: Antoine "hashar" Musso, priority: Normal - enhancement) [Bug 36994] [OPS] Add disk I/O to ganglia reports - https://bugzilla.wikimedia.org/show_bug.cgi?id=36994
[12:49:13] Oh, you have GOT to be shitting me.
[12:49:23] Coren, ?
[12:49:41] And of course, I was all prepared to switch away from labstore3 to 4 last week but things got delayed because of the leak. :-9
[12:49:54] Coren, ?
[12:50:42] Ah, I see Faidon already fixed it.
[12:50:55] Coren, ?
[12:51:03] labstore3 got a headache again.
[12:51:06] Oh
[12:51:10] Earlier. Check backscroll.
[12:51:24] Nothing there
[12:51:45] Coren, how's Cyberbot coming?
[12:51:48] [06:21:35] labstore3 seems to have gone mad again :)
[12:51:48] [06:22:00] it always happens when Coren is sleeping
[12:51:48] [06:22:11] maybe he is awake soon
[12:51:48] [
[12:52:17] The node is slowing down.
[12:52:17] :>
[12:52:31] Well I gotta go.
[12:52:45] Coren: much quicker this time
[12:52:49] Coren: amusingly it is 2 weeks after the last one happened
[12:52:53] we're getting good at this
[12:53:03] but shouldn't happen at all, obviously
[12:53:22] addshore: I am, honestly, not amused. :-/
[12:54:05] indeed, but at least the downtime isnt as bad as it was the first time, only about 45 mins today
[12:54:26] CP678: You're still #2 on my todo list; switching away from labstore3 is my #1
[13:06:30] guys can you help me with hotcat?
[13:32:19] Pratyya: What's up?
[13:43:56] Coren: labstore3 went crazy at 10:05am UTC roughly. Not sure if you have been made aware of it
[13:44:13] Coren: Faidon rebooted the box and ran whatever script was needed to start NFS
[13:44:25] hashar: I have. The "fun" thing is that if it hand't been for the leak I would have already switched to labstore4. :-/
[13:44:36] the only thing Iknow is that I did a git pull of mediawiki/extensions.git around that time, but that is probably unrelated
[13:44:57] hashar: It almost certainly is. It had been two weeks since the last restart.
[13:45:47] yup
[13:46:13] * Coren has much hate for labstore3.
[13:48:28] je compatis :-D
[13:48:36] ^^^ I have no clue how to say that in english
[13:50:55] Coren: any idea how user home dirs are populated on NFS
[13:51:09] the user yurik has a home on gluster but none on the NFS server /home
[13:51:20] ( labnfs.pmtpa.wmnet:/deployment-prep/home )
[13:51:22] hashar: That's normally created by PAM on ligin.
[13:51:26] ahh
[13:51:31] login*
[13:51:31] so that is something else
[13:51:39] * Coren nods.
[13:52:40] En anglais, on dit généralement "I empathise" pour a-peu-près le même sentiment.
[13:52:56] hashar: yep
[13:53:01] Coren: je tacherais de m'en souvenir
[13:53:08] i'm connecting to ssh deployment-staging-cache-mobile01
[13:53:21] ssh api1 works
[13:53:32] yurik@bastion1
[13:53:35] yurikNomad: so apparently you are able to connect to bastion1 so you should be able to connect to deployment-staging-cache-mobile01.pmtpa.wmflabs
[13:53:39] but there is no homedir for you there
[13:53:50] coren confirmed that the homedir is created by pam on first login
[13:54:09] so its a pubkey issue
[13:54:36] I am removing you from the project and reading you
[13:54:40] can you check if api1 has the same set of keys for me?
[13:54:42] ok
[13:54:46] checking online
[13:55:53] yurikNomad: done. can you try again ?
[13:56:28] hashar: nope
[13:56:30] :(
[13:56:42] Coren, how's the Cyberbot exec node coming?
[13:56:49] It seems to be getting slower.
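The "ssh to the instance through bastion" flow being debugged above can be captured in an `~/.ssh/config` sketch. The hostnames and username here are assumptions based on the conversation (`yurik@bastion1`, instances under `.pmtpa.wmflabs`), and `ForwardAgent` mirrors the `ssh -A` seen earlier; this is not the documented Labs configuration:

```
# Sketch only: host names and username are assumptions from the log
Host bastion1
    HostName bastion.wmflabs.org
    User yurik
    ForwardAgent yes

# Reach *.pmtpa.wmflabs instances by tunnelling through the bastion
Host *.pmtpa.wmflabs
    User yurik
    ProxyCommand ssh -W %h:%p bastion1
```

With this in place, `ssh deployment-bastion.pmtpa.wmflabs` hops through the bastion automatically, and the instance's PAM session creates the `/home` directory on first login as Coren describes.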
[13:57:12] yurikNomad: can you try on deployment-bastion.pmtpa.wmflabs ?
[13:57:12] CP678: You're still #2 on my todo list; switching away from labstore3 is my #1
[13:57:28] Coren, ok. Thanks. :)
[13:57:33] hashar: you mean i should ssh to that first or through bastion ?
[13:57:40] yurikNomad: through bastion
[13:57:52] hashar: yei!
[13:59:03] yurikNomad: you have a /home now. Can you try again on deployment-staging-cache-mobile01.pmtpa.wmflabs ?
[13:59:18] that instance is probably mis configured
[13:59:29] hashar: from bastion?
[13:59:45] doesn't work from bastion
[14:01:27] My hotcat is in problem. My preference is fixed. The Hotcat button is checked. But still it isn't working. I've also cleared my browser's cache. Coren
[14:01:55] yurikNomad: I probably need to fix the instance so
[14:02:12] Pratyya: I'll need a bit more detail than that -- what is the mediawiki install you are talking about?
[14:03:20] Actually I don't understand your question clearly. What do you mean about mediawiki install Coren
[14:08:20] Well, you're having problem with a mediawiki extension; I'd need to know what wiki that extension was supposed to be running on if you need help. :-)
[14:08:35] yurikNomad: anyway regarding https://gerrit.wikimedia.org/r/#/c/87328/1/templates/varnish/mobile-frontend.inc.vcl.erb,unified
[14:08:44] yurikNomad: you could add an options to the template
[14:09:29] yurikNomad: mark has been using the hash cluster_options sometime
[14:09:45] yurikNomad: so potentially you could add to cluster_options a new key such as use_esi
[14:10:15] hashar: and use some erb-templating "if" magic
[14:10:18] https://gerrit.wikimedia.org/r/#/c/86258/3/templates/varnish/mobile-frontend.inc.vcl.erb
[14:10:20] yurikNomad: then your vcl conf will have something like: if cluster_options.fetch('use_esi', false)
[14:10:52] yurikNomad: look at bits.inc.vcl.erb , an example is the 'test_hostname' cluster option
[14:13:17] yep, i see it, will make a patch soon to enable ESI via a role
[14:13:56] will i need to alter cache.pp with $common_cluster_options = {
[14:13:58] 'test_hostname' => "test.wikipedia.org",
[14:13:59] 'enable_geoiplookup' => true,
[14:14:01] }
[14:14:14] or i will just suply it via CLI?
[14:14:15] coren it's 1.22wmf19
[14:15:13] Pratyya: The version help, but I need to know what wiki you are using. Is it a tool on Tool Labs or does it have its own project?
[14:16:01] OH!. it is en wiki Coren
[14:16:31] hashar: so what are our steps now? I will at some point do a patch for option-based ESI config
[14:16:36] I can't fix up deployment-staging-cache-mobile01
[14:16:41] so going to create a new one
[14:17:25] oki, and i guess i will be able to simply edit vcl puppet file on it and push it, and later reset it
[14:17:43] yeah that is the idea
[14:17:43] since the only change is really uncommenting things
[14:17:47] fetch from gerrit to the instance
[14:17:48] run puppet
[14:17:51] do your tests
[14:18:16] on deployment-staging-cache-mobile01 puppet can't find the private modules for some reasons
[14:18:58] puppet will push it to the whole beta cluster i take it. So then i will just have to look at apache logs to see why we get 503
[14:19:14] varnish logs that is
[14:26:35] Coren: you there?
[14:26:40] hashar: need to step away for a bit, ping me with the new instance, will try to get the puppet to run. Hope noone else will complain that beta will go down :)
[14:27:05] yurikNomad: deployment-staging-cache-mobile02.pmtpa.wmflabs
[14:27:11] yurikNomad: will get varnish and puppetself installed
[14:27:34] hashar: connected!!!
[14:27:47] will play with puppets in an hour or so
[14:27:48] Pratyya: Oh, sorry I didn't see your message. This channel is really about the Wikimedia Labs; I'm not sure I can help you with Hotcat on enwiki -- perhaps #wikipedia-en might be a better place to ask?
[14:27:55] hashar: thanks!!!
[14:45:24] hiya, qchris is trying to log into labs instances that he has had access to before
[14:45:32] this is in the analytics project
[14:45:43] he can access some nodes in the analytics fine
[14:46:19] in other nodes he gets Failed publickey for qchris
[14:46:36] "Permission denied (publickey)."
[15:21:24] [bz] (RESOLVED - created by: Maarten Dammers, priority: Immediate - critical) [Bug 54847] Data leakage user table "new" databases like wikidatawiki_p and the wikivoyage databases - https://bugzilla.wikimedia.org/show_bug.cgi?id=54847
[15:46:14] YuviPanda: I'm not sure if this is appropriate or necessary, but… could the API have a call that responds with the IP of the actual proxy gateway? We shouldn't take for granted that the API runs on the same box as the proxy.
[15:46:47] (This is based on my understanding that I'm responsible for setting up the DNS that points the proxy front-end name to the gateway.)
[16:36:48] nerus: hey!
[16:37:02] YuviPanda: Hi
[16:38:08] nerus: what subprocess calls?
[16:39:00] So
[16:39:37] YuviPanda: when I do a os.system in a system call in cgi handler method that works
[16:39:43] but when i do a subprocess call
[16:39:52] it just fails
[16:39:57] fails with what?
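The `cluster_options.fetch('use_esi', false)` pattern hashar quotes above might look roughly like this inside `mobile-frontend.inc.vcl.erb`. The VCL body inside the conditional is illustrative only (the actual ESI change in the patch under discussion may differ), and the exact variable scoping depends on how puppet hands the hash to the template:

```erb
<% if cluster_options.fetch('use_esi', false) -%>
# Illustrative only: enable ESI processing on fetched objects
sub vcl_fetch {
    set beresp.do_esi = true;
}
<% end -%>
```

The appeal of the pattern is that `fetch` supplies a default, so clusters that never define `use_esi` in `$common_cluster_options` render the template unchanged.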
[16:40:01] dies blankly :P
[16:40:05] and what exactly do you mean by 'subprocess call'?
[16:40:27] OS system call - os.system("/usr/bin/qstat -u '*' -r -xml")
[16:40:38] Subprocess call - xml = subprocess.check_output("/usr/bin/qstat -u '*' -r -xml", shell = True)
[16:40:53] the second gives me the output i can use to show further stuff
[16:41:38] ah, hmm
[16:41:52] nerus: no stacktrace?
[16:42:02] sadly, thats been happening from start
[16:42:19] even Coren's pro tip isn't working for me, but i have a feeling i didn't set it right
[16:42:41] hmm
[16:42:45] let me go through logs
[16:42:51] and see if I can find anything at all
[16:44:01] i am trying a couple of other stuff
[16:44:43] tried the sh module?
[16:44:52] I do that and it works well for me
[16:45:03] i am guessing this could be a process limit from apache but not really sure because i dunno whats the difference between and os and subprocess
[16:45:06] yeah tried Sh too
[16:45:09] that didn;'t work either
[16:45:19] @YuviPanda then i am sure i am doing something wrong
[16:45:29] could you point me to your code?
[16:45:45] nerus: moment
[16:51:10] nerus: hmm, lost in history :D let me dig it up
[16:51:29] YuviPanda: I have time!
[17:00:07] andrewbogott: hey!
[17:00:19] andrewbogott: well, I'd assume that the IP for the proxy would be a config variable for wikitech
[17:00:27] since it's going to be one IP
[17:00:49] andrewbogott: isn't that cleaner than making an API request every time?
[17:01:56] YuviPanda: It's simpler. If there's only ever one proxy gateway (as opposed to one per region, etc....)
[17:02:26] andrewbogott: still, I think, for now, makes sense to just keep it a config variable.
[17:02:32] You can add it to your feature wishlist -- I certainly don't need it in the short run.
[17:02:37] sure!
[17:07:43] YuviPanda: Do you have a log you can tail & tell me what's happening in the proxy API right now?
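The difference nerus asks about at 16:45 is concrete: `os.system()` only returns the command's exit status and lets the child write straight to the parent's stdout, while `subprocess.check_output()` captures stdout for the caller (which is why only the second form yields XML the script can use). A minimal sketch:

```python
import os
import subprocess

# os.system() returns the exit status; the command's output goes to
# the process's own stdout, so a CGI script cannot capture it.
status = os.system("echo hello > /dev/null")

# subprocess.check_output() captures stdout and raises
# CalledProcessError on a non-zero exit status.
output = subprocess.check_output(["echo", "hello"])
```

Passing an argument list instead of `shell=True` also sidesteps shell quoting of arguments like `'*'`.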
[17:07:53] i can
[17:07:55] logging in
[17:08:03] thanks
[17:08:13] I'm about to try to create a proxy...
[17:08:21] andrewbogott: ok
[17:08:38] see anything?
[17:08:41] I see a PUT
[17:08:46] 10.4.1.57 - - [07/Oct/2013 17:08:31] "PUT /v1/visualeditor/mapping HTTP/1.1" 200 -
[17:08:56] Oh, I bet I'm forgetting to pass the body :)
[17:09:45] OK, how 'bout now… anything different?
[17:11:13] YuviPanda: was there a body that time?
[17:11:24] andrewbogott: 3) "frontend:toast.brunch.wmflabs.org"
[17:11:28] was that what you created?
[17:12:03] yeah, should have a single backend, visedinstance2.pmtpa.wmflabs:80
[17:12:10] Is it formatted right?
[17:12:12] andrewbogott: 1) "a5012ca5-1766-4f40-9000-11dc685ebfde:80"
[17:12:16] is what it has :|
[17:12:28] YuviPanda: any luck on your code?
[17:12:33] ooh, that's not right!
[17:12:47] nerus: gimme a moment. I forgot where it even was :|
[17:13:00] ha :P
[17:13:18] may be i should build a tool to view the logs of other tools in the web
[17:13:33] nerus: so that works for PHP. only doesn't work for CGI
[17:13:35] because CGI sucks
[17:13:49] he he
[17:13:55] i know access logs exists for php
[17:14:20] error logs too
[17:16:41] YuviPanda, how about that time?
[17:17:20] And, what do you return on success? 201?
[17:17:43] andrewbogott: 1) "visedinstance1.pmtpa.wmflabs:80"
[17:17:46] andrewbogott: 200
[17:18:53] YuviPanda: OK, looks like it's working and I was just expecting the wrong rval.
[17:18:58] oh
[17:18:59] Look OK on your end?
[17:19:04] it should 400 or 404 on error
[17:19:07] yeah!
[17:19:19] I copy/pasted code that checks for success; else fail
[17:19:22] rather than the reverse :)
[17:19:34] Anyway… thanks.
[17:19:42] Will bug you again when I get to delete.
[17:19:45] nerus: https://github.com/yuvipanda/SuchABot/blob/dcd97585a51f6e2499b13a2a426c1a3be69a376d/suchabot/receiver.py
[17:19:50] andrewbogott: :)
[17:20:00] andrewbogott: I enabled spdy support in the proxy over the weekend :)
[17:20:35] I saw that! spdy must not work how I thought it worked, though, I thought it had to be implemented all the way at the endpoint.
[17:20:58] YuviPanda: last time I checked spdy was evil :):):)
[17:21:07] andrewbogott: you need that for effecient use of some of its features (such as push), but not all
[17:21:22] saper: why so? It is not a Chrome only implementation anymore, and hasn't been for a year or so now
[17:21:27] It's basically a multiplexer, in this case?
[17:21:41] andrewbogott: pretty much, yeah.
[17:21:46] cool.
[17:21:53] andrewbogott: I haven't run any perf tests over it yet
[17:22:02] YuviPanda: perfect for pushing unsolicited content? I'm just trolling a bit
[17:22:09] "slower, but more complicated!"
[17:22:13] @YuviPanda that seems promising
[17:22:51] andrewbogott: spdy is supposed to be faster, in general. SPDY+HTTPS is supposed to be faster than pure HTTP, for example
[17:22:54] over 'typical loads' at least
[17:23:13] nerus: that's legoktm's code :D
[17:23:52] may be it was improper env
[17:23:54] trying that
[17:25:32] YuviPanda: ls -la qstat returns - usr/bin/qstat -> ../share/gridengine/gridengine-wrapper
[17:25:43] and my script was often returning file not found
[17:25:52] may be its because of the symbolic link
[17:25:53] lets see
[17:26:41] nerus: possible! check
[17:29:29] YuviPanda: so that environ thing did the trick
[17:30:26] * nerus phew...
[17:30:51] nerus: :D
[17:30:52] yay
[17:31:00] Yay, of course!
[18:39:03] (PS1) Yuvipanda: All apps to -mobile! [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/88181
[18:39:16] (PS2) Yuvipanda: All apps to -mobile! [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/88181
[18:39:23] (CR) Yuvipanda: [C: 2 V: 2] All apps to -mobile! [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/88181 (owner: Yuvipanda)
[18:48:35] YuviPanda: want to try out https://wikitech-test.wmflabs.org/wiki/Special:NovaProxy and see what bugs you can shake out?
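The "environ thing" that did the trick at 17:29 points at the CGI environment: gridengine's `/usr/bin/qstat` is a symlink to a wrapper that needs variables a near-empty CGI environment doesn't provide. A hedged sketch of running the command with an explicit environment and an argument list; the specific variable names and paths here are assumptions, not the actual fix from the log:

```python
import os
import subprocess

def run_with_env(cmd):
    """Run cmd with an explicitly filled-in environment.

    Under CGI the inherited environment can be nearly empty; the
    gridengine wrapper behind /usr/bin/qstat needs at least SGE_ROOT
    and a sane PATH. Both defaults below are assumed values for
    illustration.
    """
    env = dict(os.environ)
    env.setdefault("SGE_ROOT", "/var/lib/gridengine")  # assumed location
    env.setdefault("PATH", "/usr/bin:/bin")
    return subprocess.check_output(cmd, env=env).decode()

# e.g. run_with_env(["/usr/bin/qstat", "-u", "*", "-r", "-xml"])
```

Passing the arguments as a list also avoids `shell=True`, so the `*` reaches qstat literally without shell quoting games.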
[18:48:41] * YuviPanda clicks
[18:48:57] It communicates with the real proxy-dammit but uses a fake dns database.
[18:49:10] So routing won't work at all -- you'll have to watch the proxy logs to see what's happening.
[18:52:45] andrewbogott: nice so far!
[18:52:58] andrewbogott: no 'go back to page' after creating proxy?
[18:53:09] OK, I should add that.
[18:53:18] I didn't implement an edit/modify option because it seems… unneeded?
[18:53:23] Does that bother you?
[18:53:26] no
[18:53:29] hmm
[18:53:30] actually
[18:53:36] andrewbogott: would a delete and re-add cause DNS delays?
[18:53:46] if it does we should have an edit, if not it's ok
[18:55:05] YuviPanda, I don't know. If you delete and readd the same thing then caching doesn't matter, right?
[18:55:19] And if you delete and add something different, well… a 'modify' would be subject to the same lag.
[18:55:22] andrewbogott: when you delete a proxy, does it also delete the DNS entry?
[18:55:29] It does, yes.
[18:55:31] hmm, true that.
[18:55:53] so unless something caches the intermediate 'no entry' state, for few seconds or whatever...
[18:56:04] Right, I'd be surprised if it were noticeable.
[18:56:07] andrewbogott: I'm okay with it for now, but would be nice to have if it isn't too hard.
[18:56:13] but definitely nice-to-have
[18:56:17] OK, we can see if it turns out to matter.
[18:58:16] I also think maybe we want to care about regions at some point… have a proxy instance per region
[18:58:46] andrewbogott: yeah, but by then I guess the proxy will have evolved too
[18:58:52] andrewbogott: so no point complicating it now
[18:58:53] That would be a reason to have a rest api for the gateway IP. Everything else region-wise can be handled by keystone
[18:58:56] * andrewbogott nods
[19:00:22] andrewbogott: invisible-unicorn also needs to be actually 'deployed' in some form. it's currently running in a screen... :D
[19:00:26] the API, that is
[19:00:33] Yeah, that seems important!
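Based on the access-log line quoted earlier ("PUT /v1/visualeditor/mapping"), the request wikitech sends to the proxy API might be built like this. The JSON body schema (a domain plus a list of backends), host, and port are all assumptions inferred from the `frontend:`/backend values read out of the proxy's store during the test; only the URL path comes from the log:

```python
import json
import urllib.request

def build_mapping_request(api_base, project, domain, backends):
    """Build the PUT that registers a proxy mapping.

    The /v1/<project>/mapping path is from the access log in the
    conversation; the body schema is an assumption.
    """
    body = json.dumps({"domain": domain, "backends": backends}).encode()
    return urllib.request.Request(
        f"{api_base}/v1/{project}/mapping",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# urllib.request.urlopen(req) would actually send it; the API in the
# log answers 200 on success and 400/404 on error.
```

Forgetting `data=` reproduces andrewbogott's first attempt: the PUT arrives, but with no body for the API to store.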
[19:00:47] Ryan_Lane mentioned git deploy
[19:00:59] we should probably put that behind uwsgi
[19:05:52] biab
[21:14:49] [bz] (ASSIGNED - created by: Krinkle, priority: Normal - enhancement) [Bug 49350] Tool Labs: Change logo - https://bugzilla.wikimedia.org/show_bug.cgi?id=49350
[22:06:34] hey alll... I would like to volunteer as an ops/admin how do i go about that
[22:07:01] hi anth1y
[22:07:06] welcome :)
[22:07:19] thanks :) it's been a while
[22:07:21] anth1y: wikitech.wikimedia.org has general information on the wikimedia infrastructure.
[22:07:28] right
[22:07:35] I already have a shell account
[22:07:37] anth1y: and all our configuration is done via puppet, which is in the operations/puppet.git repo
[22:07:38] oh
[22:07:39] right
[22:07:52] oh ok cool
[22:08:10] labs is a 'playground' environment of sorts, where people can test things out
[22:08:16] right
[22:08:25] I once wanted to test out ceph
[22:08:34] there's also toollabs, which is a replacement for the toolserver (see tools.wmflabs.org)
[22:08:34] but I never got around to doing it
[22:08:42] heh, yeah. labs is what you'll probably be looking for
[22:09:28] I'd suggest poking aorund the operations/puppet.git repository and seeing if there's things you can refactor / fix
[22:09:33] ok so what the steps I need to take to being admin
[22:09:35] oh ok
[22:09:42] will do
[22:09:43] thanks
[22:09:45] :)
[22:11:07] brb
[22:20:21] back
[22:23:19] thanks @RagePanda
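Putting the API behind uwsgi instead of a screen session, as suggested at 19:00, might look roughly like the config fragment below. The module path, socket, and process count are all assumptions for illustration, not the actual invisible-unicorn deployment:

```ini
; Sketch only: module name, port, and worker count are assumptions
[uwsgi]
http-socket = 0.0.0.0:8080
module = invisibleunicorn:app
master = true
processes = 2
```

Unlike a screen session, a uwsgi master restarts crashed workers and can itself be supervised by the init system, which is the point of "actually deploying" the API.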