[07:07:12] [bz] (NEW - created by: Pietrodn, priority: Unprioritized - normal) [Bug 52370] Replicated DB fawiki_p is missing revision table - https://bugzilla.wikimedia.org/show_bug.cgi?id=52370 [08:08:02] [bz] (NEW - created by: Tyler Romeo, priority: Normal - enhancement) [Bug 52354] Run Minion testing instance for security testing - https://bugzilla.wikimedia.org/show_bug.cgi?id=52354 [08:22:13] !ping [08:22:13] !pong [08:25:35] @channels [08:28:59] !puff [08:29:17] !poof [08:29:18] *POOF* "Wadda need?" *POOF* "Wadda need?" *POOF* "Wadda need?" [08:49:04] [bz] (RESOLVED - created by: se4598, priority: High - normal) [Bug 50498] "Error opening index" for File: search on Labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=50498 [08:49:54] [bz] (RESOLVED - created by: Krinkle, priority: High - enhancement) [Bug 34250] [beta project] Set up search (tracking) - https://bugzilla.wikimedia.org/show_bug.cgi?id=34250 [08:50:10] [bz] (RESOLVED - created by: Chris McMahon, priority: Normal - normal) [Bug 46459] [OPS] lucene-search-2 uses too much memory on labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=46459 [08:56:35] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Lowest - enhancement) [Bug 45122] migrate all beta instances from Lucid to Precise - https://bugzilla.wikimedia.org/show_bug.cgi?id=45122 [08:57:08] !log deployment-prep Deleting the old squid instance since we run varnish cache for text nowadays [08:57:11] Logged the message, Master [08:57:44] !log deployment-prep Deleting deployment-cache-upload03, replaced by the fully puppetized instance deployment-cache-upload04 [08:57:47] Logged the message, Master [09:02:56] !log deployment-prep rebooted both memcached instances to be able to log on them. Apt upgrading both of them [09:02:58] Logged the message, Master [09:19:12] hashar: ping [09:19:19] yup [09:19:26] hashar: how complicated is it to set up a unit testing server for mediawiki? [09:19:38] what do you want to do ? [09:19:46] I managed to get one oracle box here, and I would like to start unit testing of mediawiki there [09:19:47] ori has created a vagrant setup to run unit tests [09:20:05] mediawiki seems to have a lot of trouble when it comes to oracle [09:20:12] I wasn't even able to install it :D [09:20:23] there is some mistake in the SQL for main page creation [09:21:00] https://bugzilla.wikimedia.org/show_bug.cgi?id=52094 [09:21:09] that is more or less maintained by freakolowsky as a best-effort project [09:21:19] I think he looks at it whenever we cut a new stable branch [09:21:45] ok but it would still be cool to get notified about a commit that breaks stuff in oracle [09:22:22] how do these unit tests actually work? are they all re-executed on each commit? or just periodically done on head?
[09:22:40] petan: https://bugzilla.wikimedia.org/show_bug.cgi?id=20343 [09:22:55] the unit tests are run on each patchset submitted [09:23:13] ok [09:23:16] that makes sense [09:23:22] and whenever someone votes CR+2, then if the tests pass the patchset is automatically merged by jenkins [09:23:26] else it is rejected [09:23:36] hmm [09:23:52] I understand that wmf can't host an oracle testing box, but here it's not a problem [09:30:57] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 52378] beta memcached instances only have 90MB of memory - https://bugzilla.wikimedia.org/show_bug.cgi?id=52378 [09:34:06] [bz] (NEW - created by: Chris McMahon, priority: Unprioritized - normal) [Bug 52237] URL confusion: commons.wikimedia vs commons.wikipedia - https://bugzilla.wikimedia.org/show_bug.cgi?id=52237 [09:37:24] !ping [09:37:24] !pong [09:37:26] ~ping [09:37:40] petan: what is the command to ping the NFS writes ? [09:37:48] &ping [09:37:49] Pinging all local filesystems, hold on [09:37:50] Written and deleted 4 bytes on /tmp in 00:00:00.0006360 [09:37:56] :( [09:37:59] it doesn't work [09:38:03] :-D [09:38:10] I can't login myself due to NFS [09:38:16] must be some process being wild [09:38:31] yes it is writing to nfs now... [09:38:36] 4 bytes XD [09:39:25] Written and deleted 4 bytes on /data/project in 00:01:36.0901400 [09:39:28] hashar ^ [09:39:31] great [09:39:37] &ping [09:39:37] Pinging all local filesystems, hold on [09:39:38] Written and deleted 4 bytes on /tmp in 00:00:00.0002180 [09:39:39] Written and deleted 4 bytes on /data/project in 00:00:00.0063260 [09:40:41] &ping [09:40:41] Pinging all local filesystems, hold on [09:40:42] Written and deleted 4 bytes on /tmp in 00:00:00.0008300 [09:40:43] Written and deleted 4 bytes on /data/project in 00:00:00.0071260 [09:41:55] petan, what's causing webserver 1 on tools to be spiking [09:42:00] nfs [09:42:28] I really need to get a vacation [10:03:17] addshore: ping [10:03:23] pong :> [10:04:02] addshore: had a question regarding addbot [10:04:21] fire away! [10:04:24] addshore: at http://en.wikipedia.org/w/index.php?title=Wikipedia%3ATwinkle%2FPreferences&diff=566493418&oldid=558602674 [10:04:40] it didn't migrate the simple one there [10:04:56] hmm [10:05:47] rather odd, it should catch it on the next pass I guess [10:05:59] if it doesn't and I see other cases like this I will take more of a look [10:06:22] when is next pass? [10:06:26] it is probably because simple is an odd case :> [10:06:31] possibly [10:06:37] I just noticed it [10:06:43] well, it is still doing the first pass now, it's on mrwiki I think [10:06:56] mr. wiki? [10:07:00] yup [10:07:12] http://tools.wmflabs.org/addshore/addbot/status/ [10:07:19] it only migrates one each time? [10:07:28] To see where it is check www.wikidata.org/wiki/Special:Contributions/Addbot and which links it is adding [10:08:17] It attempts to remove interwiki links from wiki articles wiki by wiki, starting with the wiki that was checked longest ago [10:08:28] where it can, it imports more interwikis to wikidata from any given article [10:08:38] and it will also remove all possible interwikis (except for simple apparently..)
[10:09:02] k [10:09:30] wikidata should be renamed into wikinerd [10:09:37] hmmm in the removal function it has $iwPrefix = $site->getIwPrefix(); [10:09:43] https://github.com/addshore/addwiki/blob/master/includes/Page.php#L383 [10:10:02] rather odd how it didn't remove simple [10:10:03] I've never used simple [10:10:12] unless the prefix isn't getting set right [10:10:28] but as you can see here [10:10:28] https://github.com/addshore/addwiki/blob/master/includes/Site.php#L102 [10:10:33] I tried to account for simple ;p [10:10:46] hehe [10:11:30] * AzaToth wonders why people write bots in php [10:11:52] that's like writing an embedded computer for a car in visual basic [10:12:04] hah! [10:12:30] well, my thinking is if people want to write bots for wikipedia they may as well do it in php :P then when they also want to start hacking mediawiki there is less to learn xD [10:12:36] Hence I started writing this php framework [10:12:50] I'd love other people to help out of course ;p [10:13:28] addshore, there's only room for one sophisticated PHP framework, and that's Peachy. You're supposed to help me. :p [10:13:36] Cyberpower678: I don't like peachy ;p [10:13:53] I'd just prefer to write my own so I know how it works ;p [10:13:55] addshore, oh? You liked it a couple of weeks ago. [10:14:10] yus :P but then I managed to write this framework that did exactly what I wanted ;p [10:14:33] Peachy can do exactly what you want too. [10:14:43] can it edit wikidata yet? :O [10:14:57] It can edit any wiki, [10:15:10] mhhm, I'm not sure it knows about the wikidata api yet ;p [10:15:17] unless you have coded it recently :P [10:15:36] Cyberbot is editing Wikidata right now. [10:15:44] :O [10:15:45] *checks* [10:16:05] can't find it :O [10:17:12] what the .... [10:17:16] User:Cyberbot I. [10:17:21] 15 of my jobs just fell off the grid :P [10:17:29] :DDDDDD [10:17:45] I hacked your account as punishment for not using Peachy. ;P [10:17:47] heh Cyberpower678, can it edit entities though [10:18:01] It will, soon. [10:18:23] Peachy in its final release will support all extensions. [10:18:43] Cyberpower678: all? [10:19:03] serious? [10:19:04] Right now, I'm still removing the rust on its core. [10:19:09] zhuyifei1999, yes. [10:19:15] hehe Cyberpower678 see, I started with a fresh core ;p [10:19:40] even a third party extension? [10:20:21] That's my goal. Which is why I'm not on Wikipedia a lot. I've been doing almost only bot work. [10:20:59] Extensions available on the MediaWiki site. Anybody can contribute to writing extension support. [10:21:06] addshore, let me see the framework. [10:21:17] heh https://github.com/addshore/addwiki [10:21:33] lots of holes currently, I have only been using it for under a week [10:21:39] lots to do [10:22:59] CP678: nick [10:23:06] Thumbs down. Being bot and Wikimedia specific is a no-no. [10:23:26] addshore, mistakenly put my computer to sleep. [10:23:40] haha [10:23:57] Peachy is still better at this point. The core is almost fresh again. [10:24:17] addshore, I'm putting the paint on it now. :p [10:25:41] addshore: your framework looks a little too like peachy [10:25:49] really? O_o [10:26:49] I haven't had enough time to look at peachy much :< [10:27:28] addshore, the structure is that of Peachy. [10:28:09] The coding is different, but is generally set up like Peachy. Except that your framework isn't designed to support plugins.
:p [10:28:23] Cyberpower678: indeed, that's on my todo ;p [10:28:46] currently plugins are just supported as part of core :P [10:28:50] Your todo list is much bigger than mine. :p [10:29:01] yup :p [10:29:07] :D [10:29:19] my problem with peachy is, you don't use github properly :p [10:29:38] I probably would have tried to contribute some code, but you never show what the actual state is xD [10:29:54] addshore, ?? [10:30:33] addshore, what are you saying? [10:30:45] it was hard enough trying to get you to put it on github in the first place originally :P [10:31:10] It's on GitHub now. What you see up there are my latest updates. [10:32:30] addshore, ^ [10:32:36] :O [10:32:36] Have a look. [10:32:51] I'm working on Alpha 4 now. [10:34:14] addshore, when it's able to support Wikidata, will you test Peachy? [10:34:30] I'll have a go at moving my scripts over maybe [10:34:41] might even look at the code later and try and add the basic wikidata support [10:35:11] addshore, cool. [10:36:12] :) [10:40:21] (CR) Yuvipanda: [C: 2 V: 2] "Merging because I guess this is just a tool" [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76764 (owner: Yuvipanda) [10:50:31] AzaToth +1 :P [10:50:34] regarding bots in php [10:50:53] but TBH bots in python are not much better [10:52:11] petan, at least PHP uses braces. :p [10:52:49] well, from a syntax perspective it's clearly better [10:53:00] yes it is. [10:53:36] * YuviPanda adds that to quips for being funny [10:55:09] python's syntax truly sucks as nothing [10:57:21] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - enhancement) [Bug 52382] automatically import some content from production (tracking) - https://bugzilla.wikimedia.org/show_bug.cgi?id=52382 [10:57:22] [bz] (NEW - created by: Željko Filipin, priority: Unprioritized - normal) [Bug 47205] Sandbox gadget not at en.wikipedia.beta.wmflabs.org - https://bugzilla.wikimedia.org/show_bug.cgi?id=47205 [10:57:23] [bz] (NEW - created by: Matthew Flaschen, priority: Normal - enhancement) [Bug 49791] sync-site-resources should sync all Labs wikis - https://bugzilla.wikimedia.org/show_bug.cgi?id=49791 [10:57:24] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Normal - enhancement) [Bug 49779] sync articles from production wikis (css/gadgets) - https://bugzilla.wikimedia.org/show_bug.cgi?id=49779 [11:02:08] [dispatcher-labs] benapetr pushed 1 commit to master [+0/-0/±2] http://git.io/1LwXxA [11:02:10] [dispatcher-labs] benapetr 3e6746d - some more debugging output to make stuff clear [11:12:19] (PS2) Yuvipanda: Try to erase file whenever app exits. Also never exit cleanly [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76837 [11:12:30] [dispatcher-labs] benapetr pushed 1 commit to master [+0/-0/±2] http://git.io/I6pVqQ [11:12:31] [dispatcher-labs] benapetr 08faf76 - fixed Load() [11:13:46] (CR) Yuvipanda: [C: 2 V: 2] Try to erase file whenever app exits.
Also never exit cleanly [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76837 (owner: Yuvipanda) [11:14:12] @q Not-002 [11:16:42] (PS2) Yuvipanda: Do the key generation for registering on the server [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76845 [11:18:16] (CR) Yuvipanda: [C: 2 V: 2] "Post commit review, yo :)" [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76845 (owner: Yuvipanda) [12:15:26] &ping [12:15:26] Pinging all local filesystems, hold on [12:15:27] Written and deleted 4 bytes on /tmp in 00:00:00.0005330 [12:15:28] Written and deleted 4 bytes on /data/project in 00:00:00.0070970 [12:21:53] &whoami [12:21:53] You are unknown to me :) [12:21:59] &trusted [12:21:59] I trust: test (trusted), [12:22:08] &ping [12:22:08] Pinging all local filesystems, hold on [12:22:09] Written and deleted 4 bytes on /tmp in 00:00:00.0005810 [12:22:10] Written and deleted 4 bytes on /data/project in 00:00:00.0073160 [12:22:19] petan, what does it do? [12:22:54] [bz] (ASSIGNED - created by: Amir E. Aharoni, priority: Normal - enhancement) [Bug 52222] fill http://he.wikipedia.beta.wmflabs.org/ with some useful data from he.wikipedia.org - https://bugzilla.wikimedia.org/show_bug.cgi?id=52222 [12:23:31] [bz] (NEW - created by: spage, priority: High - normal) [Bug 51580] configure beta labs for SUL2 - https://bugzilla.wikimedia.org/show_bug.cgi?id=51580 [12:24:20] [bz] (RESOLVED - created by: Željko Filipin, priority: Unprioritized - normal) [Bug 47360] at en.wikipedia.beta.wmflabs.org Special:UserLogin opens after logging in instead of Main_Page - https://bugzilla.wikimedia.org/show_bug.cgi?id=47360 [12:25:44] [bz] (NEW - created by: Željko Filipin, priority: Normal - enhancement) [Bug 47205] sync Sandbox gadget from production to en.wikipedia.beta.wmflabs.org - https://bugzilla.wikimedia.org/show_bug.cgi?id=47205 [12:27:42] [bz] (NEW - created by: Chris McMahon, priority: High - enhancement) [Bug 50335] support dvwiki in beta labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=50335 [12:35:20] These webserver spikes are getting out of control. [12:35:55] Labs has become the next generation toolserver. [12:36:21] People wanted something identical to toolserver, and they got it. Instabilities and all. :p [12:41:36] quick question about the labsdbs, are they also queryable from non-labs machines from within one of our dcs? [12:42:12] drdee, no. You can only query while SSH'd into labs. [12:42:23] 100% sure? [12:42:29] Yes. [12:42:32] ty! [12:42:41] Labs is internal access only. [12:42:50] Toolserver has external access. [12:46:21] !log deployment-prep enwikivoyage's search index finished building overnight. dewikivoyage seems to have stalled out. I'm going to profile it. simplewiki is still running and will need some love to finish more quickly. [12:46:24] Logged the message, Master [12:55:32] !access | T13|needsCoffee [12:55:33] T13|needsCoffee: https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [13:15:52] andrewbogott_afk ping [13:43:34] petan: "nothing" doesn't suck per def [13:43:44] or does it? [13:43:48] maybe :) [13:43:56] brainfuck did by definition AFAIK [13:44:15] and there is little difference between python and brainfuck [13:44:16] :D [13:44:29] heh [13:45:05] brainfuck is a low-level python [14:19:57] hi, guys. I am developing a bot with http://dumps.wikimedia.org/other/pagecounts-raw/ for KOWP.
but I think you already have one which lists popular articles or suddenly popular articles. [14:21:18] Actually I have some issues with utf-8 decoding. [14:51:13] who do I ask for sudo on a project? I'm looking at integration-jenkins2 and working on tarball releases [14:54:20] nm... got it [14:55:05] hexmode: I was about to say "whoever is project admin" [14:55:35] * Coren conceptually throws uwsgi to the floor and stomps on it. Bad! Bad! [15:02:01] ryuch___: I assume you are writing it in python? [15:09:35] !log integration unbroke puppet on integration-jenkins2 [15:09:37] Logged the message, Master [15:18:35] Yeah, so, I've been working in and around python for a few weeks now. Impressively, the more I work with Python the more I hate it. [15:20:47] heh, same here [15:21:10] just that I haven't been working with python that long :D [15:21:31] there is only 1 thing I like about it [15:22:19] I can put a semicolon at the end of a line and it's not a syntax error, which in fact is a bad thing, because it's not strict [15:47:57] Do I need special rights to authenticate to graphite.wikimedia.org? I tried my labs login as mentioned in the auth prompt but am being 401'd [15:48:55] yes [15:49:25] unfortunately there is private data in the full graphite views so we can't open this up to the whole world [15:49:39] paravoid: who do I need to poke to get access? [15:49:47] https://gdash.wikimedia.org/ is public though [15:50:49] oh wait, you're staff [15:50:55] I just realized :) [15:51:02] paravoid: oh. yes. n00b staff [15:51:27] paravoid: no worries. I don't think I have my official cloak yet [15:51:47] paravoid: also, hello I'm the new guy in robla's group [15:51:55] yes, I remembered the name :) [15:52:05] I'm Faidon from the operations team [15:52:49] what's your labs username? [15:52:59] BryanDavis [15:53:38] sec [15:55:31] [bz] (NEW - created by: Pietrodn, priority: Unprioritized - normal) [Bug 52370] Replicated DB fawiki_p is missing revision table - https://bugzilla.wikimedia.org/show_bug.cgi?id=52370 [15:57:06] bd808: try again [15:57:41] paravoid: I'm in! thanks [15:57:49] you're now part of the "wmf" group [15:58:01] we have a few other places where we have access controls based on that group [15:58:04] e.g. ishmael [15:58:32] there are plans to move this to an NDA group [15:58:37] to open that up to volunteers [15:59:18] I suppose at some point "ishmael" will mean something to me other than _Moby Dick_ [15:59:21] :) [15:59:22] haha [15:59:28] https://ishmael.wikimedia.org/ [15:59:43] https://wikitech.wikimedia.org/wiki/Ishmael.wikimedia.org are the docs about it [16:02:31] oh cool. Somebody actually had sent me an old email thread about a tool for slow query analysis. [16:02:44] * bd808 slowly learns things that will be important later [16:12:30] what are you looking for in graphite? [16:12:34] out of curiosity mostly [16:13:38] just poking around. I'm working on a small project to add some tracking of cache purges and will be pushing data there eventually [16:13:48] https://www.mediawiki.org/wiki/Multimedia/Cache_Invalidation_Misses [16:13:49] what kind of tracking? [16:13:52] oh [16:14:31] we want to get some visibility into purge packet loss [16:14:42] we already have such means [16:14:59] o_O [16:15:06] it's kinda strange you're working on this, our team is also working on that [16:15:18] the vhtcpd stats? [16:15:21] yes [16:15:36] You [16:15:46] are not in the *same* team, are you?
:-) [16:15:50] we are not :) [16:16:04] well, broadly speaking we *all* are in the same team [16:16:07] ;) [16:16:29] we have also discussed pgm/0mq a bit [16:16:41] this is not related to multimedia at all btw [16:16:47] this is how we do purges everywhere [16:16:59] Brandon seems to be against 0mq somewhat [16:17:09] oh you've spoken to brandon already, that's good [16:17:20] a little. one set of emails [16:18:11] I'm starting with the idea of checking Age headers for URIs that the db has marked as overwritten [16:18:49] all urls you mean? [16:18:49] robla thinks this may let us be more proactive about finding things that are borked [16:19:07] that messes with LRU, I don't think this is such a good idea [16:19:44] I think for the short term checking vhtcpd stats is a good measure and for the longer term the solution would be to switch to a lossless transport [16:19:47] hmmm that makes sense. would make things hot arbitrarily [16:20:02] but I think bblack is working on all that? [16:20:32] he has stats in the daemon. I don't think they are being graphed yet. [16:20:42] which may be the best thing for me to add first [16:20:58] I was under the impression he was planning for a ganglia plugin [16:21:31] we should have a larger discussion. what mailing list is good for stuff like this? [16:21:41] or a hangout or something [16:21:55] well, to me this sounds like right within ops' realm [16:22:00] so I'd say ops@ [16:22:10] I think robla and bawolff should be in on it [16:22:27] otoh, you've filed this under mediawiki.org/Multimedia, so it's getting a little strange :) [16:23:33] haha. well it landed there because I'm currently on loan to the MM team and they are concerned about bug 49362 [16:23:44] but yeah, varnish & vhtcpd are definitely ops [16:24:02] ops@lists.wikimedia.org [16:24:03] being n00b and a designated utility player will make lots of things I do weird [16:24:30] I need to get on that list. is it open or staff only? [16:24:50] it's staff only, started as ops-only but this is definitely not the case anymore [16:26:40] paravoid: I just signed up. I'll work on an intro email about the problem and see if we can get a reasonable plan underway for how we can all work together [16:27:49] !log deployment-prep building search index for commonswiki and the other wikis that aren't in the main section of http://deployment.wikimedia.beta.wmflabs.org/wiki/Special:SiteMatrix [16:27:52] Logged the message, Master [16:28:45] bd808: it would be more correct to say that I don't think the 0mq+pgm solution is worth it [16:29:00] bd808: I was actually the guy who came up with that idea in the first place, so ... :P [16:29:11] bblack: noted. Not trying to put words in your mouth [16:29:57] I'm glad I poked my head in here randomly. I think this is a great discussion [16:30:43] (I'm glad too, usually #-operations is a better forum for such discussions though) [16:31:11] as in, you have higher chances of getting meaningful answers/discussions :) [16:31:54] bd808: the problem with the 0mq+pgm thing is, if you really want it to be reliable, architecting it becomes really complicated.
You need some redundant hubs that handle publish/subscribe between the PHP sender nodes and the vhtcpd receivers, and then you can't lose pubs and subs either, or lose the retrans when one of them crashes [16:32:17] it's not like we don't have a hub to make multicast->unicast right now though :) [16:32:23] (for esams) [16:32:24] from the 1000-ft view, I think you either end up poorly implementing that and adding to the failures, or implementing it really well but the complexity cost isn't worth the results [16:33:06] grrr- labs machines keep freezing on me.... [16:33:33] bblack: I've never done it, but it's theoretically possible to stick RabbitMQ into a 0MQ network to act as a reliable relay [16:33:46] I'd rather see us first monitor vhtcpd better (which is an ops task), detect whether there are any remaining failure modes that matter enough to care. I suspect we'll see a few instances of multicast relay failure or bursts of dropped multicast here or there, but that there will be ways to reduce those issues as they're debugged [16:34:10] stick X into Y and wave magic wand and things are reliable never actually works out so easily in practice :) [16:34:22] I think a reliable transport is something we should plan for the mid term though [16:34:39] we have persistent caches, this means that if we need to reboot for a kernel upgrade or the motherboard fails or something [16:34:46] we'll lose N minutes of purges [16:35:07] but still serve out of cached content [16:35:18] so this has worked so far but it's a bit suboptimal I think [16:35:24] I still don't have a full grasp of the whole purging picture, but I suspect there may be other things we can hack around on that lessen the impact of missed purges in the first place. [16:35:32] bblack: can I help the process by writing the ganglia plugin for your stats? [16:36:25] if you really want to, feel free :) [16:36:29] root@cp1040:~# cat /tmp/vhtcpd.stats [16:36:30] start:1375268516 uptime:106410 inpkts_recvd:35459921 inpkts_sane:35459921 inpkts_enqueued:35459921 inpkts_dequeued:35459921 queue_overflows:0 [16:36:53] start is a unix gmt timestamp, uptime is how long the daemon has been alive, the file is overwritten every so often (30s I think?) [16:37:12] we have similar monitoring of udp2log log streams [16:38:11] inpkts_recvd is multicast that hit the daemon, _sane means it parsed correctly and wasn't garbage, _enqueued means it made it into the sending queue, and _dequeued means it was sent as TCP PURGE to all applicable varnish instances [16:38:28] queue_overflows means the queue backlogged by such a huge volume of memory that the daemon gave up and wiped out all queued requests [16:39:02] we also have nagios alerts for udp2log streams I think [16:39:07] so you just have to track those counters over time into a graph, basically [16:39:23] but these come with a seq number so it's easy to measure packet loss [16:39:42] bblack: sounds easy in theory [16:39:58] graphing is easy, detecting anomalies isn't :) [16:40:15] I mean, sure, it's easy to measure "0", but how about 1-2% packet loss? :) [16:40:19] some easy/obvious triggers would be any of the rates being zero for a few samples in a row [16:40:28] or non-zero queue_overflow numbers [16:40:44] I've never really understood why more people don't use the predictive stuff in rrdtool...
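(Editorial aside: a minimal sketch, in Python, of how the /tmp/vhtcpd.stats counters bblack describes above could be scraped for graphing. The path and field names come from the paste; the derived metrics and everything else are assumptions, not the actual ganglia plugin.)

```python
# Hedged sketch: parse vhtcpd's space-separated "key:value" stats line and
# derive a few loss indicators from the counters explained above.
def read_vhtcpd_stats(path='/tmp/vhtcpd.stats'):
    with open(path) as f:
        pairs = (field.split(':', 1) for field in f.read().split())
    return {key: int(value) for key, value in pairs}

stats = read_vhtcpd_stats()
not_sane = stats['inpkts_recvd'] - stats['inpkts_sane']             # garbage input
still_queued = stats['inpkts_enqueued'] - stats['inpkts_dequeued']  # send backlog
print(not_sane, still_queued, stats['queue_overflows'])
```

(A plugin would sample this every so often and graph the deltas; per the discussion, non-zero queue_overflows or a rate stuck at zero for a few samples are the obvious alert triggers.)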
[16:41:39] graphite has holt-winters and some other stuff that might be useful [16:41:56] the hard part is figuring out the expected rates in all this [16:41:56] because that's for detecting general trends, not precise measures such as 1-2% packet loss [16:42:13] I wonder if we could instrument the outbound side in mw? [16:43:45] re the LRU cache tampering inherent in polling for Age headers: is there an HTTP verb that would give us data without corrupting the state? [16:44:34] you could engineer that within varnish, but honestly I think it's silly to hit purged URLs to see if the purge succeeded [16:45:33] I think bawolff's idea here is to do statistical sampling to get an idea about that possible 1-2% loss rate [16:45:55] and some idea if it's constant or peaks with other events [16:46:12] it's probably a lot smaller than 1% too [16:46:27] I think it's an entirely wrong layer to approach this [16:46:39] if we have 1-2% packet loss, it's either network packet loss or some issue with vhtcpd [16:46:40] it's just one of those things in an eventually consistent system that makes content authors crazy [16:46:47] vhtcpd has detailed stats so we can detect this there [16:46:57] network packet loss... well, we can easily measure that can't we :) [16:47:55] without ever checking http [16:48:46] well, probably the most interesting and ripe area for loss would be in the sending and receiving machine's kernel-level network buffers [16:49:17] on the receiving end, vhtcpd is pretty damned efficient at keeping the queue empty and setting the queue size to "super huge", but do the sending machines ever drop on outbound queues? [16:49:55] the amount of htcp on the sending side is much lower than on the receiving side though :) [16:50:03] 1/N where N is the amount of appservers [16:50:14] and appservers are much less busy in general [16:50:15] linux kernel has (had?) some udp rate issues. Facebook hit them when they switched to UDP for memcache [16:50:32] they're worse machines but not much worse [16:50:40] ok [16:50:50] well, not _that_ worse [16:51:12] noted. worse, but not _that_ worse, but not much worse :) [16:51:27] lol :) [16:52:11] that's my impression anyway, have a look, maybe you'll see something that I won't [16:52:11] I think the long term solution for images is to change the thumb url structure to self version. That won't fix anon content cache staleness but that's less likely to be noticed by editors [16:52:32] I've heard this idea before [16:52:35] I don't like it much tbh [16:52:48] bd808: stepping out a few layers again, I think some of the reaction to this problem is still grounded in the rate of problems reported in the past. we knew the major fault was the relay daemon, and we replaced it with something better. Need more data now to determine exactly how much of a problem still exists, to know what kinds of solutions are "worth it" [16:53:19] I don't like creating second class citizens wrt caching [16:53:32] mhhm, how easy is it to get a repo on gerrit? [16:53:59] bblack: agreed. we need to know what's broken and how often before we can reliably evaluate a fix [16:54:06] bblack: do you know if there are indications when the kernel drops udp? [16:54:06] *googles* [16:54:14] /proc/sockstat or something similar?
[16:54:30] bblack: otherwise we are just poking things and hoping something different happens [16:54:41] /proc/net/sockstat even :) [16:55:31] well /proc/net/snmp has [16:55:31] Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors [16:55:31] Udp: 1104530187 4270 876485 117501449 123041 0 [16:55:46] RcvbufErrors == overflow? [16:56:53] /proc/net/udp has a "drops" field too [16:56:55] but without a way to reset those, it's tricky. those stats go back to reboot, and we've done a lot of restarting things and fixing things since then, undoubtedly some of that had the daemon not picking up packets for brief intervals, etc [16:57:12] well, rate is what we'll measure, won't it? [16:57:19] ganglia maybe [16:57:33] we monitor drops for udp2log procs [16:57:55] good point [16:57:59] 15021: 00000000:12DB 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 2821439826 2 ffff8807f95f0e00 0 [16:58:10] ottomata: :) [16:58:11] :12DB is the multicast listen port, no drops on this particular machine [16:58:29] ottomata: I pointed out the similarity of these problems with udp2log monitoring a bit earlier [16:58:41] aye ja, saw that, [16:58:58] i think the udp2log /proc/net/udp ganglia plugin right now should be genericized [16:59:03] perfect [16:59:08] it is written to aggregate multiple udp2log instances [16:59:19] i think it should be pluggable to just give the port you want to monitor [16:59:35] Anyone here that can help me figure out how to connect to Bastion with PuTTY in Windows? [17:00:27] so monitoring /proc/net/udp on both senders & receivers, vhtcpd stats and network loss should be enough [17:00:37] bblack: you were saying about overengineering? :) [17:00:53] T13: https://wikitech.wikimedia.org/wiki/Help:Putty [17:03:19] paravoid's plan would be great for the packets lost on the wire aspect [17:03:23] Ah-ha... I only made a 1024 key instead of a 2048... that looks like the exact documentation I needed bd808 , thanks. [17:04:05] bd808: I should mention, there are others that will have an opinion about this and much more experience with past failures than I, you should definitely raise it on the list [17:04:11] mark & asher for sure [17:04:18] either ops@ or wikitech@ [17:04:27] bawolff reminded me in side chat that there are other things we want to get a handle on. Apparently there have been php issues in the past that kept packets from being sent at all [17:05:37] the php side borks may have been the root of the idea to poll Age headers [17:05:40] (wikitech@ being completely public) [17:06:25] Age has multiple other issues too [17:06:33] our caching is tiered for one [17:08:08] true, but the thumbs bawolff is interested in are overwrites that should have gone all the way back to the bottom [17:09:17] great discussion. I'll add stuff to the wiki and try to get a discussion going on ops@ (as soon as I'm moderated in) [17:09:31] * bd808 has standup to attend [17:10:46] * Coren screams in frustration! [17:14:17] * YuviPanda gives Coren ice cream [17:14:18] there there [17:14:22] uwsgi driving you nuts? [17:14:35] wsgi generally. It is a blight upon this world. [17:14:51] I see [17:15:24] at least better than having to do things with mod_wsgi [17:15:45] I think I shall abandon it entirely in favor of fastcgi. All the toolkits support it anyways. [17:16:07] And at least fastcgi has sane, well-defined semantics. [17:16:22] :'( [17:16:57] * Coren points out that, if you really need wsgi, you can always set your fastcgi to simply invoke uwsgi with a fastcgi socket.
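(Editorial aside: a hedged Python sketch of reading the kernel-side UDP counters bblack and paravoid quote above from /proc/net/snmp; the paired "Udp:" header/value lines match the paste, the rest is an assumption about how a monitor might consume them.)

```python
# Sketch: /proc/net/snmp carries two matching "Udp:" lines, a header row and
# a value row; zip them into a dict and watch InErrors / RcvbufErrors grow.
def read_udp_counters(path='/proc/net/snmp'):
    with open(path) as f:
        rows = [line.split()[1:] for line in f if line.startswith('Udp:')]
    header, values = rows
    return dict(zip(header, map(int, values)))

counters = read_udp_counters()
# Counters are cumulative since boot (the "no way to reset those" problem),
# so sample twice and diff to get a drop *rate* rather than a total.
print(counters['InDatagrams'], counters['InErrors'], counters['RcvbufErrors'])
```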
[17:18:06] fastcgi -> wsgi -> app has more points of failure than wsgi -> app [17:18:18] Coren: what exactly is the problem, if there are specific things? [17:18:21] with just setting it up [17:19:50] Coren: also see: http://www.peterbe.com/plog/fcgi-vs-gunicorn-vs-uwsgi/ [17:20:06] There are a number of extraordinarily infelicitous things in the whole way this is created. While you can have an uwsgi that starts multiple apps, they each have their own socket and no clean way to dispatch between them. Unless you modify the app to register with /another/ uwsgi that fetches stuff from its cache. [17:21:05] 'dispatch between them'? Each of them has their own socket and we've to 'route' between them somehow, no? [17:22:34] Yes. And don't give me fastrouter -- you either have to hardcode the apps in it or set up a registrar which is complicated for the end user, completely insecure /and/ brittle. [17:23:06] how would fastcgi be any different? [17:23:14] How many Python web thingies are we talking about, BTW? [17:23:24] scfc_de: It's the fad-of-the-moment [17:23:24] (= affected users) [17:23:38] python is fad of the moment? :) [17:24:36] YuviPanda: Fastcgi is simple and straightforward. I simply add a spawn-fcgi equivalent that sends stuff out to the grid and bam! Instant fastcgi support. [17:24:49] YuviPanda: Yes. [17:25:04] now you're just talking like petan, but nevermind. [17:25:09] :-) [17:25:28] (perhaps related, I just came from seeing http://worrydream.com/dbx/ :) ) [17:27:09] scfc_de: do you have time to help me set up tools-redis? Would be nice to get that done before wikimania :) [17:27:13] and should be rather trivial too [17:27:53] * sumanah jumps on fad of the moment, sells novelty t-shirts [17:28:18] YuviPanda: What do you need? [17:28:27] Oh, privileges! :-) [17:28:37] scfc_de: 1. delete current tools-redis (which, uh, idk what that is :P) 2. create instance, 3. apply puppet class 4. bam! [17:29:33] scfc_de: I know that current tools-redis is pretty much a borked instance that does nothing. if such an instance exists at all [17:31:48] YuviPanda: It exists, but it was created on July 17th without any SAL entry. I don't see anything important, so ... 3, 2, 1 ... [17:32:09] scfc_de: I believe that was petan trying to create it and then got distracted. or something. [17:32:11] kill it! [17:32:34] "Deleted instance i-0000081f (tools-redis)." [17:32:48] \o/ [17:33:03] "ACTIVE (deleting)", so might take a moment. [17:33:17] yeah, I think I had some trouble with that a while ago [17:33:22] when experimenting with toolsbeta-redis [17:33:23] And it's gone. [17:33:32] scfc_de: also can we get an instance bigger than small? [17:33:40] let me look at the list [17:33:43] Instance type? [17:33:46] Ah, okay. [17:34:02] I assume you see the exact same UI for toolsbeta? [17:34:32] scfc_de: yeah [17:34:33] moment [17:34:39] it logged me out, looking for phone [17:36:19] scfc_de: m1.medium? Even m1.large if I can sneak that up :P [17:36:25] * T13 is still having a hard time with PuTTY getting into bastion to get to bots-labs [17:36:30] at least medium, perhaps large [17:36:37] Coren, petan, YuviPanda, whoever else has root on toolsbeta: BTW, "ssh_hba" must be set to "yes" in the "Configure instance" for all instances. [17:37:07] YuviPanda: For me, it's just a click, I think Ryan_Lane is keeping the score :-). [17:37:32] I've created my keys, got pageant running and set up...
[17:37:52] followed all of the directions on https://wikitech.wikimedia.org/wiki/Help:Putty and https://wikitech.wikimedia.org/wiki/User:Wikinaut/Help:Access_to_instances_with_PuTTY_and_WinSCP#cite_note-1 [17:37:55] YuviPanda: Security groups default *and* redis? [17:38:12] security group default, and then add redis role (and NFS role) after creation [17:38:17] that's what Coren told me when I was creating it on toolsbeta [17:38:40] Can't security groups not only be added when creating, i. e. *not* afterwards? [17:38:54] YuviPanda: You're confusing puppet roles with security groups [17:39:03] Security groups *must* be set at instance creation. [17:39:10] yeah, but does redis have a security group? [17:39:13] I don't see one in toolsbeta [17:39:20] There should be one if there isn't. [17:39:29] scfc_de: do you see one on tools? [17:39:49] Coren, YuviPanda : There is one ("redis"). Should I enable this *and* default, or only "redis"? [17:39:54] I'm down to the "Type “ssh ”. " bullet on https://wikitech.wikimedia.org/wiki/Help:Putty and have entered "ssh bots-labs" into the text field labeled “Remote command:” in the Connection → SSH category [17:40:00] Always add default. [17:40:04] k [17:40:06] but still no go. [17:40:25] Prompts me for "login as:" [17:40:34] Technical_13: I don't think there is an instance named 'bots-labs', and the bots project is obsolete and shouldn't be used. [17:40:48] What should the instance be? [17:40:59] YuviPanda: "Created instance i-00000876 with image "ubuntu-12.04-precise" and hostname i-00000876.pmtpa.wmflabs." [17:41:00] https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000816 [17:41:06] Technical_13: ... it depends on what you're trying to do. [17:41:19] Ah, so there was an instance named bots-labs. [17:41:29] petan added me to wm-bot so I can restart it when it is netsplit. [17:41:31] scfc_de: \o/. now add NFS and redis roles. Do check every step with Coren, though :) [17:41:36] Ah. [17:41:42] Did he add you to the project itself? [17:42:02] we're actually running into space issues [17:42:02] Coren: Petrb added you to project Nova Resource:Bots 08:11 [17:42:09] is what it says in my notifications [17:42:36] mostly thanks to nova's insistence on using raw disk images as the base images for the /mnt images [17:42:43] so each host is eating 350G of wasted space [17:42:46] scfc_de: did you do a large? [17:42:47] Then it should work. Mind you, I'm not putty expect. I suppose "login as:" is because you haven't set a default username? In which case you should be able to just type it there. [17:43:15] expert* [17:43:15] Shouldn't that just be my wikitech username? [17:43:26] Technical_13: No. Your shell username. [17:43:40] ahh.. that might be it... let me try that. [17:43:50] YuviPanda, Ryan_Lane: Yes. Should I use a smaller one? [17:43:55] scfc_de: no no I think it is fine :D [17:44:24] Coren: That did something but it flashed by so quick I couldn't read it all... [17:44:29] Ryan_Lane: So the number of instances is the problem, not their configuration? [17:44:48] I accepted the connection for bastion, I should be able to get to bots-labs now... I hope. [17:45:05] Coren: NFS = enable role::labsnfs::client, run puppetd, reboot? [17:45:14] scfc_de: Basically. [17:45:21] scfc_de: I'm going to limit Redis to 7G, let me put up a patchset [17:45:42] (currently limited to 1G) [17:45:44] YuviPanda: But I can proceed for now? [17:45:45] lol. does !monkey refer to the admins or to the one reporting problems? [17:45:50] scfc_de: yeah! 
[17:46:49] !monkey del [17:46:49] Successfully removed monkey [17:47:03] there :) [17:47:13] *Argl*, tools-redis's key changed due to the delete/create. [17:47:37] Aha! Eureka! [17:47:39] Coren: scfc_de https://gerrit.wikimedia.org/r/77152 [17:47:57] !smart [17:48:06] :( [17:48:20] tools-redis [17:48:20] load_one: down [17:48:20] Last heartbeat 911s ago [17:48:55] Ryan_Lane: Is the memory size a variable that Puppet can access? (Cf. YuviPanda's https://gerrit.wikimedia.org/r/77152.) [17:49:11] hmm, I think there was something that let you configure that from the wikitech interface [17:49:18] I remember vaguely reading docs about that [17:49:18] let me look [17:49:31] beta labs is breaking my heart this week [17:51:30] scfc_de: yes [17:51:34] scfc_de: via facter [17:51:42] scfc_de: run: facter [17:51:54] those variables are accessible in puppet as global variables [17:51:56] hmm, so I can find memory size via facter and then just do a '- 1G'? [17:52:08] I don't want to limit redis to 8G on a 8G machine, that sounds not so great [17:52:39] YuviPanda: yep [17:52:49] I'll probably just add an 'input field' [17:52:49] YuviPanda: or even better do a percentage ;) [17:53:19] Ryan_Lane: input field in wikitech vs percentage? I'm leaning towards input field [17:53:32] ah, you mean a variable? [17:53:37] a variable is fine too [17:53:38] yeah [17:54:33] YuviPanda: It looks as if Puppet can do some arithmetics: http://docs.puppetlabs.com/guides/language_guide.html, but I don't know whether it allows that for variables/parameters as well. [17:54:44] scfc_de: oh, it definitely does variables [17:56:01] YuviPanda: Even though -- are facters strings or numbers? [17:56:18] they're strings, but puppet is really secretly ruby, so.. .:) [17:56:23] i'm just going to do a variable [17:56:34] scfc_de: you see the inputboxes in the wikitech page for adding pupet roles, right? [17:56:46] YuviPanda: Probably better for hosts that serve multiple roles as well. [17:56:49] yeah! [17:56:52] doin it, moment [17:57:02] YuviPanda: Yes. BTW, -redis is puppeted and rebooted and now ready. [17:57:13] scfc_de: ooo, nice! \o/ [17:57:16] let me just fix the class [17:58:51] i HATE json -.- guasjhdkjasd [17:59:31] wrong channel... -.- [18:01:18] Ryan_Lane: andrewbogott is it 'class toollabs::redis ( $var = "default" ) inherits toollabs' or 'class toollabs::redis inherits toollabs ( $var = "default" )'? [18:01:25] the former, I think. but i couldn't find too many examples [18:01:55] I think the former too, although that's ugly. [18:02:00] yeah [18:02:08] let me just submit and see what jenkins tells me :P [18:06:53] YuviPanda: Redis has absolutely no AAA right? [18:06:59] Coren: AAA? [18:07:03] auth? [18:07:14] authentication, authorization and accounting [18:07:15] Coren: you can specify a 'password' if you want, but it's over cleartext and mentioned in the config file [18:07:36] it does have forms of 'accounting' where you can figure out which keys are taking up how much space, but we've it disabled. [18:07:48] * Coren ponders. [18:08:01] So, not usable for a registry. [18:08:16] Coren: are you thinking of using it as a registry for mapping urls to services running on the grid? [18:08:38] Yeah, amongst other things. 
I'm going to default back to /data/project/.system [18:08:53] Coren: that's what hipache does :D but yeah, you've to treat redis as transient [18:09:14] because it is trivial to make it so - spam it with milllions and billions of entries and your old keys will get dropped when it runs out of RAM [18:09:23] it should never be a single point of knowledge in an open environment [18:10:38] YuviPanda: But you are patching the basic redis class? I think a variable/parameter might be helpful there as well. [18:11:03] scfc_de: the 'base' redis class already has a parameter for it. It's just not used in toollabs::redis [18:14:51] YuviPanda: Ah. [18:15:33] scfc_de: I'm just going to create redis::small and redis::large [18:15:33] as roles [18:15:53] Coren: I seem to be connecting, but putty is closing out instead of leaving me a command prompt... any putty people you can suggest? [18:16:22] Hm, I think petan uses Windows primarily, I expect he has good puty-fu [18:16:49] Coren: so, we probably want to do something about the disk usage before wikimania :) [18:17:00] or during [18:17:03] Ryan_Lane: _before_? It's that bad? [18:17:09] most hosts are at 85% or more [18:17:13] Ryan_Lane: DevCamp seems like a good opportunity. [18:17:16] some are at 95% [18:17:23] Ow [18:17:29] it's due to the hosts using 160G ephemeral disks [18:17:40] because they have a base disk that's 160G [18:18:06] we probably need to patch the code to use qcow2 base disks for ephemeral [18:18:18] we'll need to shutdown instances using the raw base [18:18:32] and move them to the new base [18:18:37] then delete the base [18:19:32] * YuviPanda makes all your base joke [18:20:08] YuviPanda: You can't because they belong to us, not you. :-) [18:20:38] hey, stereotypically speaking I'm better at broken english! [18:24:12] scfc_de: can you hop on to -operations? [18:48:39] ok, good. I asked vishy about adding a config option for ephemeral disk types [18:48:45] he said it would be fine [18:49:13] I'm going to change the hardcoded value in production (via puppet) and push a change in for havana [18:49:45] then we can modify current images and save about 320G per host [18:50:26] we need to look at expanding the cluster soon, though [18:51:23] anyway, this live hack needs to happen now :) [18:51:24] * Ryan_Lane does it [19:19:47] is a 320G instance really using the 320 G or is that dynamically allocated? [19:19:54] (same question for memory / CPU ) :D [19:20:26] i think on most of beta instances we don't need /dev/vdb at all [19:20:38] and some could get less memory allocated probably [19:25:02] hashar: My biggest problem is that most of my instances have local disk space they have no use for because it was the only way to get the ram/CPUs. [19:25:16] yeah that is my concern [19:25:24] Ryan_Lane: Can't we provide a couple of (more ram and CPUs w/ no disk) options? [19:25:37] hashar: please don't create a 320G instance right now [19:25:40] we also created 2 16GB memcached instances which then comes with ton of disk/cpu [19:25:43] unless you want to break things [19:26:05] I guess I just pushed in that qcow2 change [19:26:27] so maybe it won't actually cause issues [19:26:32] maybe wait an hour or so :) [19:30:28] Ideally, RAM, CPU and disk space would be separate options in the UI. [19:30:35] &ping [19:30:36] Pinging all local filesystems, hold on [19:30:37] Written and deleted 4 bytes on /tmp in 00:00:00.0002540 [19:31:29] Written and deleted 4 bytes on /data/project in 00:00:52.7516270 [19:31:30] NFS seems to be hanging again. 
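(Editorial aside: wm-bot's &ping output above suggests a probe along these lines; a rough, assumed reconstruction in Python, timing a tiny write-and-delete on each filesystem to surface NFS stalls like the 52-second one just logged. The 4-byte payload and the path list mirror the bot output; everything else is guesswork.)

```python
# Hedged sketch of an &ping-style filesystem latency probe.
import os
import time
import uuid

def probe(mountpoint, payload=b'ping'):          # 4 bytes, like the bot writes
    path = os.path.join(mountpoint, '.ping-%s' % uuid.uuid4().hex)
    start = time.time()
    with open(path, 'wb') as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())   # make sure the write really hits the server
    os.unlink(path)
    return time.time() - start

for fs in ('/tmp', '/data/project'):
    print('Written and deleted 4 bytes on %s in %.7f s' % (fs, probe(fs)))
```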
BTW, Coren, the warnings from the SGE accounting file copying cron job indicate that the failures exceed five minutes. [19:32:22] scfc_de: Only looks like a bit over 2 to me: http://ganglia.wikimedia.org/latest/graph.php?r=1hr&z=xlarge&h=labstore3.pmtpa.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Labs+NFS+cluster+pmtpa [19:34:24] Coren: The errors come from a "cp A B && mv B C" script. The failure occurs when two scripts are executed concurrently as then one of the mvs steals the other's source. As the cron calls are spaced out by five minutes, some IO must hang for more than this time period. [19:35:15] The file is pretty big, I've no doubt that a 2-3 minute stall could make the run end up lasting over 5 minutes. [19:35:21] Also, why not rsync? [19:36:16] Coren: scfc_de can I get added to the redis security group? [19:36:38] Coren: 200 MByte? rsync would be an option, but I don't know how that changes anything. [19:36:52] Coren: I'll look up some times when the script took more than five minutes, moment. [19:36:53] scfc_de: It will actually grow the file rather than copy it. :-) [19:37:37] Coren: I think rsync creates a temporary copy of the target file, then syncs that, then renames the temp to target. So it would probably increase the disk IO :-). [19:38:28] scfc_de: You want --inplace and --append-verify [19:40:35] Coren: But then you might end up with an incomplete file as a source for the other copy job. "cp && mv" (or rsync without --inplace) is atomic. [19:41:01] Besides, rsync would have to read the target file, so there shouldn't be *less* disk IO?! [19:41:07] Which is not an issue since the file is only ever appended to, and --append-verify watches for mismatch. [19:41:19] scfc_de: Reads are cacheable. [19:41:36] scfc_de: Trust me. You want rsync --append-verify --inplace for this. :-) [19:42:19] You might even get away with just --append --inplace given that it's a log file. [19:42:30] And that'd be even faster. [19:42:33] YuviPanda, I can actually pay attention to your puppet stuff now, if you haven't already sorted everything. [19:42:37] scfc_de: can you ssh into tools-redis, do 'redis-cli' and see if that works? [19:42:43] andrewbogott: ottomata helped us get that merged :D [19:42:48] 'k [19:42:53] Did he also fix the docs? :p [19:43:00] andrewbogott: I am not sure :D [19:43:25] Coren: I *want* reliable disk IO :-), rsync instead of cp && mv is a crutch :-). [19:43:52] No, it's the efficient way to remotely sync two files, which is what you want. [19:44:22] rsync without --inplace even does cp-mv more safely. [19:44:47] Coren: Tuesday: 10:12:33, 20:38:05, 20:57:11, 23:22:35 [19:44:49] But given that we know it's a file that only grows, --inplace and --append makes it even better. [19:44:54] Coren: yesterday: 18:22:16 [19:44:59] Coren: today: 02:07:24, 05:32:40 [19:45:50] Coren: Maybe, but for me that falls under microoptimization :-). Copying 200 MByte should be a no-brainer. [19:46:24] YuviPanda: "redis 127.0.0.1:6379>" [19:46:45] hmm? [19:46:52] aaah [19:46:53] prompt [19:46:54] nice :D [19:46:58] scfc_de: works then :) [19:47:19] scfc_de: \o/ ty :D [19:47:25] scfc_de: shall I write an email to labs-l? [19:47:39] YuviPanda: Perfect. Re tools-mc, I'll set redis_something to "1GB", then "puppetd -tv", then remove it, then Puppet again to see what happens. [19:47:45] scfc_de: yeah! [19:47:56] scfc_de: puppetd -tv shouldn't *restart* redis, so it should be fine [19:48:37] YuviPanda: Even if the configuration file changes? [19:48:41] scfc_de: yup!
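(Editorial aside: a sketch in Python of the "cp A B && mv B C" pattern under discussion, with a per-process temporary name that avoids the race scfc_de describes, where two concurrent runs share B and one mv steals the other's source. Paths are illustrative; the real cron job may differ.)

```python
# Copy to a unique temp file, then rename: readers of dst never see a
# partial file, and os.rename() is atomic within one filesystem.
import os
import shutil
import tempfile

def atomic_copy(src, dst):
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dst))
    os.close(fd)
    shutil.copy(src, tmp)   # the slow part; an NFS stall happens here
    os.rename(tmp, dst)     # atomic replace, no window for a half-copy

atomic_copy('/data/project/.system/accounting',
            '/data/project/.system/accounting.copy')
```

(Coren's rsync --inplace --append variant avoids rewriting the whole file each run by only transferring the newly appended tail.)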
[19:48:43] andrewbogott: what docs? [19:49:09] YuviPanda: Okay. tools-mc is the backbone for grrrit-wm ATM? [19:49:28] scfc_de: yeah, but that's trivial to change. addshore is also using it atm, I think [19:49:29] ottomata, not sure if it's stuff you wrote or I wrote, but Yuvi was confused because this page uses the word 'parameter' but isn't referring to a class parameter but, rather, a global. https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [19:49:45] ah yes [19:49:48] danke i did write that [19:49:52] fixing [19:50:29] oh, there are other places that it calls those parameters too :) [19:50:31] fixing [19:52:03] YuviPanda: Set to "1GB", run Puppet, no changes. Now for "" ... [19:52:10] heh :D [19:52:18] ottomata, thanks [19:53:18] scfc_de: I'll set 6th of September to kill tools-mc [19:53:48] YuviPanda: And again: No changes. However I don't feel that's "as designed", but more "accidental". If you can, I think you should pursue ottomata's proposal of a proper default. [19:54:03] hmm, that still feels a bit... weird [19:54:29] ja so, that is kinda a stupid puppet thing [19:54:54] your class specifically sets the maxmemory parameter on the class [19:55:04] so, it will override whatever the default is that the class sets [19:55:19] the defaults on the class only apply when you omit passing the parameter when including the class [19:55:56] is there anyone that can help me with the dewikivoyage install on beta? [19:56:31] !log tools Added redis_maxmemory to wikitech Puppet variables [19:56:33] Logged the message, Master [19:57:15] !log tools Created new instance tools-redis with redis_maxmemory = "7GB" [19:57:16] Logged the message, Master [19:57:38] scfc_de: emailed! now fixing docs [19:58:23] BBL [19:58:44] scfc_de: ty for the help :) [20:06:28] the web server is unresponsive at times. is this the non-nfs problem again Coren? [20:07:02] JohannesK_WMDE: Yes. I'm going to be switching hardware at the return from Wikimania. [20:09:01] so it's broken, but it's going to be fixed. after the important gathering. somehow this reminds me of the toolserver... [20:19:12] JohannesK_WMDE: The alternative, possibly completely breaking a working system just before said large gathering, seems less advisable. [20:20:46] Spikes, spikes. Nothing but spikes. :( [20:53:16] This time of day really really sucks for NFS [21:11:07] Coren: well, we could try switching to labstore4 ;) [21:11:11] to rule out hardware [21:11:34] I note that there is a 12.10.6 A12 firmware out, marked 'urgent' [21:11:56] http://www.dell.com/support/drivers/us/en/19/DriverDetails?driverId=HT64R# [21:13:40] Also for H700: http://www.dell.com/support/drivers/us/en/19/DriverDetails?driverId=C3X7D [21:16:27] well, we could upgrade the firmware on one and try it [21:16:37] hm [21:17:07] "Release 12.10.5-001 has been pulled from the website as it did not include several important changes. Please load firmware version 12.10.6-0001 to include those changes." [21:17:10] you suck dell [21:17:19] that's *not* a list of fixes and enhancements [21:19:10] "Fixes an issue where the Perc battery doesn't always charge up properly, causing VDs changing from WB to WT." [21:19:11] :D [21:19:28] "Removes system events that relate to partial Patrol completion on physical drives." [21:19:44] "Corrects a potential scenario where a drive may repeatedly fail during i/o." [21:20:07] Yeah, their change log is... not the best I've seen.
:-) [21:20:30] you're sugar coating that :) [21:20:44] "Fixes a problem where sometimes something may go wrong." [21:20:49] :D [21:21:21] Coren: what did we do this month? [21:21:24] I can never remember [21:21:29] I should keep notes on this [21:21:36] we should make a monthly etherpad for it [21:22:10] Ryan_Lane: Not that much, actually. At least I didn't. Lots of docs (kma500's work for the most part), some fire fighting. I wrestled with WSGI and tried to beat some sense into NFS mostly. [21:22:19] oh, right, doc sprint [21:22:23] * Ryan_Lane adds that [21:22:46] ability to add service groups to service groups [21:22:49] A lot of user requests got hammered out; this really was the month where we've had the most newbies on tool labs and they brought lots of new use cases. [21:22:54] oh, that got merged? [21:23:07] it didn't get merged? [21:23:12] I thought it was merged the next day [21:23:14] I... don't know? [21:23:25] YuviPanda: Yeah; Ryan still doesn't like it but it got merged since there are valid use cases. [21:23:31] ah, okay [21:23:38] an email to labs-l, perhaps? :) [21:23:40] Coren: please update tools stats: https://www.mediawiki.org/wiki/Wikimedia_engineering_report/2013/July [21:24:29] For that matter, I don't like it all that much either, but it's clearly useful in some scenarios. [21:24:41] heh [21:24:44] it's complicated [21:24:53] and it makes external reuse of the groups more difficult [21:25:10] but not impossible, depending on how things are looked up [21:27:09] * Coren needs to leave, his SO is taking him out for dinner. Yum. Japanese. [21:27:51] * Ryan_Lane waves [21:27:54] Bon appetit! [21:28:01] Coren: please update that page when you get a chance [21:28:26] Ryan_Lane: I just bumped the numbers; the prose will wait until my return. :-) [21:28:31] ah [21:28:32] cool [21:28:36] thansk [21:40:41] Ryan_Lane: what powers instanceproxy now, btw? [21:40:49] nginx [21:40:59] aaah! just a static configuration? [21:41:24] Ryan_Lane: btw, re: hipache, http://blog.dotcloud.com/under-the-hood-dotcloud-http-routing-layer [21:41:42] Ryan_Lane: they apparently moved from hipache to nginx + lua after nginx got websocket support, and saw a 10x increase in performance :) [21:42:07] and that code is open too. I'll try to adapt that instead of the nodejs hipache, since we already have nginx in our infra [21:42:20] heh [21:42:24] that sounds good [21:42:43] we still need to enable http 1.1 for backend in production too [21:42:49] haven't had a really strong reason to [21:43:44] Ryan_Lane: are there any issues with instanceproxy as such? I guess hipache-nginx would be very useful for running services on the grid, but other than that? [21:43:50] would replacing instanceproxy be useful? [21:44:20] well, the issue is that it requires the use of subdomains of instance-proxy [21:44:29] it would be nice for it to use any dns name [21:44:40] ah, right. [21:44:54] but that'd need some configuration interface [21:45:00] should be easy enough to add it to wikitech, perhaps [21:45:13] yep [21:45:44] wikitech could talk directly to redis at first [21:45:46] Ryan_Lane: I think I'll make this my 'official' Wikimania devcamp project :) Should I test / run this on its own project, or on an instance in toolsbeta? 
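(Editorial aside: a hedged Python sketch of the Redis-backed routing table that hipache, and the nginx+lua port linked above, consult: one list per frontend hostname holding an identifier followed by backend URLs. The key layout follows hipache's published convention; the hostname, address, and port here are invented.)

```python
# Register a backend for a hostname, then do the lookup a dynamic proxy
# would perform per request (the first list element is just an identifier).
import redis

r = redis.StrictRedis(host='localhost', port=6379)
r.rpush('frontend:mytool.wmflabs.org', 'mytool')
r.rpush('frontend:mytool.wmflabs.org', 'http://10.4.0.42:8000')

backends = r.lrange('frontend:mytool.wmflabs.org', 1, -1)
print(backends)   # pick one (round-robin / random) and proxy the request to it
```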
[21:45:48] &ping [21:45:48] Pinging all local filesystems, hold on [21:45:49] Written and deleted 4 bytes on /tmp in 00:00:00.0002840 [21:45:51] and later we could add an API that supports keystone [21:45:57] Ryan_Lane: yeah, that's fine for now [21:46:04] let's make it a projext [21:46:34] *project [21:47:01] since it will eventually be used for all projects [21:47:07] true [21:47:17] actually, we can use the instance-proxy project for it [21:47:24] sounds good enough [21:47:25] add me? [21:47:29] sure [21:47:38] don't mess with the current instance, of course ;) [21:47:43] indeed :) [21:47:47] create a new instance! [21:47:54] Ryan_Lane: would also need a public IP [21:47:55] Written and deleted 4 bytes on /data/project in 00:02:06.5017100 [21:48:08] ah, it's the project-proxy project [21:48:37] Ryan_Lane: btw, if the thing is configurable only from wikitech, it can't be used for tools (since they can't guarantee a host/port combination), but I guess we can solve that later. [21:49:21] YuviPanda: we can modify the config automatically when tools are added [21:49:41] oh, wait. I see what you mean [21:49:58] yeah, grid engine can move them around across restarts or whatever. [21:50:31] we can add it as an option to jsub, of course :) But might require some form of auth [21:52:14] yeah [21:52:20] OAuth to wikitech [21:52:22] when available [21:52:34] keystone makes this a pain in the ass [21:53:25] YuviPanda: added you to the project [21:53:30] let me give it another IP [21:53:31] Ryan_Lane: sweeet! [21:54:09] allocated [21:54:16] didn't even need to increase quota [21:54:28] heh [21:54:54] you should make a security group for redis before you create an instance [21:55:02] so that you can limit redis access [21:55:11] security group for redis where? [21:55:14] in general? [21:55:18] in that project [21:55:20] ah [21:55:26] redis should be local [21:55:28] to that instance [21:55:29] then when you create the instance add the redis security group as well as default [21:55:41] YuviPanda: not if wikitech needs to talk to it ;) [21:55:46] aaah [21:55:48] righto [21:56:25] heh, project project-proxy [21:57:05] heh [21:57:14] yeah, it's very meta [21:57:23] YuviPanda: look at your groups, too [21:57:49] 50254(project-project-proxy) [21:57:59] hehe :D [21:58:21] mosh means my ssh connections live loooong [21:58:29] :D [21:58:49] Ryan_Lane: so I need to create a security group called 'redis', which makes the redis port available to... which range? [21:58:58] none right now [21:59:15] but later you'll add /32 cidrs for specific instances [21:59:15] ah, but eventually? if I Just want wikitech to talk to it? [21:59:30] hmm, so right now it's just an 'empty' group [22:00:45] yep [22:00:48] that's fine [22:01:00] but you can't add a group to an instance after it's already created [22:01:08] yeah [22:01:12] should I add proxy too? [22:01:13] an annoying openstack (and ec2) limitation [22:01:24] there's a proxy security group? [22:01:26] yeah [22:01:40] it's got some... interesting configuration in it [22:01:43] like port ranges from -1 to -1 [22:02:56] oh [22:02:56] right [22:02:58] don't add that [22:03:22] alright [22:03:30] I don't know why a rule like that exists, it's weird [22:03:42] created! [22:03:46] oh [22:03:47] wait [22:03:55] default has -1 to 1 for icmp [22:03:56] which is normal [22:04:02] proxy just has 80 [22:04:02] oh [22:04:08] ah, misread. 
[22:04:14] it should also have 443 [22:04:20] but it doesn't [22:04:24] * Ryan_Lane adds [22:04:31] so, yeah, probably want to add that rule to the instance too [22:04:36] err [22:04:37] group [22:04:44] &ping [22:04:44] Pinging all local filesystems, hold on [22:04:45] Written and deleted 4 bytes on /tmp in 00:00:00.0002140 [22:04:56] Written and deleted 4 bytes on /data/project in 00:00:11.3916360 [22:08:56] Dang you scfc_de [22:09:09] I thought petan was here.. [22:10:46] T13: wm-bota is very trusting of strangers :-). [22:11:23] &whoami [22:11:24] You are unknown to me :) [22:11:28] ... [22:12:01] I got a cloak change and am waiting for petan to update it for root. [22:12:53] I've moved up from Wikipedia to Wikimedia [22:13:53] T13: Well, you have only a few hours (or a fortnight) left :-). [22:14:39] I know... [22:15:26] YuviPanda: http://www.enovance.com/fr/blog/5858/role-delegation-in-keystone-trusts [22:16:41] oauth style delegation is also being added [22:17:04] so, it should be possible for a tools service to modify the hipache config via a keystone-enabled API [22:17:33] but only after we upgrade and support this :D [22:17:43] yeah, 'eventually' :) [22:17:56] looks like oauth will land in havana [22:17:58] in 3 months [22:18:09] it hopefully is passwordless, since the only way I see this working is if we write that into a jsub type wrapper [22:18:24] then uwsgi, rack, node - everything becomes simple to support, since we no longer have to deal with goddamn apache :D [22:18:35] just put it on the grid, put it on a port and you're off! [22:18:46] https://gist.github.com/termie/5225817 [22:19:03] whatthefuck? [22:19:44] basically this adds oauth to keystone [22:19:50] which is what they should have just done to begin with [22:19:54] <3 termie [22:20:20] heh! [22:20:38] the pie stuff threw me off in the middle, since i've no idea what they're talking about [22:21:12] At memory location 0x4AC01F... [22:21:15] that was a good one :) [22:21:25] he's just being termie [22:21:43] you should meet this guy, you'd love him. he's sarcastic and a little crazy [22:22:34] best combination, I'd think [22:22:55] "Are you ready? Are you primed? ARE YOU PUMPED!?" [22:22:56] :D [22:23:00] Ryan_Lane: I'd have moved to SF in December, except I managed to fail Software Engineering for the third year running, and now have to wait for more months [22:23:07] uuugggghhhh [22:23:23] They could've at least failed me in a real subject! but nooooo [22:23:26] YuviPanda: dude, I'm revoking your commit access, you obviously don't know how to make software [22:23:27] :D [22:23:33] :D [22:23:44] clearly! [22:26:01] Ryan_Lane: can we have Ubuntu 13.04? a lot of the nginx packages are a lot more up to date there.... [22:26:02] I guess not :( [22:26:45] it would be better to backport nginx [22:27:01] or add the backports repo [22:27:02] and luajit, and the redis-lua packages? [22:27:10] i am not sure if they're on the backports repo [22:27:11] * YuviPanda checks [22:28:05] hmm, would those be in raring backports or precise backports? [22:28:07] YuviPanda, failed Software Engineering? Is that something to do with a university or..? [22:28:20] Krenair: yeah, at my university. [22:28:29] Krenair: I've failed it for 3 years now, having written it 3 times [22:28:45] Krenair: I even gritted my teeth and drew diagrams about 'waterfall'! [22:28:56] heh [22:29:15] * YuviPanda repeats to self 'Specs must be completed and approved by all key stakeholders before any programming can begin!
Any changes must go through the requirements change management committee!' [22:29:30] Ryan_Lane: using PPAs is right out, I suppose? [22:30:33] :D [22:30:57] YuviPanda: PPAs are frowned upon, even in labs [22:31:05] the backport repos should have what you need [22:31:12] http://packages.ubuntu.com/precise-backports/allpackages [22:31:14] we eventually want to backport nginx anyway [22:31:21] hasn't got anything [22:31:35] neither nginx, nor lua. [22:31:39] bleh [22:31:53] well, I'd say backport to a local repo [22:32:02] and we'll look into backporting into the normal repo [22:32:15] I have to see if the logging module is actually needed or not [22:32:23] if so it makes things harder [22:32:54] can't I just use http://wiki.nginx.org/Install#Ubuntu_PPA for a while? isn't 13.10 an LTS? [22:33:04] 13.10 is not [22:33:07] 14.04 will be [22:33:16] bah, that's a lot more months to wait :( [22:33:23] yep [22:33:43] and I don't even have a proper local ubuntu install :| [22:33:47] :D [22:33:49] (vagrant only sort of counts) [22:33:53] you can use the PPA for now [22:33:57] whee [22:33:59] good [22:34:01] it's fine in labs [22:34:11] yeah, this ain't going anywhere near production [22:34:19] of course, it's still a bad idea in general ;) [22:34:32] for long-term [22:34:42] well, 'long term' is definitely past 14.04... :P [22:35:04] heh [22:35:11] yeah, 14.04 will make a lot of this easier [22:35:29] that's like 8 months, though [22:35:31] yeah [22:35:36] but that's 'long term' enough :P [22:35:41] indeed [22:35:56] the newest nginx has all kinds of great stuff :) [22:36:03] websockets!!!!1 [22:36:11] our version of nginx has that [22:36:16] Ryan_Lane: so, do I do puppetmaster self? [22:36:28] or can I just create a puppet repo locally, and just run puppet on itself? [22:36:30] I would, yeah [22:36:37] that's what I do for my linode servers, and it works out okay [22:36:45] 'I would' for what? self or? [22:36:59] puppetmaster::self [22:37:04] you want this to be in a module [22:37:09] right [22:37:12] and in the puppet repo [22:37:13] eventually [22:37:23] right, but can live in its own git repo in the meanwhile, I guess [22:37:28] I guess it's possible to run puppet with a module location appended [22:37:35] as long as it's actually a full module [22:37:39] right [22:37:40] then you don't need puppetmaster self [22:37:55] the only thing I'd want from operations/puppet.git is our redis module [22:38:05] I can pick that up easily enough [22:38:08] * Ryan_Lane nods [22:39:04] let me write up a todo [22:40:30] Ryan_Lane: https://wikitech.wikimedia.org/wiki/User:Yuvipanda/Dynamic_http_routing [22:40:52] (Instance proxy-project on project project-proxy) [22:40:53] :D [22:42:45] YuviPanda, why nginx when there's varnish? [22:42:57] MaxSem: does varnish support an embedded language? [22:43:04] VCL [22:43:19] can VCL read from Redis? :D [22:43:33] * YuviPanda clicks [22:44:07] do you want to read from Redis on every HTTP request? that's webscale! [22:44:09] NOT [22:44:10] :P [22:44:31] MaxSem: actually, that isn't that slow if you cache it in memory + use a local redis instance :P [22:45:01] MaxSem: you *do* need something that dynamic to route to any host/port pair without wanting to restart [22:45:20] VCL can be reloaded on the fly [22:45:27] YuviPanda: sweet [22:45:48] MaxSem: I don't think this is a supported use case for Varnish, is it? [22:45:51] MaxSem: it is for nginx... [22:46:00] also why not nginx? we also use that in production...
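"Cache it in memory + use a local redis instance" is the usual answer to the per-request-lookup objection above: keep a short-lived in-process cache so Redis is only consulted when an entry goes stale. A sketch, with the TTL and key layout as illustrative choices (the keys follow the hipache-style layout from the earlier example):

```python
import time
import redis

r = redis.StrictRedis(host='127.0.0.1')  # local instance, per the discussion
_cache = {}   # hostname -> (expires_at, backends)
TTL = 5.0     # seconds a cached route may lag behind a Redis update

def backends_for(hostname):
    """Resolve a hostname to backend URLs, hitting Redis at most once per TTL."""
    now = time.time()
    cached = _cache.get(hostname)
    if cached and cached[0] > now:
        return cached[1]                    # in-memory hit, no Redis hop
    backends = r.lrange('frontend:%s' % hostname, 1, -1)
    _cache[hostname] = (now + TTL, backends)
    return backends
```

The trade-off is exactly the one implied in the exchange: a route change can take up to TTL seconds to propagate, in return for routing decisions that almost never leave process memory.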
[22:46:44] plus I'll probably have to write C to get that to work fine. While I don't mind that, nginx seems good enough.. [22:47:12] you can insert inline C into VCL ;) [22:47:58] too late, MaxSem! :P it's nginx this time! [22:48:09] booring [22:59:11] MaxSem: varnish can't do SSL, so we'd at least need to put nginx in front of it :D [22:59:30] booring, as I already said:P [22:59:41] it would actually be possible to have a daemon that injects config into varnish, then have it switch when it's changed [22:59:44] Ryan_Lane: I just realized [22:59:45] Original-Maintainer: Kartik Mistry [22:59:48] for package nginx [22:59:49] he works for us now [22:59:54] he does? [23:00:01] yeah, alolita never sent out the welcome email [23:00:04] hahaha [23:00:07] nice [23:00:23] he's been at language engineering for a month or so now [23:00:33] assuming it's the same Kartik Mistry, of course. But the surname is uncommon enough... [23:00:34] so we can get a package with integrated bash scripting? [23:00:46] I'll just mail him to check [23:01:21] hmm, no arrbee or siebrand around. oh well [23:01:29] Ryan_Lane: perhaps we can convince him to do a backport :P [23:01:35] if it really is him [23:01:40] doing a backport is generally simple [23:01:53] you take the version from a newer ubuntu [23:02:02] and compile it for the LTS [23:02:27] MaxSem: a package with integrated bash scripting? [23:02:46] lua is not webscale!:P [23:02:55] heh [23:03:03] it's on Redis, it definitely must be! [23:03:09] though we use it in both redis and mediawiki [23:03:54] yeah, lua in redis is wonderful! [23:03:56] Ryan_Lane, that explains a lot;) [23:04:58] Ryan_Lane: https://github.com/yuvipanda/project-proxy-proxy-project :) [23:05:04] hahaha [23:05:53] i'm going to write shitty puppet, we can fix that later :P [23:05:58] heh [23:31:14] YuviPanda: updated your page with info on how to integrate with wikitech :) [23:31:43] Ryan_Lane: sweet! [23:31:49] Ryan_Lane: I should put that on gerrit... at some point. [23:32:01] I like this model where I do initial development on github, and then import it to gerrit once it is a little stable [23:32:08] heh [23:32:16] it's not a bad way to go about things [23:32:19] yeah [23:32:35] so I'm live editing in vim on the server, in /etc/puppet. that's a git repo [23:33:29] Ryan_Lane: are you any good with PuTTY? [23:33:45] I haven't used windows in about 4 years [23:33:58] T13: we have docs on putty [23:33:59] !putty [23:33:59] official site: http://www.chiark.greenend.org.uk/~sgtatham/putty/ | how to tunnel - http://oldsite.precedence.co.uk/nc/putty.html [23:34:04] seriously? [23:34:12] * Ryan_Lane grumbles [23:34:15] @search putty [23:34:15] Results (Found 1): putty, [23:34:16] I don't blame you.. I wish I could cut the cord. [23:35:31] !wikitech-putty is https://wikitech.wikimedia.org/wiki/Help:Putty and https://wikitech.wikimedia.org/wiki/Help:Access_to_instances_with_PuTTY_and_WinSCP and https://wikitech.wikimedia.org/wiki/Help:Access_to_ToolLabs_instances_with_PuTTY_and_WinSCP [23:35:32] Key was added [23:35:54] Ryan_Lane: do we have an nginx puppet thingy in our repo? [23:35:55] * YuviPanda looks [23:36:04] I really wish we'd merge those last two pages [23:36:11] it's not really a module... [23:36:22] YuviPanda: we have an nginx module [23:36:28] it may not be what you need, though :) [23:36:33] Ryan_Lane: I've gone through all three of those pages. [23:36:34] Ryan_Lane: yeah, but the templates are all...
elsewhere [23:36:38] modules/nginx [23:36:39] templates/nginx [23:36:41] files/nginx [23:36:43] T13: what issue are you having? [23:36:43] hurr durr [23:37:15] I seem to connect, but as soon as I do, the client closes on me instead of giving me a prompt.
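Circling back to the backport exchange earlier in the log ("you take the version from a newer ubuntu and compile it for the LTS"): that is essentially a no-change backport. A minimal sketch, wrapped in Python for consistency with the other examples; the .dsc URL, source directory, and version suffix are placeholders, it assumes devscripts/dpkg-dev are installed, and exact flags may vary across devscripts versions.

```python
import subprocess

def no_change_backport(dsc_url, src_dir, suffix='~precise1'):
    """Rebuild a newer source package unchanged on the LTS."""
    # Download the .dsc plus tarballs and unpack them
    # (-u skips signature checks, -x runs dpkg-source -x).
    subprocess.check_call(['dget', '-u', '-x', dsc_url])
    # Add a local version suffix so the official package supersedes this
    # build once it eventually reaches the LTS archive.
    subprocess.check_call(['dch', '--local', suffix,
                           'No-change backport for precise'], cwd=src_dir)
    # Build unsigned binary packages against the LTS toolchain and libraries.
    subprocess.check_call(['dpkg-buildpackage', '-us', '-uc', '-b'],
                          cwd=src_dir)
```

The resulting .debs can then go into the local repo mentioned in the conversation, ahead of any backport into the normal repo.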