[07:07:12] [bz] (NEW - created by: Pietrodn, priority: Unprioritized - normal) [Bug 52370] Replicated DB fawiki_p is missing revision table - https://bugzilla.wikimedia.org/show_bug.cgi?id=52370 [08:08:02] [bz] (NEW - created by: Tyler Romeo, priority: Normal - enhancement) [Bug 52354] Run Minion testing instance for security testing - https://bugzilla.wikimedia.org/show_bug.cgi?id=52354 [08:22:13] !ping [08:22:13] !pong [08:25:35] @channels [08:28:59] !puff [08:29:17] !poof [08:29:18] *POOF* "Wadda need?" *POOF* "Wadda need?" *POOF* "Wadda need?" [08:49:04] [bz] (RESOLVED - created by: se4598, priority: High - normal) [Bug 50498] "Error opening index" for File: search on Labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=50498 [08:49:54] [bz] (RESOLVED - created by: Krinkle, priority: High - enhancement) [Bug 34250] [beta project] Set up search (tracking) - https://bugzilla.wikimedia.org/show_bug.cgi?id=34250 [08:50:10] [bz] (RESOLVED - created by: Chris McMahon, priority: Normal - normal) [Bug 46459] [OPS] lucene-search-2 uses too much memory on labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=46459 [08:56:35] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Lowest - enhancement) [Bug 45122] migrate all beta instances from Lucid to Precise - https://bugzilla.wikimedia.org/show_bug.cgi?id=45122 [08:57:08] !log deployment-prep Deleting the old squid instance since we run varnish cache for text nowadays [08:57:11] Logged the message, Master [08:57:44] !log deployment-prep Deleting deployment-cache-upload03, replaced by the fully puppetized instance deployment-cache-upload04 [08:57:47] Logged the message, Master [09:02:56] !log deployment-prep rebooted both memcached instances to be able to log on them. Apt upgrading both of them [09:02:58] Logged the message, Master [09:19:12] hashar: ping [09:19:19] yup [09:19:26] hashar: how complicated is it to set up a unit testing server for mediawiki? [09:19:38] what do you want to do ? [09:19:46] I managed to get one oracle box here, and I would like to start unit testing of mediawiki there [09:19:47] ori has created a vagrant setup to run unit tests [09:20:05] mediawiki seems to have a lot of trouble when it comes to oracle [09:20:12] I wasn't even able to install it :D [09:20:23] there is some mistake in the SQL for main page creation [09:21:00] https://bugzilla.wikimedia.org/show_bug.cgi?id=52094 [09:21:09] that is more or less maintained by freakolowsky as a best-effort project [09:21:19] I think he looks at it whenever we cut a new stable branch [09:21:45] ok but it would still be cool to get notified about a commit that breaks stuff in oracle [09:22:22] how do these unit tests actually work? are they all re-executed on each commit? or just periodically done on head?
[09:22:40] petan: https://bugzilla.wikimedia.org/show_bug.cgi?id=20343 [09:22:55] the unit tests are run on each patchset submitted [09:23:13] ok [09:23:16] that makes sense [09:23:22] and whenever someone votes CR+2, then if the tests pass the patchset is automatically merged by jenkins [09:23:26] else it is rejected [09:23:36] hmm [09:23:52] I understand that wmf can't host an oracle testing box, but here it's not a problem [09:30:57] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 52378] beta memcached instances only have 90MB of memory - https://bugzilla.wikimedia.org/show_bug.cgi?id=52378 [09:34:06] [bz] (NEW - created by: Chris McMahon, priority: Unprioritized - normal) [Bug 52237] URL confusion: commons.wikimedia vs commons.wikipedia - https://bugzilla.wikimedia.org/show_bug.cgi?id=52237 [09:37:24] !ping [09:37:24] !pong [09:37:26] ~ping [09:37:40] petan: what is the command to ping the NFS writes ? [09:37:48] &ping [09:37:49] Pinging all local filesystems, hold on [09:37:50] Written and deleted 4 bytes on /tmp in 00:00:00.0006360 [09:37:56] :( [09:37:59] it doesn't work [09:38:03] :-D [09:38:10] I can't login myself due to NFS [09:38:16] must be some process being wild [09:38:31] yes it is writing to nfs now... [09:38:36] 4 bytes XD [09:39:25] Written and deleted 4 bytes on /data/project in 00:01:36.0901400 [09:39:28] hashar ^ [09:39:31] great [09:39:37] &ping [09:39:37] Pinging all local filesystems, hold on [09:39:38] Written and deleted 4 bytes on /tmp in 00:00:00.0002180 [09:39:39] Written and deleted 4 bytes on /data/project in 00:00:00.0063260 [09:40:41] &ping [09:40:41] Pinging all local filesystems, hold on [09:40:42] Written and deleted 4 bytes on /tmp in 00:00:00.0008300 [09:40:43] Written and deleted 4 bytes on /data/project in 00:00:00.0071260 [09:41:55] petan, what's causing webserver 1 on tools to be spiking [09:42:00] nfs [09:42:28] I really need to get a vacation [10:03:17] addshore: ping [10:03:23] pong :> [10:04:02] addshore: had a question regarding addbot [10:04:21] fire away! [10:04:24] addshore: at http://en.wikipedia.org/w/index.php?title=Wikipedia%3ATwinkle%2FPreferences&diff=566493418&oldid=558602674 [10:04:40] it didn't migrate the simple one there [10:04:56] hmm [10:05:47] rather odd, it should catch it on the next pass I guess [10:05:59] if it doesn't and I see other cases like this I will take more of a look [10:06:22] when is next pass? [10:06:26] it is probably because simple is an odd case :> [10:06:31] possibly [10:06:37] I just noticed it [10:06:43] well, it is still doing the first pass now, it's on mrwiki I think [10:06:56] mr. wiki? [10:07:00] yup [10:07:12] http://tools.wmflabs.org/addshore/addbot/status/ [10:07:19] it only migrates one each time? [10:07:28] To see where it is check www.wikidata.org/wiki/Special:Contributions/Addbot and which links it is adding [10:08:17] It attempts to remove interwiki links from wiki articles wiki by wiki, starting with the wiki that was checked longest ago [10:08:28] where it can, it imports more interwikis to wikidata from any given article [10:08:38] and it will also remove all possible interwikis (except for simple apparently..)
[10:09:02] k [10:09:30] wikidata should be renamed into wikinerd [10:09:37] hmmm in the removal function it has $iwPrefix = $site->getIwPrefix(); [10:09:43] https://github.com/addshore/addwiki/blob/master/includes/Page.php#L383 [10:10:02] rather odd how it didn't remove simple [10:10:03] I've never used simple [10:10:12] unless the prefix isn't getting set right [10:10:28] but as you can see here [10:10:28] https://github.com/addshore/addwiki/blob/master/includes/Site.php#L102 [10:10:33] I tried to account for simple ;p [10:10:46] hehe [10:11:30] * AzaToth wonders why people write bots in php [10:11:52] that's like writing an embedded computer for a car in visual basic [10:12:04] hah! [10:12:30] well, my thinking is if people want to write bots for wikipedia they may as well do it in php :P then when they also want to start hacking mediawiki there is less to learn xD [10:12:36] Hence I started writing this php framework [10:12:50] I'd love other people to help out of course ;p [10:13:28] addshore, there's only room for one sophisticated PHP framework, and that's Peachy. You're supposed to help me. :p [10:13:36] Cyberpower678: I don't like peachy ;p [10:13:53] I'd just prefer to write my own so I know how it works ;p [10:13:55] addshore, oh? You liked it a couple of weeks ago. [10:14:10] yus :P but then I managed to write this framework that did exactly what I wanted ;p [10:14:33] Peachy can do exactly what you want too. [10:14:43] can it edit wikidata yet? :O [10:14:57] It can edit any wiki, [10:15:10] mhhm, I'm not sure it knows about the wikidata api yet ;p [10:15:17] unless you have coded it recently :P [10:15:36] Cyberbot is editing Wikidata right now. [10:15:44] :O [10:15:45] *checks* [10:16:05] can't find it :O [10:17:12] what the .... [10:17:16] User:Cyberbot I. [10:17:21] 15 of my jobs just fell off the grid :P [10:17:29] :DDDDDD [10:17:45] I hacked your account as punishment for not using Peachy. ;P [10:17:47] heh Cyberpower678, can it edit entities though [10:18:01] It will, soon. [10:18:23] Peachy in its final release will support all extensions. [10:18:43] Cyberpower678: all? [10:19:03] serious? [10:19:04] Right now, I'm still removing the rust on its core. [10:19:09] zhuyifei1999, yes. [10:19:15] hehe Cyberpower678 see, I started with a fresh core ;p [10:19:40] even a third party extension? [10:20:21] That's my goal. Which is why I'm not on Wikipedia a lot. I've been doing almost only bot work. [10:20:59] Extensions available on the MediaWiki site. Anybody can contribute to writing extension support. [10:21:06] addshore, let me see the framework. [10:21:17] heh https://github.com/addshore/addwiki [10:21:33] lots of holes currently, I have only been using it for under a week [10:21:39] lots to do [10:22:59] CP678: nick [10:23:06] Thumbs down. Being bot and Wikimedia specific is a no-no. [10:23:26] addshore, mistakenly put my computer to sleep. [10:23:40] haha [10:23:57] Peachy is still better at this point. The core is almost fresh again. [10:24:17] addshore, I'm putting the paint on it now. :p [10:25:41] addshore: your framework looks a little too like peachy [10:25:49] really? O_o [10:26:49] I haven't had enough time to look at peachy much :< [10:27:28] addshore, the structure is that of Peachy. [10:28:09] The coding is different, but is generally set up like Peachy. Except that your framework isn't designed to support plugins.
:p [10:28:23] Cyberpower678: indeed, that's on my todo ;p [10:28:46] currently plugins are just supported as part of core :P [10:28:50] Your todo list is much bigger than mine. :p [10:29:01] yup :p [10:29:07] :D [10:29:19] my problem with peachy is, you don't use github properly :p [10:29:38] I probably would have tried to contribute some code, but you never show what the actual state is xD [10:29:54] addshore, ?? [10:30:33] addshore, what are you saying? [10:30:45] it was hard enough trying to get you to put it on github in the first place originally :P [10:31:10] It's on GitHub now. What you see up there are my latest updates. [10:32:30] addshore, ^ [10:32:36] :O [10:32:36] Have a look. [10:32:51] I'm working on Alpha 4 now. [10:34:14] addshore, when it's able to support Wikidata, will you test Peachy? [10:34:30] I'll have a go at moving my scripts over maybe [10:34:41] might even look at the code later and try and add the basic wikidata support [10:35:11] addshore, cool. [10:36:12] :) [10:40:21] (CR) Yuvipanda: [C: 2 V: 2] "Merging because I guess this is just a tool" [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76764 (owner: Yuvipanda) [10:50:31] AzaToth +1 :P [10:50:34] regarding bots in php [10:50:53] but TBH bots in python are not much better [10:52:11] petan, at least PHP uses braces. :p [10:52:49] well, from a syntax perspective it's clearly better [10:53:00] yes it is. [10:53:36] * YuviPanda adds that to quips for being funny [10:55:09] python's syntax truly sucks as nothing [10:57:21] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - enhancement) [Bug 52382] automatically import some content from production (tracking) - https://bugzilla.wikimedia.org/show_bug.cgi?id=52382 [10:57:22] [bz] (NEW - created by: Željko Filipin, priority: Unprioritized - normal) [Bug 47205] Sandbox gadget not at en.wikipedia.beta.wmflabs.org - https://bugzilla.wikimedia.org/show_bug.cgi?id=47205 [10:57:23] [bz] (NEW - created by: Matthew Flaschen, priority: Normal - enhancement) [Bug 49791] sync-site-resources should sync all Labs wikis - https://bugzilla.wikimedia.org/show_bug.cgi?id=49791 [10:57:24] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Normal - enhancement) [Bug 49779] sync articles from production wikis (css/gadgets) - https://bugzilla.wikimedia.org/show_bug.cgi?id=49779 [11:02:08] [dispatcher-labs] benapetr pushed 1 commit to master [+0/-0/±2] http://git.io/1LwXxA [11:02:10] [dispatcher-labs] benapetr 3e6746d - some more debugging output to make stuff clear [11:12:19] (PS2) Yuvipanda: Try to erase file whenever app exits. Also never exit cleanly [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76837 [11:12:30] [dispatcher-labs] benapetr pushed 1 commit to master [+0/-0/±2] http://git.io/I6pVqQ [11:12:31] [dispatcher-labs] benapetr 08faf76 - fixed Load() [11:13:46] (CR) Yuvipanda: [C: 2 V: 2] Try to erase file whenever app exits.
Also never exit cleanly [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76837 (owner: Yuvipanda) [11:14:12] @q Not-002 [11:16:42] (PS2) Yuvipanda: Do the key generation for registering on the server [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76845 [11:18:16] (CR) Yuvipanda: [C: 2 V: 2] "Post commit review, yo :)" [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/76845 (owner: Yuvipanda) [12:15:26] &ping [12:15:26] Pinging all local filesystems, hold on [12:15:27] Written and deleted 4 bytes on /tmp in 00:00:00.0005330 [12:15:28] Written and deleted 4 bytes on /data/project in 00:00:00.0070970 [12:21:53] &whoami [12:21:53] You are unknown to me :) [12:21:59] &trusted [12:21:59] I trust: test (trusted), [12:22:08] &ping [12:22:08] Pinging all local filesystems, hold on [12:22:09] Written and deleted 4 bytes on /tmp in 00:00:00.0005810 [12:22:10] Written and deleted 4 bytes on /data/project in 00:00:00.0073160 [12:22:19] petan, what does it do? [12:22:54] [bz] (ASSIGNED - created by: Amir E. Aharoni, priority: Normal - enhancement) [Bug 52222] fill http://he.wikipedia.beta.wmflabs.org/ with some useful data from he.wikipedia.org - https://bugzilla.wikimedia.org/show_bug.cgi?id=52222 [12:23:31] [bz] (NEW - created by: spage, priority: High - normal) [Bug 51580] configure beta labs for SUL2 - https://bugzilla.wikimedia.org/show_bug.cgi?id=51580 [12:24:20] [bz] (RESOLVED - created by: Željko Filipin, priority: Unprioritized - normal) [Bug 47360] at en.wikipedia.beta.wmflabs.org Special:UserLogin opens after logging in instead of Main_Page - https://bugzilla.wikimedia.org/show_bug.cgi?id=47360 [12:25:44] [bz] (NEW - created by: Željko Filipin, priority: Normal - enhancement) [Bug 47205] sync Sandbox gadget from production to en.wikipedia.beta.wmflabs.org - https://bugzilla.wikimedia.org/show_bug.cgi?id=47205 [12:27:42] [bz] (NEW - created by: Chris McMahon, priority: High - enhancement) [Bug 50335] support dvwiki in beta labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=50335 [12:35:20] These webserver spikes are getting out of control. [12:35:55] Labs has become the next generation toolserver. [12:36:21] People wanted something identical to toolserver, and they got it. Instabilities and all. :p [12:41:36] quick question about the labsdbs, are they also queryable from non-labs machines from within one of our dcs? [12:42:12] drdee, no. You can only query while SSH'd into labs. [12:42:23] 100% sure? [12:42:29] Yes. [12:42:32] ty! [12:42:41] Labs is internal access only. [12:42:50] Toolserver has external access. [12:46:21] !log deployment-prep enwikivoyage's search index finished building overnight. dewikivoyage seems to have stalled out. I'm going to profile it. simplewiki is still running and will need some love to finish more quickly. [12:46:24] Logged the message, Master [12:55:32] !access | T13|needsCoffee [12:55:33] T13|needsCoffee: https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [13:15:52] andrewbogott_afk ping [13:43:34] petan: "nothing" doesn't suck per def [13:43:44] or does it? [13:43:48] maybe :) [13:43:56] brainfuck did by definition AFAIK [13:44:15] and there is little difference between python and brainfuck [13:44:16] :D [13:44:29] heh [13:45:05] brainfuck is a low-level python [14:19:57] hi, guys. I am developing a bot with http://dumps.wikimedia.org/other/pagecounts-raw/ for KOWP.
but I think you already have one which lists popular articles or suddenly popular articles. [14:21:18] Actually I have some issues with utf-8 decoding. [14:51:13] who do I ask for sudo on a project? I'm looking at integration-jenkins2 and working on tarball releases [14:54:20] nm... got it [14:55:05] hexmode: I was about to say "whoever is project admin" [14:55:35] * Coren conceptually throws uwsgi to the floor and stomps on it. Bad! Bad! [15:02:01] ryuch___: I assume you are writing it in python? [15:09:35] !log integration unbroke puppet on integration-jenkins2 [15:09:37] Logged the message, Master [15:18:35] Yeah, so, I've been working in and around python for a few weeks now. Impressively, the more I work with Python the more I hate it. [15:20:47] heh, same here [15:21:10] just that I haven't been working with python that long :D [15:21:31] there is only 1 thing I like about it [15:22:19] I can put a semicolon at the end of a line and it's not a syntax error, which in fact is a bad thing, because it's not strict [15:47:57] Do I need special rights to authenticate to graphite.wikimedia.org? I tried my labs login as mentioned in the auth prompt but am being 401'd [15:48:55] yes [15:49:25] unfortunately there is private data in the full graphite views so we can't open this up to the whole world [15:49:39] paravoid: who do I need to poke to get access? [15:49:47] https://gdash.wikimedia.org/ is public though [15:50:49] oh wait, you're staff [15:50:55] I just realized :) [15:51:02] paravoid: oh. yes. n00b staff [15:51:27] paravoid: no worries. I don't think I have my official cloak yet [15:51:47] paravoid: also, hello I'm the new guy in robla's group [15:51:55] yes, I remembered the name :) [15:52:05] I'm Faidon from the operations team [15:52:49] what's your labs username? [15:52:59] BryanDavis [15:53:38] sec [15:55:31] [bz] (NEW - created by: Pietrodn, priority: Unprioritized - normal) [Bug 52370] Replicated DB fawiki_p is missing revision table - https://bugzilla.wikimedia.org/show_bug.cgi?id=52370 [15:57:06] bd808: try again [15:57:41] paravoid: I'm in! thanks [15:57:49] you're now part of the "wmf" group [15:58:01] we have a few other places where we have access controls based on that group [15:58:04] e.g. ishmael [15:58:32] there are plans to move this to an NDA group [15:58:37] to open that up to volunteers [15:59:18] I suppose at some point "ishmael" will mean something to me other than _Moby Dick_ [15:59:21] :) [15:59:22] haha [15:59:28] https://ishmael.wikimedia.org/ [15:59:43] https://wikitech.wikimedia.org/wiki/Ishmael.wikimedia.org are the docs about it [16:02:31] oh cool. Somebody actually had sent me an old email thread about a tool for slow query analysis. [16:02:44] * bd808 slowly learns things that will be important later [16:12:30] what are you looking for in graphite? [16:12:34] out of curiosity mostly [16:13:38] just poking around. I'm working on a small project to add some tracking of cache purges and will be pushing data there eventually [16:13:48] https://www.mediawiki.org/wiki/Multimedia/Cache_Invalidation_Misses [16:13:49] what kind of tracking? [16:13:52] oh [16:14:31] we want to get some visibility into purge packet loss [16:14:42] we already have such means [16:14:59] o_O [16:15:06] it's kinda strange you're working on this, our team is also working on that [16:15:18] the vhtcpd stats? [16:15:21] yes [16:15:36] You [16:15:46] are not in the *same* team, are you?
:-) [16:15:50] we are not :) [16:16:04] well, broadly speaking we *all* are in the same team [16:16:07] ;) [16:16:29] we have also discussed pgm/0mq a bit [16:16:41] this is not related to multimedia at all btw [16:16:47] this is how we do purges everywhere [16:16:59] Brandon seems to be against 0mq somewhat [16:17:09] oh you've spoken to brandon already, that's good [16:17:20] a little. one set of emails [16:18:11] I'm starting with the idea of checking Age headers for URIs that the db has marked as overwritten [16:18:49] all urls you mean? [16:18:49] robla thinks this may let us be more proactive about finding things that are borked [16:19:07] that messes with LRU, I don't think this is such a good idea [16:19:44] I think for the short term checking vhtcpd stats is a good measure and for the longer term the solution would be to switch to a lossless transport [16:19:47] hmmm that makes sense. would make things hot arbitrarily [16:20:02] but I think bblack is working on all that? [16:20:32] he has stats in the daemon. I don't think they are being graphed yet. [16:20:42] which may be the best thing for me to add first [16:20:58] I was under the impression he was planning for a ganglia plugin [16:21:31] we should have a larger discussion. what mailing list is good for stuff like this? [16:21:41] or a hangout or something [16:21:55] well, to me this sounds like right within ops' realm [16:22:00] so I'd say ops@ [16:22:10] I think robla and bawolff should be in on it [16:22:27] otoh, you've filed this under mediawiki.org/Multimedia, so it's getting a little strange :) [16:23:33] haha. well it landed there because I'm currently on loan to the MM team and they are concerned about bug 49362 [16:23:44] but yeah, varnish & vhtcpd are definitely ops [16:24:02] ops@lists.wikimedia.org [16:24:03] being n00b and a designated utility player will make lots of things I do weird [16:24:30] I need to get on that list. is it open or staff only? [16:24:50] it's staff only, started as ops-only but this is definitely not the case anymore [16:26:40] paravoid: I just signed up. I'll work on an intro email about the problem and see if we can get a reasonable plan underway for how we can all work together [16:27:49] !log deployment-prep building search index for commonswiki and the other wikis that aren't in the main section of http://deployment.wikimedia.beta.wmflabs.org/wiki/Special:SiteMatrix [16:27:52] Logged the message, Master [16:28:45] bd808: it would be more correct to say that I don't think the 0mq+pgm solution is worth it [16:29:00] bd808: I was actually the guy who came up with that idea in the first place, so ... :P [16:29:11] bblack: noted. Not trying to put words in your mouth [16:29:57] I'm glad I poked my head in here randomly. I think this is a great discussion [16:30:43] (I'm glad too, usually #-operations is a better forum for such discussions though) [16:31:11] as in, you have higher chances of getting meaningful answers/discussions :) [16:31:54] bd808: the problem with the 0mq+pgm thing is, if you really want it to be reliable, architecting it becomes really complicated.
You need some redundant hubs that handle publish/subscribe between the PHP sender nodes and the vhtcpd receivers, and then you can't lose pubs and subs either, or lose the retrans when one of them crashes [16:32:17] it's not like we don't have a hub to make multicast->unicast right now though :) [16:32:23] (for esams) [16:32:24] from the 1000-ft view, I think you either end up poorly implementing that and adding to the failures, or implementing it really well but the complexity cost isn't worth the results [16:33:06] grrr- labs machines keep freezing on me.... [16:33:33] bblack: I've never done it, but it's theoretically possible to stick RabbitMQ into a 0MQ network to act as a reliable relay [16:33:46] I'd rather see us first monitor vhtcpd better (which is an ops task), detect whether there are any remaining failure modes that matter enough to care. I suspect we'll see a few instances of multicast relay failure or bursts of dropped multicast here or there, but that there will be ways to reduce those issues as they're debugged [16:34:10] stick X into Y and wave magic wand and things are reliable never actually works out so easily in practice :) [16:34:22] I think a reliable transport is something we should plan for the mid term though [16:34:39] we have persistent caches, this means that if we need to reboot for a kernel upgrade or the motherboard fails or something [16:34:46] we'll lose N minutes of purges [16:35:07] but still serve out of cached content [16:35:18] so this has worked so far but it's a bit suboptimal I think [16:35:24] I still don't have a full grasp of the whole purging picture, but I suspect there may be other things we can hack around on that lessen the impact of missed purges in the first place. [16:35:32] bblack: can I help the process by writing the ganglia plugin for your stats? [16:36:25] if you really want to, feel free :) [16:36:29] root@cp1040:~# cat /tmp/vhtcpd.stats [16:36:30] start:1375268516 uptime:106410 inpkts_recvd:35459921 inpkts_sane:35459921 inpkts_enqueued:35459921 inpkts_dequeued:35459921 queue_overflows:0 [16:36:53] start is a unix gmt timestamp, uptime is how long the daemon has been alive, the file is overwritten every so often (30s I think?) [16:37:12] we have similar monitoring of udp2log log streams [16:38:11] inpkts_recvd is multicast that hit the daemon, _sane means it parsed correctly and wasn't garbage, _enqueued means it made it into the sending queue, and _dequeued means it was sent as TCP PURGE to all applicable varnish instances [16:38:28] queue_overflows means the queue backlogged by such a huge volume of memory that the daemon gave up and wiped out all queued requests [16:39:02] we also have nagios alerts for udp2log streams I think [16:39:07] so you just have to track those counters over time into a graph, basically [16:39:23] but these come with a seq number so it's easy to measure packet loss [16:39:42] bblack: sounds easy in theory [16:39:58] graphing is easy, detecting anomalies isn't :) [16:40:15] I mean, sure, it's easy to measure "0", but how about 1-2% packet loss? :) [16:40:19] some easy/obvious triggers would be any of the rates being zero for a few samples in a row [16:40:28] or non-zero queue_overflow numbers [16:40:44] I've never really understood why more people don't use the predictive stuff in rrdtool...
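(Editorial aside: a minimal sketch, in Python, of how the /tmp/vhtcpd.stats counters bblack describes above could be scraped for graphing. The path and field names come from the paste; the derived metrics and everything else are assumptions, not the actual ganglia plugin.)

```python
# Hedged sketch: parse vhtcpd's space-separated "key:value" stats line and
# derive a few loss indicators from the counters explained above.
def read_vhtcpd_stats(path='/tmp/vhtcpd.stats'):
    with open(path) as f:
        pairs = (field.split(':', 1) for field in f.read().split())
    return {key: int(value) for key, value in pairs}

stats = read_vhtcpd_stats()
not_sane = stats['inpkts_recvd'] - stats['inpkts_sane']             # garbage input
still_queued = stats['inpkts_enqueued'] - stats['inpkts_dequeued']  # send backlog
print(not_sane, still_queued, stats['queue_overflows'])
```

(A plugin would sample this every so often and graph the deltas; per the discussion, non-zero queue_overflows or a rate stuck at zero for a few samples are the obvious alert triggers.)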
[16:41:39] graphite has holt-winters and some other stuff that might be useful [16:41:56] the hard part is figuring out the expected rates in all this [16:41:56] because that's for detecting general trends, not precise measures such as 1-2% packet loss [16:42:13] I wonder if we could instrument the outbound side in mw? [16:43:45] re the LRU cache tampering inherent in polling for Age headers: is there an HTTP verb that would give us data without corrupting the state? [16:44:34] you could engineer that within varnish, but honestly I think it's silly to hit purged URLs to see if the purge succeeded [16:45:33] I think bawolff's idea here is to do statistical sampling to get an idea about that possible 1-2% loss rate [16:45:55] and some idea if it's constant or peaks with other events [16:46:12] it's probably a lot smaller than 1% too [16:46:27] I think it's an entirely wrong layer to approach this [16:46:39] if we have 1-2% packet loss, it's either network packet loss or some issue with vhtcpd [16:46:40] it's just one of those things in an eventually consistent system that makes content authors crazy [16:46:47] vhtcpd has detailed stats so we can detect this there [16:46:57] network packet loss... well, we can easily measure that can't we :) [16:47:55] without ever checking http [16:48:46] well, probably the most interesting and ripe area for loss would be in the sending and receiving machine's kernel-level network buffers [16:49:17] on the receiving end, vhtcpd is pretty damned efficient at keeping the queue empty and setting the queue size to "super huge", but do the sending machines ever drop on outbound queues? [16:49:55] the amount of htcp on the sending side is much lower than on the receiving side though :) [16:50:03] 1/N where N is the amount of appservers [16:50:14] and appservers are much less busy in general [16:50:15] linux kernel has (had?) some udp rate issues. Facebook hit them when they switched to UDP for memcache [16:50:32] they're worse machines but not much worse [16:50:40] ok [16:50:50] well, not _that_ worse [16:51:12] noted. worse, but not _that_ worse, but not much worse :) [16:51:27] lol :) [16:52:11] that's my impression anyway, have a look, maybe you'll see something that I won't [16:52:11] I think the long term solution for images is to change the thumb url structure to self version. That won't fix anon content cache staleness but that's less likely to be noticed by editors [16:52:32] I've heard this idea before [16:52:35] I don't like it much tbh [16:52:48] bd808: stepping out a few layers again, I think some of the reaction to this problem is still grounded in the rate of problems reported in the past. we knew the major fault was the relay daemon, and we replaced it with something better. Need more data now to determine exactly how much of a problem still exists, to know what kinds of solutions are "worth it" [16:53:19] I don't like creating second class citizens wrt caching [16:53:32] mhhm, how easy is it to get a repo on gerrit? [16:53:59] bblack: agreed. we need to know what's broken and how often before we can reliably evaluate a fix [16:54:06] bblack: do you know if there are indications when the kernel drops udp? [16:54:06] *googles* [16:54:14] /proc/sockstat or something similar?
[16:54:30] bblack: otherwise we are just poking things and hoping something different happens [16:54:41] /proc/net/sockstat even :) [16:55:31] well /proc/net/snmp has [16:55:31] Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors [16:55:31] Udp: 1104530187 4270 876485 117501449 123041 0 [16:55:46] RcvbufErrors == overflow? [16:56:53] /proc/net/udp has a "drops" field too [16:56:55] but without a way to reset those, it's tricky. those stats go back to reboot, and we've done a lot of restarting things and fixing things since then, undoubtedly some of that had the daemon not picking up packets for brief intervals, etc [16:57:12] well, rate is what we'll measure, won't it? [16:57:19] ganglia maybe [16:57:33] we monitor drops for udp2log procs [16:57:55] good point [16:57:59] 15021: 00000000:12DB 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 2821439826 2 ffff8807f95f0e00 0 [16:58:10] ottomata: :) [16:58:11] :12DB is the multicast listen port, no drops on this particular machine [16:58:29] ottomata: I pointed out the similarity of these problems with udp2log monitoring a bit earlier [16:58:41] aye ja, saw that, [16:58:58] i think the udp2log /proc/net/udp ganglia plugin right now should be genericized [16:59:03] perfect [16:59:08] it is written to aggregate multiple udp2log instances [16:59:19] i think it should be pluggable to just give the port you want to monitor [16:59:35] Anyone here that can help me figure out how to connect to Bastion with PuTTY in Windows? [17:00:27] so monitoring /proc/net/udp on both senders & receivers, vhtcpd stats and network loss should be enough [17:00:37] bblack: you were saying about overengineering? :) [17:00:53] T13: https://wikitech.wikimedia.org/wiki/Help:Putty [17:03:19] paravoid's plan would be great for the packets lost on the wire aspect [17:03:23] Ah-ha... I only made a 1024 key instead of a 2048... that looks like the exact documentation I needed bd808 , thanks. [17:04:05] bd808: I should mention, there are others that will have an opinion about this and much more experience with past failures than I, you should definitely raise it on the list [17:04:11] mark & asher for sure [17:04:18] either ops@ or wikitech@ [17:04:27] bawolff reminded me in side chat that there are other things we want to get a handle on. Apparently there have been php issues in the past that kept packets from being sent at all [17:05:37] the php side borks may have been the root of the idea to poll Age headers [17:05:40] (wikitech@ being completely public) [17:06:25] Age has multiple other issues too [17:06:33] our caching is tiered for one [17:08:08] true, but the thumbs bawolff is interested in are overwrites that should have gone all the way back to the bottom [17:09:17] great discussion. I'll add stuff to the wiki and try to get a discussion going on ops@ (as soon as I'm moderated in) [17:09:31] * bd808 has standup to attend [17:10:46] * Coren screams in frustration! [17:14:17] * YuviPanda gives Coren ice cream [17:14:18] there there [17:14:22] uwsgi driving you nuts? [17:14:35] wsgi generally. It is a blight upon this world. [17:14:51] I see [17:15:24] at least better than having to do things with mod_wsgi [17:15:45] I think I shall abandon it entirely in favor of fastcgi. All the toolkits support it anyways. [17:16:07] And at least fastcgi has sane, well-defined semantics. [17:16:22] :'( [17:16:57] * Coren points out that, if you really need wsgi, you can always set your fastcgi to simply invoke uwsgi with a fastcgi socket.
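(Editorial aside: a hedged Python sketch of reading the kernel-side UDP counters bblack and paravoid quote above from /proc/net/snmp; the paired "Udp:" header/value lines match the paste, the rest is an assumption about how a monitor might consume them.)

```python
# Sketch: /proc/net/snmp carries two matching "Udp:" lines, a header row and
# a value row; zip them into a dict and watch InErrors / RcvbufErrors grow.
def read_udp_counters(path='/proc/net/snmp'):
    with open(path) as f:
        rows = [line.split()[1:] for line in f if line.startswith('Udp:')]
    header, values = rows
    return dict(zip(header, map(int, values)))

counters = read_udp_counters()
# Counters are cumulative since boot (the "no way to reset those" problem),
# so sample twice and diff to get a drop *rate* rather than a total.
print(counters['InDatagrams'], counters['InErrors'], counters['RcvbufErrors'])
```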
[17:18:06] fastcgi -> wsgi -> app has more points of failure than wsgi -> app [17:18:18] Coren: what exactly is the problem, if there are specific things? [17:18:21] with just setting it up [17:19:50] Coren: also see: http://www.peterbe.com/plog/fcgi-vs-gunicorn-vs-uwsgi/ [17:20:06] There are a number of extraordinarily infelicitous things in the whole way this is created. While you can have an uwsgi that starts multiple apps, they each have their own socket and no clean way to dispatch between them. Unless you modify the app to register with /another/ uwsgi that fetches stuff from its cache. [17:21:05] 'dispatch between them'? Each of them has their own socket and we've to 'route' between them somehow, no? [17:22:34] Yes. And don't give me fastrouter -- you either have to hardcode the apps in it or set up a registrar which is complicated for the end user, completely insecure /and/ brittle. [17:23:06] how would fastcgi be any different? [17:23:14] How many Python web thingies are we talking about, BTW? [17:23:24] scfc_de: It's the fad-of-the-moment [17:23:24] (= affected users) [17:23:38] python is fad of the moment? :) [17:24:36] YuviPanda: Fastcgi is simple and straightforward. I simply add a spawn-fcgi equivalent that sends stuff out to the grid and bam! Instant fastcgi support. [17:24:49] YuviPanda: Yes. [17:25:04] now you're just talking like petan, but nevermind. [17:25:09] :-) [17:25:28] (perhaps related, I just came from seeing http://worrydream.com/dbx/ :) ) [17:27:09] scfc_de: do you have time to help me set up tools-redis? Would be nice to get that done before wikimania :) [17:27:13] and should be rather trivial too [17:27:53] * sumanah jumps on fad of the moment, sells novelty t-shirts [17:28:18] YuviPanda: What do you need? [17:28:27] Oh, privileges! :-) [17:28:37] scfc_de: 1. delete current tools-redis (which, uh, idk what that is :P) 2. create instance, 3. apply puppet class 4. bam! [17:29:33] scfc_de: I know that current tools-redis is pretty much a borked instance that does nothing. if such an instance exists at all [17:31:48] YuviPanda: It exists, but it was created on July 17th without any SAL entry. I don't see anything important, so ... 3, 2, 1 ... [17:32:09] scfc_de: I believe that was petan trying to create it and then got distracted. or something. [17:32:11] kill it! [17:32:34] "Deleted instance i-0000081f (tools-redis)." [17:32:48] \o/ [17:33:03] "ACTIVE (deleting)", so might take a moment. [17:33:17] yeah, I think I had some trouble with that a while ago [17:33:22] when experimenting with toolsbeta-redis [17:33:23] And it's gone. [17:33:32] scfc_de: also can we get an instance bigger than small? [17:33:40] let me look at the list [17:33:43] Instance type? [17:33:46] Ah, okay. [17:34:02] I assume you see the exact same UI for toolsbeta? [17:34:32] scfc_de: yeah [17:34:33] moment [17:34:39] it logged me out, looking for phone [17:36:19] scfc_de: m1.medium? Even m1.large if I can sneak that up :P [17:36:25] * T13 is still having a hard time with PuTTY getting into bastion to get to bots-labs [17:36:30] at least medium, perhaps large [17:36:37] Coren, petan, YuviPanda, whoever else has root on toolsbeta: BTW, "ssh_hba" must be set to "yes" in the "Configure instance" for all instances. [17:37:07] YuviPanda: For me, it's just a click, I think Ryan_Lane is keeping the score :-). [17:37:32] I've created my keys, got pageant running and set up...
[17:37:52] followed all of the directions on https://wikitech.wikimedia.org/wiki/Help:Putty and https://wikitech.wikimedia.org/wiki/User:Wikinaut/Help:Access_to_instances_with_PuTTY_and_WinSCP#cite_note-1 [17:37:55] YuviPanda: Security groups default *and* redis? [17:38:12] security group default, and then add redis role (and NFS role) after creation [17:38:17] that's what Coren told me when I was creating it on toolsbeta [17:38:40] Can't security groups not only be added when creating, i. e. *not* afterwards? [17:38:54] YuviPanda: You're confusing puppet roles with security groups [17:39:03] Security groups *must* be set at instance creation. [17:39:10] yeah, but does redis have a security group? [17:39:13] I don't see one in toolsbeta [17:39:20] There should be one if there isn't. [17:39:29] scfc_de: do you see one on tools? [17:39:49] Coren, YuviPanda : There is one ("redis"). Should I enable this *and* default, or only "redis"? [17:39:54] I'm down to the "Type “ssh ”. " bullet on https://wikitech.wikimedia.org/wiki/Help:Putty and have entered "ssh bots-labs" into the text field labeled “Remote command:” in the Connection → SSH category [17:40:00] Always add default. [17:40:04] k [17:40:06] but still no go. [17:40:25] Prompts me for "login as:" [17:40:34] Technical_13: I don't think there is an instance named 'bots-labs', and the bots project is obsolete and shouldn't be used. [17:40:48] What should the instance be? [17:40:59] YuviPanda: "Created instance i-00000876 with image "ubuntu-12.04-precise" and hostname i-00000876.pmtpa.wmflabs." [17:41:00] https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000816 [17:41:06] Technical_13: ... it depends on what you're trying to do. [17:41:19] Ah, so there was an instance named bots-labs. [17:41:29] petan added me to wm-bot so I can restart it when it is netsplit. [17:41:31] scfc_de: \o/. now add NFS and redis roles. Do check every step with Coren, though :) [17:41:36] Ah. [17:41:42] Did he add you to the project itself? [17:42:02] we're actually running into space issues [17:42:02] Coren: Petrb added you to project Nova Resource:Bots 08:11 [17:42:09] is what it says in my notifications [17:42:36] mostly thanks to nova's insistence on using raw disk images as the base images for the /mnt images [17:42:43] so each host is eating 350G of wasted space [17:42:46] scfc_de: did you do a large? [17:42:47] Then it should work. Mind you, I'm not putty expect. I suppose "login as:" is because you haven't set a default username? In which case you should be able to just type it there. [17:43:15] expert* [17:43:15] Shouldn't that just be my wikitech username? [17:43:26] Technical_13: No. Your shell username. [17:43:40] ahh.. that might be it... let me try that. [17:43:50] YuviPanda, Ryan_Lane: Yes. Should I use a smaller one? [17:43:55] scfc_de: no no I think it is fine :D [17:44:24] Coren: That did something but it flashed by so quick I couldn't read it all... [17:44:29] Ryan_Lane: So the number of instances is the problem, not their configuration? [17:44:48] I accepted the connection for bastion, I should be able to get to bots-labs now... I hope. [17:45:05] Coren: NFS = enable role::labsnfs::client, run puppetd, reboot? [17:45:14] scfc_de: Basically. [17:45:21] scfc_de: I'm going to limit Redis to 7G, let me put up a patchset [17:45:42] (currently limited to 1G) [17:45:44] YuviPanda: But I can proceed for now? [17:45:45] lol. does !monkey refer to the admins or to the one reporting problems? [17:45:50] scfc_de: yeah! 
[17:46:49] !monkey del [17:46:49] Successfully removed monkey [17:47:03] there :) [17:47:13] *Argl*, tools-redis's key changed due to the delete/create. [17:47:37] Aha! Eureka! [17:47:39] Coren: scfc_de https://gerrit.wikimedia.org/r/77152 [17:47:57] !smart [17:48:06] :( [17:48:20] tools-redis [17:48:20] load_one: down [17:48:20] Last heartbeat 911s ago [17:48:55] Ryan_Lane: Is the memory size a variable that Puppet can access? (Cf. YuviPanda's https://gerrit.wikimedia.org/r/77152.) [17:49:11] hmm, I think there was something that let you configure that from the wikitech interface [17:49:18] I remember vaguely reading docs about that [17:49:18] let me look [17:49:31] beta labs is breaking my heart this week [17:51:30] scfc_de: yes [17:51:34] scfc_de: via facter [17:51:42] scfc_de: run: facter [17:51:54] those variables are accessible in puppet as global variables [17:51:56] hmm, so I can find memory size via facter and then just do a '- 1G'? [17:52:08] I don't want to limit redis to 8G on a 8G machine, that sounds not so great [17:52:39] YuviPanda: yep [17:52:49] I'll probably just add an 'input field' [17:52:49] YuviPanda: or even better do a percentage ;) [17:53:19] Ryan_Lane: input field in wikitech vs percentage? I'm leaning towards input field [17:53:32] ah, you mean a variable? [17:53:37] a variable is fine too [17:53:38] yeah [17:54:33] YuviPanda: It looks as if Puppet can do some arithmetics: http://docs.puppetlabs.com/guides/language_guide.html, but I don't know whether it allows that for variables/parameters as well. [17:54:44] scfc_de: oh, it definitely does variables [17:56:01] YuviPanda: Even though -- are facters strings or numbers? [17:56:18] they're strings, but puppet is really secretly ruby, so.. .:) [17:56:23] i'm just going to do a variable [17:56:34] scfc_de: you see the inputboxes in the wikitech page for adding pupet roles, right? [17:56:46] YuviPanda: Probably better for hosts that serve multiple roles as well. [17:56:49] yeah! [17:56:52] doin it, moment [17:57:02] YuviPanda: Yes. BTW, -redis is puppeted and rebooted and now ready. [17:57:13] scfc_de: ooo, nice! \o/ [17:57:16] let me just fix the class [17:58:51] i HATE json -.- guasjhdkjasd [17:59:31] wrong channel... -.- [18:01:18] Ryan_Lane: andrewbogott is it 'class toollabs::redis ( $var = "default" ) inherits toollabs' or 'class toollabs::redis inherits toollabs ( $var = "default" )'? [18:01:25] the former, I think. but i couldn't find too many examples [18:01:55] I think the former too, although that's ugly. [18:02:00] yeah [18:02:08] let me just submit and see what jenkins tells me :P [18:06:53] YuviPanda: Redis has absolutely no AAA right? [18:06:59] Coren: AAA? [18:07:03] auth? [18:07:14] authentication, authorization and accounting [18:07:15] Coren: you can specify a 'password' if you want, but it's over cleartext and mentioned in the config file [18:07:36] it does have forms of 'accounting' where you can figure out which keys are taking up how much space, but we've it disabled. [18:07:48] * Coren ponders. [18:08:01] So, not usable for a registry. [18:08:16] Coren: are you thinking of using it as a registry for mapping urls to services running on the grid? [18:08:38] Yeah, amongst other things. 
I'm going to default back to /data/project/.system [18:08:53] Coren: that's what hipache does :D but yeah, you've to treat redis as transient [18:09:14] because it is trivial to make it so - spam it with milllions and billions of entries and your old keys will get dropped when it runs out of RAM [18:09:23] it should never be a single point of knowledge in an open environment [18:10:38] YuviPanda: But you are patching the basic redis class? I think a variable/parameter might be helpful there as well. [18:11:03] scfc_de: the 'base' redis class already has a parameter for it. It's just not used in toollabs::redis [18:14:51] YuviPanda: Ah. [18:15:33] scfc_de: I'm just going to create redis::small and redis::large [18:15:33] as roles [18:15:53] Coren: I seem to be connecting, but putty is closing out instead of leaving me a command prompt... any putty people you can suggest? [18:16:22] Hm, I think petan uses Windows primarily, I expect he has good puty-fu [18:16:49] Coren: so, we probably want to do something about the disk usage before wikimania :) [18:17:00] or during [18:17:03] Ryan_Lane: _before_? It's that bad? [18:17:09] most hosts are at 85% or more [18:17:13] Ryan_Lane: DevCamp seems like a good opportunity. [18:17:16] some are at 95% [18:17:23] Ow [18:17:29] it's due to the hosts using 160G ephemeral disks [18:17:40] because they have a base disk that's 160G [18:18:06] we probably need to patch the code to use qcow2 base disks for ephemeral [18:18:18] we'll need to shutdown instances using the raw base [18:18:32] and move them to the new base [18:18:37] then delete the base [18:19:32] * YuviPanda makes all your base joke [18:20:08] YuviPanda: You can't because they belong to us, not you. :-) [18:20:38] hey, stereotypically speaking I'm better at broken english! [18:24:12] scfc_de: can you hop on to -operations? [18:48:39] ok, good. I asked vishy about adding a config option for ephemeral disk types [18:48:45] he said it would be fine [18:49:13] I'm going to change the hardcoded value in production (via puppet) and push a change in for havana [18:49:45] then we can modify current images and save about 320G per host [18:50:26] we need to look at expanding the cluster soon, though [18:51:23] anyway, this live hack needs to happen now :) [18:51:24] * Ryan_Lane does it [19:19:47] is a 320G instance really using the 320 G or is that dynamically allocated? [19:19:54] (same question for memory / CPU ) :D [19:20:26] i think on most of beta instances we don't need /dev/vdb at all [19:20:38] and some could get less memory allocated probably [19:25:02] hashar: My biggest problem is that most of my instances have local disk space they have no use for because it was the only way to get the ram/CPUs. [19:25:16] yeah that is my concern [19:25:24] Ryan_Lane: Can't we provide a couple of (more ram and CPUs w/ no disk) options? [19:25:37] hashar: please don't create a 320G instance right now [19:25:40] we also created 2 16GB memcached instances which then comes with ton of disk/cpu [19:25:43] unless you want to break things [19:26:05] I guess I just pushed in that qcow2 change [19:26:27] so maybe it won't actually cause issues [19:26:32] maybe wait an hour or so :) [19:30:28] Ideally, RAM, CPU and disk space would be separate options in the UI. [19:30:35] &ping [19:30:36] Pinging all local filesystems, hold on [19:30:37] Written and deleted 4 bytes on /tmp in 00:00:00.0002540 [19:31:29] Written and deleted 4 bytes on /data/project in 00:00:52.7516270 [19:31:30] NFS seems to be hanging again. 
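(Editorial aside: wm-bot's &ping output above suggests a probe along these lines; a rough, assumed reconstruction in Python, timing a tiny write-and-delete on each filesystem to surface NFS stalls like the 52-second one just logged. The 4-byte payload and the path list mirror the bot output; everything else is guesswork.)

```python
# Hedged sketch of an &ping-style filesystem latency probe.
import os
import time
import uuid

def probe(mountpoint, payload=b'ping'):          # 4 bytes, like the bot writes
    path = os.path.join(mountpoint, '.ping-%s' % uuid.uuid4().hex)
    start = time.time()
    with open(path, 'wb') as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())   # make sure the write really hits the server
    os.unlink(path)
    return time.time() - start

for fs in ('/tmp', '/data/project'):
    print('Written and deleted 4 bytes on %s in %.7f s' % (fs, probe(fs)))
```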
BTW, Coren, the warnings from the SGE accounting file copying cron job indicate that the failures exceed five minutes. [19:32:22] scfc_de: Only looks like a bit over 2 to me: http://ganglia.wikimedia.org/latest/graph.php?r=1hr&z=xlarge&h=labstore3.pmtpa.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Labs+NFS+cluster+pmtpa [19:34:24] Coren: The errors come from a "cp A B && mv B C" script. The failure occurs when two scripts are executed concurrently as then one of the mvs steals the other's source. As the cron calls are spaced out by five minutes, some IO must hang for more than this time period. [19:35:15] The file is pretty big, I've no doubt that a 2-3 minute stall could make the run end up lasting over 5 minutes. [19:35:21] Also, why not rsync? [19:36:16] Coren: scfc_de can I get added to the redis security group? [19:36:38] Coren: 200 MByte? rsync would be an option, but I don't know how that changes anything. [19:36:52] Coren: I'll look up some times when the script took more than five minutes, moment. [19:36:53] scfc_de: It will actually grow the file rather than copy it. :-) [19:37:37] Coren: I think rsync creates a temporary copy of the target file, then syncs that, then renames the temp to target. So it would probably increase the disk IO :-). [19:38:28] scfc_de: You want --inplace and --append-verify [19:40:35] Coren: But then you might end up with an incomplete file as a source for the other copy job. "cp && mv" (or rsync without --inplace) is atomic. [19:41:01] Besides, rsync would have to read the target file, so there shouldn't be *less* disk IO?! [19:41:07] Which is not an issue since the file is only ever appended to, and --append-verify watches for mismatch. [19:41:19] scfc_de: Reads are cacheable. [19:41:36] scfc_de: Trust me. You want rsync --append-verify --inplace for this. :-) [19:42:19] You might even get away with just --append --inplace given that it's a log file. [19:42:30] And that'd be even faster. [19:42:33] YuviPanda, I can actually pay attention to your puppet stuff now, if you haven't already sorted everything. [19:42:37] scfc_de: can you ssh into tools-redis, do 'redis-cli' and see if that works? [19:42:43] andrewbogott: ottomata helped us get that merged :D [19:42:48] 'k [19:42:53] Did he also fix the docs? :p [19:43:00] andrewbogott: I am not sure :D [19:43:25] Coren: I *want* reliable disk IO :-), rsync instead of cp && mv is a crutch :-). [19:43:52] No, it's the efficient way to remotely sync two files, which is what you want. [19:44:22] rsync without --inplace even does cp-mv more safely. [19:44:47] Coren: Tuesday: 10:12:33, 20:38:05, 20:57:11, 23:22:35 [19:44:49] But given that we know it's a file that only grows, --inplace and --append makes it even better. [19:44:54] Coren: yesterday: 18:22:16 [19:44:59] Coren: today: 02:07:24, 05:32:40 [19:45:50] Coren: Maybe, but for me that falls under microoptimization :-). Copying 200 MByte should be a no-brainer. [19:46:24] YuviPanda: "redis 127.0.0.1:6379>" [19:46:45] hmm? [19:46:52] aaah [19:46:53] prompt [19:46:54] nice :D [19:46:58] scfc_de: works then :) [19:47:19] scfc_de: \o/ ty :D [19:47:25] scfc_de: shall I write an email to labs-l? [19:47:39] YuviPanda: Perfect. Re tools-mc, I'll set redis_something to "1GB", then "puppetd -tv", then remove it, then Puppet again to see what happens. [19:47:45] scfc_de: yeah! [19:47:56] scfc_de: puppetd -tv shouldn't *restart* redis, so it should be fine [19:48:37] YuviPanda: Even if the configuration file changes? [19:48:41] scfc_de: yup!
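(Editorial aside: a sketch in Python of the "cp A B && mv B C" pattern under discussion, with a per-process temporary name that avoids the race scfc_de describes, where two concurrent runs share B and one mv steals the other's source. Paths are illustrative; the real cron job may differ.)

```python
# Copy to a unique temp file, then rename: readers of dst never see a
# partial file, and os.rename() is atomic within one filesystem.
import os
import shutil
import tempfile

def atomic_copy(src, dst):
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dst))
    os.close(fd)
    shutil.copy(src, tmp)   # the slow part; an NFS stall happens here
    os.rename(tmp, dst)     # atomic replace, no window for a half-copy

atomic_copy('/data/project/.system/accounting',
            '/data/project/.system/accounting.copy')
```

(Coren's rsync --inplace --append variant avoids rewriting the whole file each run by only transferring the newly appended tail.)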
[19:48:43] andrewbogott: what docs? [19:49:09] YuviPanda: Okay. tools-mc is the backbone for grrrit-wm ATM? [19:49:28] scfc_de: yeah, but that's trivial to change. addshore is also using it atm, I think [19:49:29] ottomata, not sure if it's stuff you wrote or I wrote, but Yuvi was confused because this page uses the word 'parameter' but isn't referring to a class parameter but, rather, a global. https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [19:49:45] ah yes [19:49:48] danke i did write that [19:49:52] fixing [19:50:29] oh, there are other places that it calls those parameters too :) [19:50:31] fixing [19:52:03] YuviPanda: Set to "1GB", run Puppet, no changes. Now for "" ... [19:52:10] heh :D [19:52:18] ottomata, thanks [19:53:18] scfc_de: I'll set 6th of September to kill tools-mc [19:53:48] YuviPanda: And again: No changes. However I don't feel that's "as designed", but more "accidental". If you can, I think you should pursue ottomata's proposal of a proper default. [19:54:03] hmm, that still feels a bit... weird [19:54:29] ja so, that is kinda a stupid puppet thing [19:54:54] your class specifically sets the maxmemory parameter on the class [19:55:04] so, it will override whatever the default is that the class sets [19:55:19] the defaults on the class only apply when you omit passing the parameter when including the class [19:55:56] is there anyone that can help me with the dewikivoyage install on beta? [19:56:31] !log tools Added redis_maxmemory to wikitech Puppet variables [19:56:33] Logged the message, Master [19:57:15] !log tools Created new instance tools-redis with redis_maxmemory = "7GB" [19:57:16] Logged the message, Master [19:57:38] scfc_de: emailed! now fixing docs [19:58:23] BBL [19:58:44] scfc_de: ty for the help :) [20:06:28] the web server is unresponsive at times. is this the non-nfs problem again Coren? [20:07:02] JohannesK_WMDE: Yes. I'm going to be switching hardware at the return from Wikimania. [20:09:01] so it's broken, but it's going to be fixed. after the important gathering. somehow this reminds me of the toolserver... [20:19:12] JohannesK_WMDE: The alternative, possibly completely breaking a working system just before said large gathering, seems less advisable. [20:20:46] Spikes, spikes. Nothing but spikes. :( [20:53:16] This time of day really really sucks for NFS [21:11:07] Coren: well, we could try switching to labstore4 ;) [21:11:11] to rule out hardware [21:11:34] I note that there is a 12.10.6 A12 firmware out, marked 'urgent' [21:11:56] http://www.dell.com/support/drivers/us/en/19/DriverDetails?driverId=HT64R# [21:13:40] Also for H700: http://www.dell.com/support/drivers/us/en/19/DriverDetails?driverId=C3X7D [21:16:27] well, we could upgrade the firmware on one and try it [21:16:37] hm [21:17:07] "Release 12.10.5-001 has been pulled from the website as it did not include several important changes. Please load firmware version 12.10.6-0001 to include those changes." [21:17:10] you suck dell [21:17:19] that's *not* a list of fixes and enhancements [21:19:10] "Fixes an issue where the Perc battery doesn't always charge up properly, causing VDs changing from WB to WT." [21:19:11] :D [21:19:28] "Removes system events that relate to partial Patrol completion on physical drives." [21:19:44] "Corrects a potential scenario where a drive may repeatedly fail during i/o." [21:20:07] Yeah, their change log is... not the best I've seen.
:-) [21:20:30] you're sugar coating that :) [21:20:44] "Fixes a problem where sometimes something may go wrong." [21:20:49] :D [21:21:21] Coren: what did we do this month? [21:21:24] I can never remember [21:21:29] I should keep notes on this [21:21:36] we should make a monthly etherpad for it [21:22:10] Ryan_Lane: Not that much, actually. At least I didn't. Lots of docs (kma500's work for the most part), some fire fighting. I wrestled with WSGI and tried to beat some sense into NFS mostly. [21:22:19] oh, right, doc sprint [21:22:23] * Ryan_Lane adds that [21:22:46] ability to add service groups to service groups [21:22:49] A lot of user requests got hammered out; this really was the month where we've had the most newbies on tool labs and they brought lots of new use cases. [21:22:54] oh, that got merged? [21:23:07] it didn't get merged? [21:23:12] I thought it was merged the next day [21:23:14] I... don't know? [21:23:25] YuviPanda: Yeah; Ryan still doesn't like it but it got merged since there are valid use cases. [21:23:31] ah, okay [21:23:38] an email to labs-l, perhaps? :) [21:23:40] Coren: please update tools stats: https://www.mediawiki.org/wiki/Wikimedia_engineering_report/2013/July [21:24:29] For that matter, I don't like it all that much either, but it's clearly useful in some scenarios. [21:24:41] heh [21:24:44] it's complicated [21:24:53] and it makes external reuse of the groups more difficult [21:25:10] but not impossible, depending on how things are looked up [21:27:09] * Coren needs to leave, his SO is taking him out for dinner. Yum. Japanese. [21:27:51] * Ryan_Lane waves [21:27:54] Bon appetit! [21:28:01] Coren: please update that page when you get a chance [21:28:26] Ryan_Lane: I just bumped the numbers; the prose will wait until my return. :-) [21:28:31] ah [21:28:32] cool [21:28:36] thansk [21:40:41] Ryan_Lane: what powers instanceproxy now, btw? [21:40:49] nginx [21:40:59] aaah! just a static configuration? [21:41:24] Ryan_Lane: btw, re: hipache, http://blog.dotcloud.com/under-the-hood-dotcloud-http-routing-layer [21:41:42] Ryan_Lane: they apparently moved from hipache to nginx + lua after nginx got websocket support, and saw a 10x increase in performance :) [21:42:07] and that code is open too. I'll try to adapt that instead of the nodejs hipache, since we already have nginx in our infra [21:42:20] heh [21:42:24] that sounds good [21:42:43] we still need to enable http 1.1 for backend in production too [21:42:49] haven't had a really strong reason to [21:43:44] Ryan_Lane: are there any issues with instanceproxy as such? I guess hipache-nginx would be very useful for running services on the grid, but other than that? [21:43:50] would replacing instanceproxy be useful? [21:44:20] well, the issue is that it requires the use of subdomains of instance-proxy [21:44:29] it would be nice for it to use any dns name [21:44:40] ah, right. [21:44:54] but that'd need some configuration interface [21:45:00] should be easy enough to add it to wikitech, perhaps [21:45:13] yep [21:45:44] wikitech could talk directly to redis at first [21:45:46] Ryan_Lane: I think I'll make this my 'official' Wikimania devcamp project :) Should I test / run this on its own project, or on an instance in toolsbeta? 
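(Editorial aside: a hedged Python sketch of the Redis-backed routing table that hipache, and the nginx+lua port linked above, consult: one list per frontend hostname holding an identifier followed by backend URLs. The key layout follows hipache's published convention; the hostname, address, and port here are invented.)

```python
# Register a backend for a hostname, then do the lookup a dynamic proxy
# would perform per request (the first list element is just an identifier).
import redis

r = redis.StrictRedis(host='localhost', port=6379)
r.rpush('frontend:mytool.wmflabs.org', 'mytool')
r.rpush('frontend:mytool.wmflabs.org', 'http://10.4.0.42:8000')

backends = r.lrange('frontend:mytool.wmflabs.org', 1, -1)
print(backends)   # pick one (round-robin / random) and proxy the request to it
```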
[21:45:48] &ping [21:45:48] Pinging all local filesystems, hold on [21:45:49] Written and deleted 4 bytes on /tmp in 00:00:00.0002840 [21:45:51] and later we could add an API that supports keystone [21:45:57] Ryan_Lane: yeah, that's fine for now [21:46:04] let's make it a projext [21:46:34] *project [21:47:01] since it will eventually be used for all projects [21:47:07] true [21:47:17] actually, we can use the instance-proxy project for it [21:47:24] sounds good enough [21:47:25] add me? [21:47:29] sure [21:47:38] don't mess with the current instance, of course ;) [21:47:43] indeed :) [21:47:47] create a new instance! [21:47:54] Ryan_Lane: would also need a public IP [21:47:55] Written and deleted 4 bytes on /data/project in 00:02:06.5017100 [21:48:08] ah, it's the project-proxy project [21:48:37] Ryan_Lane: btw, if the thing is configurable only from wikitech, it can't be used for tools (since they can't guarantee a host/port combination), but I guess we can solve that later. [21:49:21] YuviPanda: we can modify the config automatically when tools are added [21:49:41] oh, wait. I see what you mean [21:49:58] yeah, grid engine can move them around across restarts or whatever. [21:50:31] we can add it as an option to jsub, of course :) But might require some form of auth [21:52:14] yeah [21:52:20] OAuth to wikitech [21:52:22] when available [21:52:34] keystone makes this a pain in the ass [21:53:25] YuviPanda: added you to the project [21:53:30] let me give it another IP [21:53:31] Ryan_Lane: sweeet! [21:54:09] allocated [21:54:16] didn't even need to increase quota [21:54:28] heh [21:54:54] you should make a security group for redis before you create an instance [21:55:02] so that you can limit redis access [21:55:11] security group for redis where? [21:55:14] in general? [21:55:18] in that project [21:55:20] ah [21:55:26] redis should be local [21:55:28] to that instance [21:55:29] then when you create the instance add the redis security group as well as default [21:55:41] YuviPanda: not if wikitech needs to talk to it ;) [21:55:46] aaah [21:55:48] righto [21:56:25] heh, project project-proxy [21:57:05] heh [21:57:14] yeah, it's very meta [21:57:23] YuviPanda: look at your groups, too [21:57:49] 50254(project-project-proxy) [21:57:59] hehe :D [21:58:21] mosh means my ssh connections live loooong [21:58:29] :D [21:58:49] Ryan_Lane: so I need to create a security group called 'redis', which makes the redis port available to... which range? [21:58:58] none right now [21:59:15] but later you'll add /32 cidrs for specific instances [21:59:15] ah, but eventually? if I Just want wikitech to talk to it? [21:59:30] hmm, so right now it's just an 'empty' group [22:00:45] yep [22:00:48] that's fine [22:01:00] but you can't add a group to an instance after it's already created [22:01:08] yeah [22:01:12] should I add proxy too? [22:01:13] an annoying openstack (and ec2) limitation [22:01:24] there's a proxy security group? [22:01:26] yeah [22:01:40] it's got some... interesting configuration in it [22:01:43] like port ranges from -1 to -1 [22:02:56] oh [22:02:56] right [22:02:58] don't add that [22:03:22] alright [22:03:30] I don't know why a rule like that exists, it's weird [22:03:42] created! [22:03:46] oh [22:03:47] wait [22:03:55] default has -1 to 1 for icmp [22:03:56] which is normal [22:04:02] proxy just has 80 [22:04:02] oh [22:04:08] ah, misread. 
[22:04:14] it should also have 443 [22:04:20] but it doesn't [22:04:24] * Ryan_Lane adds [22:04:31] so, yeah, probably want to add that rule to the instance too [22:04:36] err [22:04:37] group [22:04:44] &ping [22:04:44] Pinging all local filesystems, hold on [22:04:45] Written and deleted 4 bytes on /tmp in 00:00:00.0002140 [22:04:56] Written and deleted 4 bytes on /data/project in 00:00:11.3916360 [22:08:56] Dang you scfc_de [22:09:09] I thought petan was here.. [22:10:46] T13: wm-bota is very trusting of strangers :-). [22:11:23] &whoami [22:11:24] You are unknown to me :) [22:11:28] ... [22:12:01] I got a cloak change and am waiting for petan to update it for root. [22:12:53] I've moved up from Wikipedia to Wikimedia [22:13:53] T13: Well, you have only a few hours (or a fortnight) left :-). [22:14:39] I know... [22:15:26] YuviPanda: http://www.enovance.com/fr/blog/5858/role-delegation-in-keystone-trusts [22:16:41] oauth style delegation is also being added [22:17:04] so, it should be possible for a tools service to modify the hipache config via a keystone-enabled API [22:17:33] but only after we upgrade and support this :D [22:17:43] yeah, 'eventually' :) [22:17:56] looks like oauth will land in havana [22:17:58] in 3 months [22:18:09] it hopefully is passwordless, since the only way I see this working is if we write that into a jsub type wrapper [22:18:24] then uwsgi, rack, node - everything becomes simple to support, since we no longer have to deal with goddamn apache :D [22:18:35] just put it on the grid, put it on a port and you're off! [22:18:46] https://gist.github.com/termie/5225817 [22:19:03] whatthefuck? [22:19:44] basically this adds oauth to keystone [22:19:50] which is what they should have just done to begin with [22:19:54] <3 termie [22:20:20] heh! [22:20:38] the pie stuff threw me off in the middle, since i've no idea what they're talking about [22:21:12] At memory location 0x4AC01F... [22:21:15] that was a good one :) [22:21:25] he's just being termie [22:21:43] you should meet this guy, you'd love him. he's sarcastic and a little crazy [22:22:34] best combination, I'd think [22:22:55] "Are you ready? Are you primed? ARE YOU PUMPED!?" [22:22:56] :D [22:23:00] Ryan_Lane: I'd have moved to SF in December, except I managed to fail Software Engineering for the third year running, and now have to wait for more months [22:23:07] uuugggghhhh [22:23:23] They could've at least failed me in a real subject! but nooooo [22:23:26] YuviPanda: dude, I'm revoking your commit access, you obviously don't know how to make software [22:23:27] :D [22:23:33] :D [22:23:44] clearly! [22:26:01] Ryan_Lane: can we have Ubuntu 13.04? a lot of the nginx packages are a lot more up to date there.... [22:26:02] I guess not :( [22:26:45] it would be better to backport nginx [22:27:01] or add the backports repo [22:27:02] and luajit, and the redis-lua packages? [22:27:10] i am not sure if they're on the backports repo [22:27:11] * YuviPanda checks [22:28:05] hmm, would those be in raring backports or precise backports? [22:28:07] YuviPanda, failed Software Engineering? Is that something to do with a university or..? [22:28:20] Krenair: yeah, at my university. [22:28:29] Krenair: I've failed it for 3 years now, having written it 3 times [22:28:45] Krenair: I even gritted my teeth and drew diagrams about 'waterfall'! [22:28:56] heh [22:29:15] * YuviPanda repeats to self 'Specs must be completed and approved by all key stakeholders before any programming can begin!
Any changes must go through the requirements change management committee!' [22:29:30] Ryan_Lane: using PPAs is right out, I suppose? [22:30:33] :D [22:30:57] YuviPanda: PPAs are frowned upon, even in labs [22:31:05] the backport repos should have what you need [22:31:12] http://packages.ubuntu.com/precise-backports/allpackages [22:31:14] we eventually want to backport nginx anyway [22:31:21] hasn't got anything [22:31:35] neither nginx, nor lua. [22:31:39] bleh [22:31:53] well, I'd say backport to a local repo [22:32:02] and we'll look into backporting into the normal repo [22:32:15] I have to see if the logging module is actually needed or not [22:32:23] if so it makes things harder [22:32:54] can't I just use http://wiki.nginx.org/Install#Ubuntu_PPA for a while? isn't 13.10 an LTS? [22:33:04] 13.10 is not [22:33:07] 14.04 will be [22:33:16] bah, that's a lot more months to wait :( [22:33:23] yep [22:33:43] and I don't even have a proper local ubuntu install :| [22:33:47] :D [22:33:49] (vagrant only sort of counts) [22:33:53] you can use the PPA for now [22:33:57] whee [22:33:59] good [22:34:01] it's fine in labs [22:34:11] yeah, this ain't going anywhere near production [22:34:19] of course, it's still a bad idea in general ;) [22:34:32] for long-term [22:34:42] well, 'long term' is definitely past 14.04... :P [22:35:04] heh [22:35:11] yeah, 14.04 will make a lot of this easier [22:35:29] that's like 8 months, though [22:35:31] yeah [22:35:36] but that's 'long term' enough :P [22:35:41] indeed [22:35:56] the newest nginx has all kinds of great stuff :) [22:36:03] websockets!!!!1 [22:36:11] our version of nginx has that [22:36:16] Ryan_Lane: so, do I do puppetmaster self? [22:36:28] or can I just create a puppet repo locally, and just run puppet on itself? [22:36:30] I would, yeah [22:36:37] that's what I do for my linode servers, and it works out okay [22:36:45] 'I would' for what? self or? [22:36:59] puppetmaster::self [22:37:04] you want this to be in a module [22:37:09] right [22:37:12] and in the puppet repo [22:37:13] eventually [22:37:23] right, but can live in its own git repo in the meanwhile, I guess [22:37:28] I guess it's possible to run puppet with a module location appended [22:37:35] as long as it's actually a full module [22:37:39] right [22:37:40] then you don't need puppetmaster self [22:37:55] the only thing I'd want from operations/puppet.git is our redis module [22:38:05] I can pick that up easily enough [22:38:08] * Ryan_Lane nods [22:39:04] let me write up a todo [22:40:30] Ryan_Lane: https://wikitech.wikimedia.org/wiki/User:Yuvipanda/Dynamic_http_routing [22:40:52] (Instance proxy-project on project project-proxy) [22:40:53] :D [22:42:45] YuviPanda, why nginx when there's varnish? [22:42:57] MaxSem: does varnish support an embedded language? [22:43:04] VCL [22:43:19] can VCL read from Redis? :D [22:43:33] * YuviPanda clicks [22:44:07] do you want to read from Redis on every HTTP request? that's webscale! [22:44:09] NOT [22:44:10] :P [22:44:31] MaxSem: actually, that isn't that slow if you cache it in memory + use a local redis instance :P [22:45:01] MaxSem: you *do* need something that dynamic to route to any host/port pair without wanting to restart [22:45:20] VCL can be reloaded on the fly [22:45:27] YuviPanda: sweet [22:45:48] MaxSem: I don't think this is a supported use case for Varnish, is it? [22:45:51] MaxSem: it is for nginx... [22:46:00] also why not nginx? we also use that in production...
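"Cache it in memory + use a local redis instance" is the usual answer to the per-request-lookup objection above: keep a short-lived in-process cache so Redis is only consulted when an entry goes stale. A sketch, with the TTL and key layout as illustrative choices (the keys follow the hipache-style layout from the earlier example):

```python
import time
import redis

r = redis.StrictRedis(host='127.0.0.1')  # local instance, per the discussion
_cache = {}   # hostname -> (expires_at, backends)
TTL = 5.0     # seconds a cached route may lag behind a Redis update

def backends_for(hostname):
    """Resolve a hostname to backend URLs, hitting Redis at most once per TTL."""
    now = time.time()
    cached = _cache.get(hostname)
    if cached and cached[0] > now:
        return cached[1]                    # in-memory hit, no Redis hop
    backends = r.lrange('frontend:%s' % hostname, 1, -1)
    _cache[hostname] = (now + TTL, backends)
    return backends
```

The trade-off is exactly the one implied in the exchange: a route change can take up to TTL seconds to propagate, in return for routing decisions that almost never leave process memory.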
[22:46:44] plus I'll probably have to write C to get that to work fine. While I don't mind that, nginx seems good enough.. [22:47:12] you can insert inline C into VCL ;) [22:47:58] too late, MaxSem! :P it's nginx this time! [22:48:09] booring [22:59:11] MaxSem: varnish can't do SSL, so we'd at least need to put nginx in front of it :D [22:59:30] booring, as I already said:P [22:59:41] it would actually be possible to have a daemon that injects config into varnish, then have it switch when it's changed [22:59:44] Ryan_Lane: I just realized [22:59:45] Original-Maintainer: Kartik Mistry [22:59:48] for package nginx [22:59:49] he works for us now [22:59:54] he does? [23:00:01] yeah, alolita never sent out the welcome email [23:00:04] hahaha [23:00:07] nice [23:00:23] he's been at language engineering for a month or so now [23:00:33] assuming it's the same Kartik Mistry, of course. But the surname is uncommon enough... [23:00:34] so we can get a package with integrated bash scripting? [23:00:46] I'll just mail him to check [23:01:21] hmm, no arrbee or siebrand around. oh well [23:01:29] Ryan_Lane: perhaps we can convince him to do a backport :P [23:01:35] if it really is him [23:01:40] doing a backport is generally simple [23:01:53] you take the version from a newer ubuntu [23:02:02] and compile it for the LTS [23:02:27] MaxSem: a package with integrated bash scripting? [23:02:46] lua is not webscale!:P [23:02:55] heh [23:03:03] it's on Redis, it definitely must be! [23:03:09] though we use it in both redis and mediawiki [23:03:54] yeah, lua in redis is wonderful! [23:03:56] Ryan_Lane, that explains a lot;) [23:04:58] Ryan_Lane: https://github.com/yuvipanda/project-proxy-proxy-project :) [23:05:04] hahaha [23:05:53] i'm going to write shitty puppet, we can fix that later :P [23:05:58] heh [23:31:14] YuviPanda: updated your page with info on how to integrate with wikitech :) [23:31:43] Ryan_Lane: sweet! [23:31:49] Ryan_Lane: I should put that on gerrit... at some point. [23:32:01] I like this model where I do initial development on github, and then import it to gerrit once it is a little stable [23:32:08] heh [23:32:16] it's not a bad way to go about things [23:32:19] yeah [23:32:35] so I'm live editing in vim on the server, in /etc/puppet. that's a git repo [23:33:29] Ryan_Lane: are you any good with PuTTY? [23:33:45] I haven't used windows in about 4 years [23:33:58] T13: we have docs on putty [23:33:59] !putty [23:33:59] official site: http://www.chiark.greenend.org.uk/~sgtatham/putty/ | how to tunnel - http://oldsite.precedence.co.uk/nc/putty.html [23:34:04] seriously? [23:34:12] * Ryan_Lane grumbles [23:34:15] @search putty [23:34:15] Results (Found 1): putty, [23:34:16] I don't blame you.. I wish I could cut the cord. [23:35:31] !wikitech-putty is https://wikitech.wikimedia.org/wiki/Help:Putty and https://wikitech.wikimedia.org/wiki/Help:Access_to_instances_with_PuTTY_and_WinSCP and https://wikitech.wikimedia.org/wiki/Help:Access_to_ToolLabs_instances_with_PuTTY_and_WinSCP [23:35:32] Key was added [23:35:54] Ryan_Lane: do we have an nginx puppet thingy in our repo? [23:35:55] * YuviPanda looks [23:36:04] I really wish we'd merge those last two pages [23:36:11] it's not really a module... [23:36:22] YuviPanda: we have an nginx module [23:36:28] it may not be what you need, though :) [23:36:33] Ryan_Lane: I've gone through all three of those pages. [23:36:34] Ryan_Lane: yeah, but the templates are all...
elsewhere [23:36:38] modules/nginx [23:36:39] templates/nginx [23:36:41] files/nginx [23:36:43] T13: what issue are you having? [23:36:43] hurr durr [23:37:15] I seem to connect, but as soon as I do, the client closes on me instead of giving me a prompt.
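Circling back to the backport exchange earlier in the log ("you take the version from a newer ubuntu and compile it for the LTS"): that is essentially a no-change backport. A minimal sketch, wrapped in Python for consistency with the other examples; the .dsc URL, source directory, and version suffix are placeholders, it assumes devscripts/dpkg-dev are installed, and exact flags may vary across devscripts versions.

```python
import subprocess

def no_change_backport(dsc_url, src_dir, suffix='~precise1'):
    """Rebuild a newer source package unchanged on the LTS."""
    # Download the .dsc plus tarballs and unpack them
    # (-u skips signature checks, -x runs dpkg-source -x).
    subprocess.check_call(['dget', '-u', '-x', dsc_url])
    # Add a local version suffix so the official package supersedes this
    # build once it eventually reaches the LTS archive.
    subprocess.check_call(['dch', '--local', suffix,
                           'No-change backport for precise'], cwd=src_dir)
    # Build unsigned binary packages against the LTS toolchain and libraries.
    subprocess.check_call(['dpkg-buildpackage', '-us', '-uc', '-b'],
                          cwd=src_dir)
```

The resulting .debs can then go into the local repo mentioned in the conversation, ahead of any backport into the normal repo.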