[06:44:27] PROBLEM - ToolLabs: Puppet failure events on labmon1001 is CRITICAL: CRITICAL: tools.tools-exec-cyberbot.puppetagent.failed_events.value (33.33%)
[07:09:57] RECOVERY - ToolLabs: Puppet failure events on labmon1001 is OK: OK: All targets OK
[11:34:03] something went wrong in the last 5 minutes ... http://tools.wmflabs.org/wikidata-todo/autolist2.php is no longer serviced it says
[12:01:11] Magnus did the trick for me ....
[13:56:00] Wikimedia Labs: dewiki-20140918-pages-meta-history{3,4}.bz2 dumps are missing - https://bugzilla.wikimedia.org/71967#c1 (Sam Reed (reedy)) Missing from where?
[15:00:16] Wikimedia Labs: dewiki-20140918-pages-meta-history{3,4}.bz2 dumps are missing - https://bugzilla.wikimedia.org/71967#c2 (Giftpflanze) /public/dumps/public/dewiki/20140918
[16:16:30] Could someone post to https://wikitech.wikimedia.org/wiki/Analytics/Pagecounts-all-sites and inform them the statistical generation hasn't been performed since Oct. 10 ?
[18:49:45] Wikimedia Labs / Infrastructure: role::labs::lvm::mnt ends up with make-instance-vg: failed to create new partition - https://bugzilla.wikimedia.org/71873#c8 (Andrew Bogott) PATC>RESO/FIX This should now be resolved for new instances. Note that the new partition scheme will make the /mnt partiti...
[19:24:43] hi, i have labs account, and would like to get the shell account activated (on bastion project). can anyone help me with it or should i use the shell request form?
[21:20:15] http://tools.wmflabs.org/wikidata-todo/autolist2.php is dead again
[21:20:41] I am getting so tired about Labs becoming less reliable
[21:26:55] GerardM-: I restarted it. Magnus should *really* set up his .bigbrotherrc to make sure it gets monitored.
[21:27:22] thanks
[21:28:20] it is still dead
[21:28:40] Hm. There's obviously a problem because it ran for some time before going down.
[21:29:48] * Coren watches it.
[21:31:07] It's not obviously running out of memory, it uses ~1.9G of 4 allowed
[21:31:44] GerardM-: And down it goes.
[21:32:07] * Coren checks its logs, etc.
[21:32:54] Ah, yes, that's what happens. It grows quickly in size and crashes when it runs out of memory. :-(
[21:34:38] * Coren increases its limit a little.
[21:35:41] GerardM-: I granted it an extra 2G of memory; but if it grows to that size I'll have to sit down with Magnus and find some way for his webservice to not grow so large.
[21:36:14] coren .. it is an essential tool
[21:36:31] it is used a lot by many people
[21:37:00] as I understand it it can have multiple instances
[21:37:18] Nevertheless, we don't have infinite resources. If it's that critical, perhaps it's time to start thinking about giving it dedicated infrastructure or moving it to prod.
[21:38:17] But also, tools using several gigs of vmem usually need a rewrite because they're doing something wrong - things like reading large unbounded result sets in ram, etc, that have better ways of being done.
[21:38:45] Ram is expensive, database cursors are cheap. :-)
[21:39:22] * Nemo_bis recalls a time when WMF said RAM on Labs was practically infinite
[21:39:55] Nemo_bis: We have a *lot* of it. The issue is we have a *lot* of users, so if all of them use tons of it we'll run out too. :-)
[21:40:19] a lot << practically infinite
[21:40:23] Nemo_bis: Labs is using a bit over 2T of ram.
[21:41:43] Nemo_bis: We still have room to spare; that doesn't mean it's reasonable to waste it.
[21:42:25] Nemo_bis: Any job/tool that grows to well over 4G has a problem.
[21:42:42] I was only commenting about *past* claims :)
[21:43:21] If anyone said "practically infinite", it was hyperbole - though not by all that much.
[21:43:33] Besides, certainly one wonders e.g. what those unused 1.6 TB RAM here https://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[21:43:43] ...are doing/there for
[21:43:48] coren .. Wikidata very much relies on AutoList2 for millions of edits
[21:44:10] Nemo_bis: Right, we're around 60%-ish usage right now.
[21:44:48] GerardM-: Then it needs dev time put into it to improve its reliability. That sounds like a really good topic for an IEG.
[21:45:42] GerardM-: Or to make a case to the WMF or WMDE, say, to allocate a dev to it.
[21:46:25] GerardM-: Either way, just throwing more resources at it is not a tenable option even in the medium term.
[21:48:11] How much memory is it consuming in total?
[21:48:32] how much does 20G of memory cost ?
[21:49:18] * Coren sighs.
[21:50:14] GerardM-: That's not the point. If it can grow over several gigs it is doing something wrong and needs to be fixed, because that will only get worse as time progresses.
[21:51:05] GerardM-: If I had a penny every time there was a tool that takes an entire unbounded result set in ram, I could retire and singlehandedly finance the WMF.
[21:52:08] and I have a need for this functionality and I have no one looking at the code to optimise it
[21:53:17] GerardM-: Then the obviously correct thing to do is to get someone to look at it. In the meantime, I've upped its memory limit by 50%, but - as I've said - that's not sustainable.
[21:53:57] GerardM-: If the tool is that critical then speak to the WMF to either allocate resources or dev time to it. Possibly WMDE can help too.
[21:54:23] I suggest you don't repeat this argument ;)
[21:54:48] Nemo_bis: What else do you expect me to say? I can't summon a dev to magically appear and fix the tool.
[21:54:51] Oh, I didn't know labstore had its own cluster now https://ganglia.wikimedia.org/latest/?c=Labs%20NFS%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[21:55:54] I just mean https://lists.wikimedia.org/pipermail/wikidata-l/2014-October/004739.html
[21:58:16] Anyway, throwing an intern or grantee at a software problem has costs that I quantify in the tens of thousands of $ (considering administration, mentoring etc.). Let's keep that in mind when we try to defend hardware
[21:59:53] «Basically what I'm saying is that based on our resources, we should easily be able to handle the load from the TS community. If it turns out that we can't, we'll add more capacity.»
[22:00:03] «Moore's law and all. It could be that we need to add more servers. If that's the case, we'll do so.»
[22:00:40] yes
[22:00:47] No worries, I didn't believe that for a second, two years ago
[22:17:32] GerardM-: you still around?
[22:19:34] Nemo_bis: current labs can easily do anything the TS could (with an exception of the OSM group as they had their own config)
[22:59:30] !paste
[22:59:30] http://tools.wmflabs.org/paste/
[23:21:18] Wikimedia Labs / deployment-prep (beta): Beta: Gadgets tab is missing - https://bugzilla.wikimedia.org/71988 (se4598) NEW p:Unprio s:normal a:None On (dewiki) beta labs the gadget tab is suddenly missing in the preferences, but the extension is still listed at Special:Version. Possible cau...
[23:23:30] Wikimedia Labs / deployment-prep (beta): Beta: Gadgets tab is missing - https://bugzilla.wikimedia.org/71988 (se4598)
[23:41:10] Nemo_bis: We already have about 60-70% more load than the ts ever had.
[23:41:57] Nemo_bis: And the trend is increasing steadily.
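
Note on the [21:38:17] and [21:38:45] remarks about unbounded result sets versus database cursors: the difference can be sketched roughly as below. This is only an illustration under assumptions, not the code of AutoList2 or any other tool mentioned in the log. The `page` table, the `title` column, the function names and the in-memory SQLite database are all hypothetical stand-ins chosen so the example runs on its own.

```python
import sqlite3


def make_demo_db(n_rows):
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (title TEXT)")
    conn.executemany(
        "INSERT INTO page VALUES (?)",
        (("Q%d" % i,) for i in range(n_rows)),
    )
    return conn


def count_long_titles_unbounded(conn, min_len):
    """Anti-pattern: fetchall() pulls every row into RAM at once,
    so memory use grows with the size of the result set."""
    rows = conn.execute("SELECT title FROM page").fetchall()
    return sum(1 for (title,) in rows if len(title) >= min_len)


def count_long_titles_streamed(conn, min_len, batch_size=1000):
    """Cursor-based version: at most batch_size rows are held in
    Python at any time, regardless of how large the table is."""
    cur = conn.execute("SELECT title FROM page")
    total = 0
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        total += sum(1 for (title,) in batch if len(title) >= min_len)
    return total


if __name__ == "__main__":
    conn = make_demo_db(5000)
    print(count_long_titles_unbounded(conn, 4))
    print(count_long_titles_streamed(conn, 4, batch_size=500))
```

Both functions return the same count; only the peak memory differs. One caveat: with some MySQL client libraries the default cursor buffers the whole result on the client side, so an unbuffered (server-side) cursor may be needed before batching like this actually saves memory.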