[00:24:49] <Svetlana>	 Coren: around still?
[00:25:45] <Svetlana>	 what are the differences in performance of extensions querying a db and of wmlabs tools querying a db?
[00:26:31] <Svetlana>	 is it slower on-wiki because the wiki software artificially slows down queries from extensions? slower on-wiki because the db is in active use? or something else?
[00:27:02] <Svetlana>	 and how fresh is the wmlabs's copy of the wikimedia projects db, how often is it updated?
[01:25:19] <Coren>	 Svetlana: (quickly, about to head to bed); the performance should be comparable, the databases aren't as loaded but there aren't as many of them either.  As for freshness, it's generally a few seconds behind production at most but there are occasional bursts of lag when some tool is doing heavy writing to temporary tables and such (though most of those are fixed as they are found)
[01:25:57] <Svetlana>	 ok
[01:26:23] <Svetlana>	 Coren: https://www.mediawiki.org/wiki/Thread:Extension_talk:DynamicPageList_(Wikimedia)/Performance_concerns_regarding_the_Intersection_extension
[01:26:35] <Svetlana>	 any thoughts? does it make sense for me to write a similar tool at wm labs by hand?
[01:26:50] <Svetlana>	 or would it have the same performance issues at big wikis as DPL did?
[01:27:58] <Coren>	 You may want to look into catscan http://tools.wmflabs.org/catscan2/catscan2.php it publishes an API that allows you to query the category graph.
[01:28:37] <Coren>	 (Though I've linked you to the web interface - not sure where that API is documented but I know other tools use it)
[01:28:43] <Svetlana>	 ok
[01:28:46] <Svetlana>	 ta
[03:15:16] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Add RecentActivityFeed extension to beta labs - 10https://bugzilla.wikimedia.org/69785#c12 (10Andrew Green) Reviewed most of the code, except the javascript. The first issue to deal with is the repetition of code from core. An option that I suggested is to extend Sp...
[03:28:15] <wikibugs>	 3Wikimedia Labs / 3tools: Expose revision.rev_content_format on replicated wikidatawiki - 10https://bugzilla.wikimedia.org/54164 (10Marc A. Pelletier) 5PATC>3RESO/FIX
[03:29:42] <Svetlana>	 https://bugzilla.wikimedia.org/show_bug.cgi?id=70881 do we have a single awake folk here who agreed with the last comment?
[03:29:47] <Svetlana>	 agrees
[03:44:42] <legoktm>	 Svetlana: yes
[03:44:53] * warpath does, 'notice' is quite irritating .no one likes random flashing of their clients just cause most are on multipel channels..
[03:49:15] <warpath>	 maybe they could change the colour of the "topic" of each bug to 4RED for emphasis and BOLD it..
[03:56:16] <Svetlana>	 idk, i will just script my client to convert it to notices then
[03:56:19] <Svetlana>	 ta
[04:15:29] <legoktm>	 Svetlana: what client do you use?
[05:57:26] <Svetlana>	 legoktm: irssi or quassel or erc, unfortunately quassel lacks scripting so..
[06:23:17] <einsbor>	 I have little problem with Intersect Contribs: http://tools.wmflabs.org/intersect-contribs
[06:23:53] <einsbor>	 there is no possibility to select a wiki, there is just "select a wiki", modification of URL doesn't help
[06:24:02] <einsbor>	 any ideas anyone?
[06:25:32] <Svetlana>	 https://meta.wikimedia.org/wiki/User_talk:Pietrodn?action=edit&section=new&preloadtitle=Intersect-contribs%20wiki%20selection%20box%20is%20not%20functional
[06:26:01] <Svetlana>	 he'll help you from there - tell him whether it worked before or not
[06:26:03] <einsbor>	 Svetlana: thx :)
[06:43:26] <icinga-wm>	 PROBLEM - ToolLabs: Puppet failure events on labmon1001 is CRITICAL: CRITICAL: tools.tools.puppetagent.failed_events.value (33.33%)  
[06:46:03] <cydd>	 ya
[06:54:00] <wikibugs>	 3Wikimedia Labs / 3tools: Missing or wrong information in meta_p.wiki table - 10https://bugzilla.wikimedia.org/54962#c6 (10Pietrodn) 5RESO/FIX>3REOP The meta_p.wiki tables now contains only one row: centralauth.  MariaDB [meta_p]> select * from wiki; +-------------+------+------+-------------+------+----...
[06:56:36] <cydd>	 hi
[06:56:44] <pietrodn>	 Coren, the meta_p.wiki table is empty: https://bugzilla.wikimedia.org/show_bug.cgi?id=54962#c6
[06:56:53] <cydd>	 how do you guys handle organizations that delete content favorably for their clients
[06:57:16] <cydd>	 on wikipedia
[06:58:21] <legoktm>	 cydd: I think you're looking for #wikipedia or #wikipedia-en?
[06:58:34] * cydd is looking for legoktm
[06:59:15] <legoktm>	 :)
[06:59:48] <pietrodn>	 Someone told me that the wiki selection dropdown on my tools isn't working… and actually, there aren't any wikis on the DB table :P
[07:00:32] <legoktm>	 that would probably explain why GUC and sulinfo are broken
[07:03:57] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (70.00%)  
[07:30:28] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[07:44:37] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%)  
[08:09:36] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[08:24:01] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (70.00%)  
[08:48:42] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[09:04:02] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (70.00%)  
[09:27:57] <a930913>	 whym_away: Why are you using 25% of tools-dev's RAM in emacs? :/
[09:32:44] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[09:43:45] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (70.00%)  
[09:59:34] <whym>	 a930913: does it look good now?
[10:09:05] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[10:24:28] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (66.67%)  
[10:51:00] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[10:55:57] <pietrodn>	 meta_p.wiki is still broken
[10:56:35] <icinga-wm>	 PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (33.33%)  
[11:05:17] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%) WARN: tools.tools-submit.cpu.total.user.value (55.56%)  
[11:13:45] <icinga-wm>	 RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK  
[11:23:36] <Coren>	 YuviPanda labmon's checks of labmon keep being silly.
[11:30:34] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[11:31:31] <YuviPanda>	 Coren: heh
[11:31:40] <YuviPanda>	 Coren: I'm working on shinken that'll eliminate that particular problem
[11:31:45] <YuviPanda>	 I've no idea why CPU is spiking tho
[11:32:15] <Coren>	 (Also, 70% cpu is critical?  Really?)
[11:36:49] <YuviPanda>	 over 5 minutes! :)
[11:36:57] <YuviPanda>	 hmm, I'll bump it up to 100%
[11:37:23] <YuviPanda>	 Coren: actually, I set it to 70 because they aren't totaled, it's user / system / iowait separately
[11:42:51] <wikibugs>	 3Wikimedia Labs / 3tools: Missing or wrong information in meta_p.wiki table - 10https://bugzilla.wikimedia.org/54962#c7 (10Pietrodn) 5REOP>3RESO/FIX Now the problem seems to be fixed.
[11:44:43] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%)  
[11:48:05] <pietrodn>	 thank you Coren!
[12:08:05] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[12:20:53] <anskar>	 Hi all
[12:21:02] <anskar>	 just a question
[12:21:36] <a930913>	 whym: Yeah, what were you doing? :o
[12:22:06] <anskar>	 i always do 'become <tool>' to do ans run some scropts, but now i can't do it
[12:22:26] <a930913>	 anskar: What's the error?
[12:22:46] <anskar>	 nothing only can't
[12:23:23] <anskar>	 my prompt no change from <user>@tools-login$
[12:23:26] <anskar>	 to
[12:23:44] <anskar>	 <tool>@tools-login$
[12:24:14] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%)  
[12:25:07] <anskar>	 aps, this is my excess cpu?
[12:25:57] <a930913>	 anskar: No, that's just other stuff melting :p
[12:26:18] <a930913>	 anskar: When you become, your shell changes to the tool.
[12:26:32] <a930913>	 anskar: What is stopping you from running your scripts?
[12:26:38] <petan>	 omg
[12:26:46] <petan>	 why is icinga in this channel
[12:27:03] <a930913>	 petan: Because melty stuff?
[12:27:07] <petan>	 !channels
[12:27:07] <wm-bot>	 | #wikimedia-labs-nagios #wikimedia-labs-offtopic #wikimedia-labs-requests
[12:27:18] <petan>	 these exist to keep spam out of here ^
[12:27:31] <YuviPanda>	 petan: this isn't labsicinga, this is prod icinga
[12:27:35] <anskar>	 a930913: yes i know, i try to cancel user scripts
[12:27:51] <a930913>	 anskar: To cancel user scripts?
[12:27:55] <petan>	 yuvipanda: that is one more reason for it not to be here :P
[12:28:06] <YuviPanda>	 it only emits warnings about toollabs
[12:28:24] <anskar>	 I coet|cawiki 
[12:28:29] <anskar>	 *hi
[12:29:36] <anskar>	 coet|cawiki: you are becomed , thanks a930913 coet|cawiki are in same tool;
[12:30:49] <a930913>	 anskar: I don't understand?
[12:31:21] <anskar>	 sorry i can't talk very well in english
[12:31:57] <anskar>	 coet|cawiki: and i are in the same tool from labs, i can answer him
[12:34:11] <anskar>	 thanks a930913 ;)
[12:34:35] <whym>	 a930913: I don't really know, but maybe it was no good that I left a buffer with a high-traffic IRC channel.
[12:43:36] <anskar>	 a930913: solved my problem :)
[12:51:35] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[13:05:45] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%) WARN: tools.tools-submit.cpu.total.user.value (66.67%)  
[13:30:37] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[13:46:22] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%) WARN: tools.tools-submit.cpu.total.user.value (66.67%)  
[14:10:31] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[14:25:31] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%) WARN: tools.tools-submit.cpu.total.user.value (66.67%)  
[14:27:20] <YuviPanda>	 Coren: merge https://gerrit.wikimedia.org/r/160964? bumps up CPU limits
[14:27:22] <YuviPanda>	 for the checks
[14:39:53] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow - 10https://bugzilla.wikimedia.org/70869#c7 (10Nik Everett) Did a bit of digging this morning.  Here is a graph of io load: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410964216.337&target=deployment-prep.deployment-elast...
[14:45:04] <YuviPanda>	 manybubbles: btw, labs ganglia doesn't really work
[14:45:06] <YuviPanda>	 and hasn't for quite a while
[14:51:35] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[15:00:24] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow - 10https://bugzilla.wikimedia.org/70869#c8 (10Yuvi Panda) Note that ganglia on labs has been dead for a long time, and will remain so for the foreseeable future. Do send metrics to graphite instead for labs :)
[15:05:37] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%) WARN: tools.tools-submit.cpu.total.user.value (55.56%)  
[15:06:20] <YuviPanda>	 really?
[15:11:52] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow - 10https://bugzilla.wikimedia.org/70869#c9 (10Chris McMahon) (In reply to Nik Everett from comment #7)  > I'm willing to chalk it up to a spike in requests to beta and intentionally > underpowered systems.  I just want to underline this.  "...
[15:17:39] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow - 10https://bugzilla.wikimedia.org/70869#c10 (10Nik Everett) In this case I'm kind of blind because of lack of ganglia - its really a shame that we don't have it working and/or no one has found the time to port the ganglia monitoring to grap...
[15:22:52] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow on the Beta Cluster - 10https://bugzilla.wikimedia.org/70869#c11 (10Greg Grossmeier) (In reply to Nik Everett from comment #10) > Maybe relevant: I see request spikes in production that don't translate into > huge load spikes because we use...
[15:26:03] <a930913>	 ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery' :@
[15:29:08] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Install and configure pool counter - 10https://bugzilla.wikimedia.org/70940 (10Nik Everett) 3NEW p:3Unprio s:3normal a:3None We should have a pool counter setup in beta similar to how production is installed.  With smaller limits probably.  For background -...
[15:29:24] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Install and configure pool counter - 10https://bugzilla.wikimedia.org/70940 (10Nik Everett)
[15:29:24] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow on the Beta Cluster - 10https://bugzilla.wikimedia.org/70869 (10Nik Everett)
[15:31:02] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[15:45:12] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%) WARN: tools.tools-submit.cpu.total.user.value (55.56%)  
[15:45:50] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Install and configure pool counter - 10https://bugzilla.wikimedia.org/70940 (10Sam Reed (reedy)) s:5normal>3enhanc
[15:48:13] <YuviPanda>	 andrewbogott: hmm, I think I'm just going to use the OS api directly and not bother with implementing a wikitech module, since monitoring will live on labmon1001 anyway
[15:48:19] <YuviPanda>	 andrewbogott: are there docs somewhere on how to access this?
[15:48:40] <YuviPanda>	 andrewbogott: also, are project IPs allocated in a particular subnet?
[15:51:51] <andrewbogott>	 YuviPanda: There aren't wikitech-specific docs, since the api is just the OS api.  I can link you to OS api docs, or you can dig through OpenStackManager code and see what it does.
[15:52:35] <andrewbogott>	 YuviPanda: This page has IP doc range info:  https://wikitech.wikimedia.org/wiki/IP_addresses
[15:52:41] <andrewbogott>	 I think the labs entries are roughly correct
[15:54:28] <andrewbogott>	 YuviPanda: I'm happy to help with the OpenStack stuff.  I'm slightly preoccupied at the moment with trying to kill off virt0 before tampa is shut down
[15:56:07] <YuviPanda>	 andrewbogott: will do
[16:10:16] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[16:18:00] * tonythomas just went to https://wikitech.wikimedia.org and found that the cool 'Manage instance' was moved from the sidebox to Special pages.
[16:24:52] <andrewbogott>	 tonythomas: you're right, something's broken...
[16:25:04] <icinga-wm>	 PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (30.00%)  
[16:25:11] <tonythomas>	 oh. that was something broken ? I thought someone did that 
[16:25:24] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%)  
[16:26:30] <andrewbogott>	 tonythomas: someone did do that:  me.  But by mistake
[16:26:48] <andrewbogott>	 wikitech is in a 'fix one bug, cause another' state lately :(
[16:27:37] <tonythomas>	 oh. it would be easy to find the commit that did this one right ?
[16:28:17] <andrewbogott>	 Maybe?  It's somewhere in operations-mediawiki-config
[16:50:07] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[17:04:16] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%) WARN: tools.tools-submit.cpu.total.user.value (55.56%)  
[17:29:35] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[17:32:07] <icinga-wm>	 RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK  
[17:34:40] <Coren>	 YuviPanda: Dude, shut those alarms off please.  They seem completely unrelated to reality.
[17:34:44] <YuviPanda>	 yeah
[17:36:20] <Base>	 hola
[17:36:36] <Base>	 is there a nice complete list of useful tools
[17:36:37] <Base>	 ?
[17:38:13] <andrewbogott>	 Base: I'm not sure that there's an objective measure of 'useful' but this is a good place to start:  https://tools.wmflabs.org/
[17:39:55] <Base>	 andrewbogott: yeah i'm looking at it but it's autogenerated list which lists also tools which are actually bots. I need a list of tools in meanins we usually use in wiki thus I can find some nice to enhance my using practices :)
[17:45:55] <icinga-wm>	 PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-shadow.cpu.total.user.value (77.78%)  
[17:47:08] <YuviPanda>	 Coren: also, tools-shadow is spiking CPU use http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410976017.204&target=tools.tools-shadow.cpu.total.user.value
[17:47:15] <YuviPanda>	 Coren: isn't it actually unused?
[17:48:38] * YuviPanda pokes Coren
[17:48:41] <YuviPanda>	 around?
[17:48:47] <Coren>	 It looks like that's puppet.
[17:48:52] * Coren tries to figure out why.
[17:49:13] <Coren>	 But also, -shadow isn't unused; it blindly does what -master does except dry runs unless -master fails.
[17:49:33] <YuviPanda>	 Coren: ah, cool
[17:49:37] <YuviPanda>	 Coren: does seem to be puppet: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410976154.453&target=tools.tools-shadow.cpu.total.user.value&target=tools.tools-shadow.puppetagent.time_since_last_run.value
[17:49:52] <Coren>	 I /know/ it's puppet.  Trying to figure out why.
[17:50:06] <YuviPanda>	 puppet running time also went up an order of magnitude
[17:50:07] <YuviPanda>	 http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410976185.782&target=tools.tools-shadow.cpu.total.user.value&target=tools.tools-shadow.puppetagent.total_time.value
[17:51:26] <Coren>	 It seems to be spending most of its time spinning in apt-cache.
[17:53:14] <andrewbogott>	 tonythomas (and everyone) the sidebar issue should be fixed but you many need to (as always) log out and in again to reset your session
[17:53:34] <Coren>	 YuviPanda: looks like the apt lists were b0rked.
[17:53:41] <YuviPanda>	 ah
[17:55:14] <Coren>	 Hm. interesting.  They break again on next run.
[18:11:56] <icinga-wm>	 RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK  
[18:12:14] <YuviPanda>	 Coren: the check *did* catch a real issue :)
[18:12:24] <YuviPanda>	 although it probably shouldn't have flapped like that
[18:12:34] <Coren>	 Indeed not.  But the issue is... bizzare.
[18:15:56] <Coren>	 It looks like the behaviour of apt-get update changed slightly.
[18:18:52] <Coren>	 The problem existed since... 9h UTC or so; every tool labs instance had apt-get spin a CPU on every attempt to check packages.
[18:22:40] <YuviPanda>	 Coren: ow
[18:23:00] <Coren>	 http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410976017.204&target=tools.tools-submit.cpu.total.user.value
[18:23:05] <Coren>	 Shows fixed issue ^
[18:23:25] <Coren>	 Puppet remains a pain.
[18:27:24] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Beta Cluster down due to (?) docroot change and hhvm interaction? - 10https://bugzilla.wikimedia.org/70948 (10Greg Grossmeier) 3NEW p:3Unprio s:3normal a:3None 14:17 < YuviPanda> greg-g: hmm, http://en.wikipedia.beta.wmflabs.org/ is dead 14:17 < YuviPanda> n...
[18:27:51] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Beta Cluster down due to (?) docroot change and hhvm interaction? - 10https://bugzilla.wikimedia.org/70948 (10Greg Grossmeier) p:5Unprio>3Immedi s:5normal>3critic
[18:28:37] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Beta Cluster down due to (?) docroot change and hhvm interaction? - 10https://bugzilla.wikimedia.org/70948 (10Greg Grossmeier) a:3Sam Reed (reedy)
[18:31:23] <Krinkle>	 Hm.. incinga says ssh is timing out to integration slaves
[18:31:27] <Krinkle>	 http://icinga.wmflabs.org/cgi-bin/icinga/status.cgi?hostgroup=integration&style=detail
[18:31:31] <Krinkle>	 false positive?
[18:32:40] <YuviPanda>	 Krinkle: icinga.wmflabs.org has been dead/dying/wrong for a lonnnnggtime
[18:34:28] <Krinkle>	 YuviPanda: I'm managing an increasing number of instances. I *need* proper monitoring that just works and notifies me in some way.... I don't even care what it asserts. Basic ping, diskspace, mem and cpu would be enough.
[18:35:13] <Krinkle>	 I can help implementing it, but it's been almost a year, and every other day I find out blantantly obvious problems that have gone on for days unnoticed.
[18:35:19] <YuviPanda>	 Krinkle: so, right now you can do that (temporarily) easily with prod icinga doing graphite checks. I'm currently working on a shinken installation for labs that'll make it easier for labs.
[18:35:59] <YuviPanda>	 Krinkle: see https://gerrit.wikimedia.org/r/#/c/159709/
[18:36:31] <YuviPanda>	 Krinkle: it's triival to add basic monitoring (diskspace, cpu, puppet). however, I presume cvn isn't puppetized, so unsure how other opsen will feel about just monitoring for it living in ops/puppet
[18:36:46] <Krinkle>	 integration is puppetised
[18:37:01] <Krinkle>	 that's the one I care about most, and opsen should too, since Jenkins depends on it
[18:37:12] <YuviPanda>	 Krinkle: oh, integration. right, that should be simple enough.
[18:37:37] <YuviPanda>	 Krinkle: you can add CPU checks similar to https://gerrit.wikimedia.org/r/#/c/159751/
[18:37:43] <Krinkle>	 I'd like to know whether my recent changes to trusty instances is causing /tmp to be filled up, and whether adding hhvm tests in addition to node is making it overload or not before it actually overloads etc.
[18:38:09] <YuviPanda>	 Krinkle: I can setup simple disk space + CPU checks for you if you want
[18:38:23] <Krinkle>	 that'd be very cool :)
[18:38:33] <YuviPanda>	 Krinkle: alright, let me write a patch
[18:38:45] <Krinkle>	 oh nice, separate notification groups
[18:39:36] <Krinkle>	 But I guess they all still go to -operations? that's fine for beta/tools/integration, but for cvn and other random labs projects maintained by the community should not. But those can wait until the new infra is there for labs itself.
[18:40:08] <mutante>	 Krinkle: icinga-wm has separate configs for 3 channels
[18:40:25] <mutante>	 ops will get all, -labs will get toollabs -qa will get beta
[18:40:35] <mutante>	 depending on the contact_group on a service
[18:40:51] <Krinkle>	 mutante: cool
[18:40:52] <Krinkle>	 https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/echoirc.pp#L14
[18:41:20] <Krinkle>	 https://github.com/wikimedia/operations-puppet/blob/e5a40005c5c13ebe5e549c8aeb65f880e8fd5b38/files/icinga/misccommands.cfg#L80
[18:41:22] <Krinkle>	 interesting
[18:41:55] <YuviPanda>	 mutante: can you add Krinkle as a contact in the private repo?
[18:42:01] <YuviPanda>	 Krinkle: I'll set alerts to you and hashar?
[18:42:08] <Krinkle>	 YuviPanda: integration => krinkle, hashar
[18:42:09] <Krinkle>	 yeah :)
[18:42:19] <Krinkle>	 YuviPanda: what protocol does this use? LDAP e-mail?
[18:42:35] <YuviPanda>	 Krinkle: for the email address? no, mutante adds an entry with your email + username in the private repo :)
[18:42:53] <Krinkle>	 right, it has its own definition, ok
[18:43:03] <Krinkle>	 makes sense in case ldap is down :P
[18:43:08] <mutante>	 Krinkle: yes, that and https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/echoirc.pp
[18:43:12] <mutante>	 YuviPanda: yes
[18:43:36] <Krinkle>	 hashar => amusso and krinkle => ttijhof @wikimedia.org in taht case
[18:43:52] <mutante>	 pretty sure hashar already exists.. sec
[18:44:15] <Coren>	 hashar is a figment of our imagination.  :-)
[18:44:19] <mutante>	         contact_name                    amusso
[18:44:19] <mutante>	         alias                           hashar
[18:44:42] <YuviPanda>	 mutante: ya, hashar exists
[18:46:10] <mutante>	 627         contact_name                    krinkle
[18:46:15] <mutante>	 done, added
[18:46:54] <mutante>	 Krinkle: i just copied the "defaults", that means 24x7 , per email and for all types (crit,recover,down..)
[18:46:56] <icinga-wm>	 RECOVERY - ToolLabs: Puppet failure events on labmon1001 is OK: OK: All targets OK  
[18:47:04] <YuviPanda>	 mutante: can you merge https://gerrit.wikimedia.org/r/#/c/161015/ now?
[18:47:20] <Krinkle>	 mutante: ok
[18:47:21] <mutante>	 if you wanted to you could have custom timezones instead of 24x7 but people just care when it creates SMS :p
[18:47:53] <Krinkle>	 yeah, my mail and phone noise are filtered locally
[18:49:18] <mutante>	 YuviPanda: merging that
[18:49:36] <YuviPanda>	 Krinkle: cool, mutante merged it. now it's gonna take a couple of puppet runs, and then you should be able to see status on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon
[18:50:24] <mutante>	 yea, just give it a while, puppet on neon can be a bit slow
[19:09:22] <wikibugs>	 3Wikimedia Labs: /var is running out of space on extdist2 - 10https://bugzilla.wikimedia.org/70952 (10Kunal Mehta (Legoktm)) 3NEW p:3Unprio s:3normal a:3Yuvi Panda Notes so far:  * /var/log/extdist was 571M, we should set up log rotate for it (had stuff from July) * /var/log/atop was 183M, Yuvi says it...
[19:14:03] <YuviPanda>	 Krinkle: at this point, I think graphite based icinga checks are somewhat 'self serve', in the sense submit a patch, get someone to merge it and ta!
[19:19:55] <legoktm>	 !log extdist cleared out a bunch of logs from /var on extdist2 since it was basically full, see bug 70952
[19:19:58] <labs-morebots>	 Logged the message, Master
[19:38:05] <spagewmf>	 I ran `labs-vagrant provision` and got errors from Mediawiki::Import_dump[labs_privacy] ending in "Error: mwscript importDump.php --wiki=wiki /vagrant/puppet/modules/labs/files/labs_privacy_policy.xml returned 1 instead of one of [0]"
[19:38:53] <bd808>	 spagewmf: lame. Can you run the command manually to see if you get more info?
[19:45:54] <spagewmf>	 bd808: thanks for caring :).  I think it's complaining that the script run has a  "PHP Warning:  include_once(/vagrant/mediawiki/extensions/Parsoid/php/Parsoid.php): failed to open stream: No such file or directory in /srv/vagrant/settings.d/puppet-managed/10-Parsoid.php", maybe if I resolve that then Mediawiki::Import_dump will stop complaining about unexpected return
[19:46:46] <gwicke>	 spagewmf: the extension is now in /vagrant/mediawiki/extensions/Parsoid/Parsoid.php
[19:47:01] <bd808>	 ^ that. and mwv doesn't know how to fix it
[19:47:22] <bd808>	 Actually the role is probably wrong now
[19:47:23] <spagewmf>	 gwicke: right, maybe a bug that the puppet-managed bit points to old location
[19:47:25] <gwicke>	 production is already updated, I guess something specific to vagrant isn't yet?
[19:47:40] <bd808>	 gwicke: yup. I bet nobody thought of that yet
[19:47:56] * gwicke wished we didn't have forked config systems
[19:48:50] <bd808>	 spagewmf: This line needs to be removed from the puppet code for mwv -- https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/modules/mediawiki/manifests/parsoid.pp#L65
[19:49:22] <bd808>	 If you edit that locally and run vagrant provision it should fix things
[19:49:40] <bd808>	 And if it does, please upload a patch to gerrit
[19:51:08] <bd808>	 gwicke: I feel your pain. Configuring prod, beta, and MediaWiki-Vagrant from the same set of descriptions is pretty hard though. There are always going to be local differences.
[19:51:44] <bd808>	 If just having a deb magically fixed things we wouldn't need most of the code in our puppet repo
[19:52:15] <gwicke>	 yeah, especially if the defaults are sane too
[19:52:33] <gwicke>	 there's not much traction on packaging mw core & extensions though
[19:52:49] <gwicke>	 nobody seems to care
[19:53:13] <gwicke>	 spagewmf: better: remove the php/ prefix in that line
[19:53:15] <bd808>	 I have high hopes for Composer integration but haven't had any "free" time to work on the problem
[19:54:11] <bd808>	 gwicke: The default behavior for mediawiki::extension is to expect <extension>/<extension>.php as entry point
[19:54:20] <spagewmf>	 trying it...
[19:54:38] <gwicke>	 bd808: ahh, good point
[19:59:23] <wikibugs>	 3Wikimedia Labs / 3tools: LOAD INTO FILE permission on MySQL tools-db server for user created databases - 10https://bugzilla.wikimedia.org/70956 (10Dereckson) 3NEW p:3Unprio s:3normal a:3Marc A. Pelletier Currently, the FILE MySQL privilege isn't granted, so tools users can't use mysqlimport or state...
[20:02:25] <spagewmf>	 heh, "Why isn't my remote git command in the terminal scrollback?"  Select entire terminal scrollback, copy, paste" BOOM gnome-terminal crash :-/ 
[20:07:15] <spagewmf>	 bd808: good news: your fix avoids the PHP warning, https://gerrit.wikimedia.org/r/161036
[20:11:44] <spagewmf>	 bd808: bad news: still labs-provision failure, starting with "Notice: /Stage[main]/Role::Labs_initial_content/Mediawiki::Import_dump[labs_privacy]/Exec[import_dump_labs_privacy]/returns: PHP Warning:  DOMDocument::load(): I/O warning : failed to load external entity "/srv/vagrant/mediawiki/languages/data/plurals.xml" in /srv/vagrant/mediawiki/includes/cache/LocalisationCache.php on line 656"
[20:13:01] <ebernhardson>	 hmm, we have a security rule against allowing external entity loading
[20:13:25] <ebraminio>	 Hi. What should I do to make my oauth app approved?
[20:13:53] <bd808>	 Hmm... I think that has something to do with xml and hhvm. I thought it was just a warning though?
[20:14:05] <bd808>	 spagewmf: ^
[20:14:08] <spagewmf>	 ebernhardson , bd808: the file is present, it has an  interesting '<!DOCTYPE supplementalData SYSTEM "../../common/dtd/ldmlSupplemental.dtd">'
[20:15:08] <bd808>	 ebraminio: You have submitted a request and are just waiting for approval?
[20:15:16] <ebraminio>	 bd808: Yes
[20:15:56] <ebernhardson>	 spagewmf: which server is this?
[20:15:57] * bd808 pokes Deskana|Away and robla about oauth requests waiting for approval
[20:16:15] <bd808>	 ebraminio: That message ^ may get you some attention
[20:16:31] <ebraminio>	 bd808: Thank you
[20:16:45] <bd808>	 There is still a pretty small number of reviewers for the requests I think
[20:18:20] <spagewmf>	 ebernhardson: I'm trying to get flow-tests running so danny can see the new Shahyar dialog
[20:18:26] <YuviPanda>	 ebraminio: you can poke Reedy as well for oauth i think
[20:19:52] <spagewmf>	 bd808: maybe it's just a warning, but I'm guessing import_dump_labs_privacy is unhappy that it returned this rather than 0, so  labs-provision thinks it failed
[20:20:07] <bd808>	 ebraminio: Any of these folks should be able to help -- https://www.mediawiki.org/w/index.php?title=Special:ListUsers&group=oauthadmin
[20:20:37] <bd808>	 spagewmf: Ah. Extra lame
[20:20:59] * bd808 is in a 1991 timewarp today
[20:21:15] <YuviPanda>	 bd808: I wonder if the labs-vagrant error is because I think imports are broken on hhvm?
[20:21:19] <YuviPanda>	 were, at least
[20:21:35] <ebraminio>	 It seems just I can authorize myself on it but as someone else want to help on the development it would be nice if I could get my tool approved to show him also
[20:21:59] <ebraminio>	 bd808: Well I guess if I highlight all of them I would kicked :)
[20:22:06] <bd808>	 YuviPanda: They were/are broken via the special page but I think cli just gives warnings.
[20:22:27] <YuviPanda>	 maybe it's failing on the warnings?
[20:22:31] * YuviPanda is making wild guesses
[20:22:48] <bd808>	 ebraminio: :) Of the 5 there, I pinged 2 and 1 is out on leave.
[20:23:38] <bd808>	 so only anomie and Eloquence haven't been bugged yet to check for pending OAuth requests on mw.o
[20:23:40] <bd808>	 :)
[20:23:53] <ebraminio>	 bd808: Thank you :)
[20:24:51] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Beta Cluster down due to (?) docroot change and hhvm interaction? - 10https://bugzilla.wikimedia.org/70948#c1 (10Sam Reed (reedy)) [21:17:26] <Reedy> The apache configs are pointing at the new /srv based docroots [21:18:19] <Reedy>     <Directory "/srv/mediawiki/doc...
[20:24:59] <bd808>	 spagewmf: I can tell you a hack to keep things moving. Edit puppet/hierdata/environment/labs.yaml and comment out the "classes: ..." line
[20:26:42] <bd808>	 spagewmf: Update your bug with the new info and I'll try to find time to look into it "soon". Or maybe we can trick YuviPanda into investigating deeper.
[20:27:12] <chippy>	 I'll shortly be asking for oauth approval also - should I ask via bugzilla or ping via here when ready?
[20:28:36] <wikibugs>	 3Wikimedia Labs / 3deployment-prep (beta): Beta Cluster down due to (?) docroot change and hhvm interaction? - 10https://bugzilla.wikimedia.org/70948#c2 (10Sam Reed (reedy)) p:5Immedi>3Normal s:5critic>3normal http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page works fine too  Some weird redirect or...
[20:28:55] <bd808>	 chippy: good question. I really don't know the "normal" workflow. I would hope that the reviewers get an email or something when a new request comes in on-wiki.
[20:30:09] <chippy>	 i had to make a couple of extra ones thought for development and testing to change the callbacks etc... I think most of those added were similarily development only
[20:30:16] <bd808>	 The on-wiki docs seem to think all you have to do is ask via the special page -- https://www.mediawiki.org/wiki/OAuth/For_Developers#Application_Approval
[20:31:05] <bd808>	 I may be doing something with OAuth soon as well and I'll try to update the docs when I find things out.
[20:31:25] <chippy>	 thanks bd808 
[20:31:47] <chippy>	 I hope that as it gets more accepted and settled the flow will improve
[20:32:05] <chippy>	 it's certainly better than libraries using http basic usernames and passwords :)
[20:33:42] <bd808>	 I think it will get better. The group that built the backend bits has been spread pretty thin since it launched. There was a desire to have a settling in period before asking stewards to mange the requests.
[20:34:05] <bd808>	 Maybe when the endpoint gets moved from mw.o to meta they will add more folks to approve things
[20:34:56] <chippy>	 yep
[20:36:10] <marxarelli>	 spagewmf: still getting the error?
[20:37:01] <marxarelli>	 i had the same problem yesterday but didn't have time to write up a bug
[20:37:47] <marxarelli>	 spagewmf, bd808: i believe it's due to www-data not having write access to /srv/vagrant/mediawiki/cache
[20:37:52] <spagewmf>	 marxarelli, bd808: I filed bug 70959.  Nope, # commenting out that classes line in puppet/hieradata/environment/labs.yaml did not help
[20:38:43] <spagewmf>	 I'll try to paste the entire labs-vagrant provision output into an attachment (which crashed gnome-terminal last time :-) )
[20:38:48] <marxarelli>	 try `chmod g+w /srv/vagrant/mediawiki/cache` before running `labs-vagrant provision`
[20:38:54] <bd808>	 www-data really does need to own that cache dir
[20:39:25] <ebraminio>	 bd808: oh reputable developers, we want just basic user info and my friend is +2 on pywiki development
[20:39:36] <marxarelli>	 i'll comment in the report
[20:39:40] <bd808>	 spagewmf: PUPEPT_DEBUG=1 labs-vagrant provision | tee somefile.log
[20:40:13] <bd808>	 Then attach the log file to the bug for great success
[20:40:22] <marxarelli>	 bd808: that backtrace is massive, too. it took me a while to even find the real culprit
[20:40:43] <marxarelli>	 and then i didn't file a bug...
[20:40:48] <bd808>	 heh
[20:40:49] * marxarelli feels an intense guilt
[20:41:24] <bd808>	 labs-vagrant is even a lower class citizen than mw-vagrant on windows :)
[20:41:40] <marxarelli>	 oh man, that's low
[20:41:53] <YuviPanda>	 yeah
[20:42:06] <bd808>	 The docker stuff I started might even the playing field but it needs more automation on the ops/puppet.git side
[20:42:59] <bd808>	 I fix it up some whenever I spin up a new instance for some testing project but that doesn't keep up with all the changes
[20:44:30] <spagewmf>	 marxarelli: eh, I might have missed your bug report anyway. bd808 but the single-instance mediawiki role has its own problems (/srv/mediawiki always owned by root, etc.)
[20:45:49] <bd808>	 single-instance should be deprecated and then killed IMHO. But labs-vagrant is tough to keep in sync without running a "real" container on the labs host.
[20:46:03] <bd808>	 The docker stuff makes that possible
[20:46:06] <spagewmf>	 speaking of instances, https://wikitech.wikimedia.org/wiki/Special:NovaInstance seems broken, doesn't show any instances for my projects editor-engagement and tools
[20:46:19] <bd808>	 but using docker as a full vm has it's own set of quirks
[20:46:48] <spagewmf>	 bd808: docker does sound excellent.  Why run it as a full VM, can't it just be a container on an Ubuntu instance
[20:47:14] <bd808>	 spagewmf: Have you tried logging out and logging back in? The memcache server for wikitech was restarted this morning and that can cause permissions issues.
[20:47:25] <spagewmf>	 supposedly vagrant can work with a docker container, I don't understand enough to try it.
[20:47:45] <bd808>	 spagewmf: https://github.com/wikimedia/mediawiki-vagrant/tree/master/support/docker
[20:49:47] <spagewmf>	 bd808: good news x2; logging out fixed Special:NovaInstance; and re-running provision completed with no errors
[20:50:00] <bd808>	 cool beans!
[20:50:15] <bd808>	 So the directory permission change fixed it?
[20:54:40] <spagewmf>	 bd808: that was the last change I made, yes.  bug 70959 updated
[20:57:11] <spagewmf>	 next: web requests return status 503 because database or disk is full (/var/run/hhvm)
[20:58:18] <ebraminio>	 anomie: Thank you very much! :)
[20:59:43] <spagewmf>	 /run/hhvm/hhvm.hhbc is 201MB and tmpfs           201M  201M  4.0K 100% /run
[21:02:42] <marxarelli>	 bd808: ^ once https://gerrit.wikimedia.org/r/#/c/160571/ is merged we can override the hhvm repo path for labs
[21:02:46] <marxarelli>	 oh glorious hiera
[21:03:22] <bd808>	 all hail hiera
[21:04:55] <bd808>	 spagewmf: I swear I have a labs-vagrant host running hhvm and did not have to work around things that I didn't patch upstream. :( I haven't updated it for a couple of weeks though.
[21:06:27] <marxarelli>	 bd808: oh wait, the hhbc should be in /var/run/hhvm according to the vagrant template. does it get provisioned differently in labs?
[21:06:57] <bd808>	 It shouldn't. Maybe spagewmf has a full /var?
[21:07:11] <bd808>	  /var in labs is tiny
[21:07:47] <bd808>	 ah /var/run is a symlink to /run
[21:08:32] <marxarelli>	 i checked that, and it doesn't seem to be for me
[21:08:41] <marxarelli>	 at least on the instance i spun up yesterday
[21:08:51] <bd808>	 It is for me on sul-test.eqiad.wmflabs
[21:09:12] <marxarelli>	 hmm... does it depend on the instance size maybe?
[21:09:20] <marxarelli>	 i'm using a medium
[21:09:38] <bd808>	 maybe? Let me see what size I built
[21:10:52] * bd808 does logout / login dance
[21:11:36] <marxarelli>	 i mean /var is still small, but not prohibitively so like the /run tmpfs
[21:11:39] <bd808>	 My instance is m1.small with role::labs::vagrant as the only applied config
[21:11:54] <bd808>	 but for me /var/run -> /run
[21:12:05] <marxarelli>	 for me: /dev/vda2                                          1.9G  474M  1.3G  27% /var
[21:12:26] <marxarelli>	 oh, wtf...
[21:12:28] <marxarelli>	 nm :)
[21:12:35] <bd808>	 ls -ald /var/run
[21:12:42] <spagewmf>	 bd808, marxarelli flow-tests's /run is only 201MB :(, 10% of memory size?
[21:13:12] * marxarelli makes note of the subtle difference between ls -ld /var/run/ and ls -ld /var/run
[21:13:49] <marxarelli>	 yeah, so we will need to move the hhbc then
[21:14:01] <bd808>	 spagewmf: same for me but luckily df says: tmpfs  201M  125M   77M  62% /run
[21:14:54] <spagewmf>	 bd808: how big is your /run/hhvm/hhvm.hhbc ?  Mine is 201MB, hence sadness
[21:15:15] <bd808>	 Mine is currently 124M
[21:15:37] <bd808>	 spagewmf: is this an instance that has been around for a while?
[21:15:51] <bd808>	 hhvm does something I don't love when the binary is upgraded
[21:15:58] <marxarelli>	 mine's at 13M but i haven't hit the wiki with my tests yet
[21:16:15] <bd808>	 it changes the schema version for the hhbc but doesn't purge old data
[21:16:25] <spagewmf>	 bd808: flow-tests was created after the migration to eqiad.  Can I delete hhbc?
[21:16:42] <bd808>	 yeah. and then `sudo service hhvm restart`
[21:16:48] <bd808>	 it will make a new one
[21:17:00] <jeremyb>	 andrewbogott: seeing SGE failures mail...
[21:17:15] <jeremyb>	 Coren: ^
[21:17:22] <bd808>	 This exact problem is going to bite us in production at some point
[21:17:26] <jeremyb>	 4x so far
[21:17:29] <mutante>	 <+icinga-wm> PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL -
[21:17:33] <mutante>	 ?
[21:18:10] <jeremyb>	 mutante: yeah, maybe ldap screwed? i can't login by ssh
[21:18:11] <marxarelli>	 bd808: would /var/cache make more sense as a location for the hhbc?
[21:18:16] <marxarelli>	 i.e. https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard
[21:18:45] <jeremyb>	 mutante: can't to tools-login. can login to deployment-bastion and bastion1
[21:18:54] <bd808>	 marxarelli: That is a FHS lawyer question. :)
[21:19:02] <marxarelli>	 haha
[21:19:11] <marxarelli>	 "i just work here"
[21:19:18] <andrewbogott>	 jeremyb: that's may be a result of a change I just made, give me a minute to revert
[21:19:24] <spagewmf>	 bd808: hhvm can't make a new one, "Failed to initialize central HHBC repository:\n  Failed to initialize schema in /var/run/hhvm/hhvm.hhbc: RepoQuery::step(repo=0x7fe5eb067000) error: 'COMMIT;' --> (13) database or disk is full\n  Failed to open /var/www/.hhvm.hhbc: 14 - unable to open database file"
[21:19:36] <jeremyb>	 andrewbogott: k
[21:20:18] <mutante>	 marxarelli: what's the current path?
[21:20:33] <bd808>	 spagewmf: try stopping, deleting and then starting again. Open filehandles probably put it back
[21:20:39] <marxarelli>	 mutante: /var/run/hhvm/hhvm.hhbc
[21:20:48] <jeremyb>	 mutante: /var/run/hhvm; as spagewmf just pasted
[21:21:10] <bd808>	 The problem is really that labs instances are not prod sized. 10% of 2G is not much
[21:21:34] <jeremyb>	 hah
[21:21:37] <bd808>	 Also hhvm leaves garbage in the hhbc cache
[21:21:53] <jeremyb>	 bbl
[21:22:24] <bd808>	 which is not a problem at FB because they use repo authoritative and replace the whole hhbc with each deployment
[21:22:59] <marxarelli>	 yeah, if /var is filling up that's a greater issue
[21:23:06] <mutante>	 marxarelli: i think /var/cache sounds right, but it could also be /srv :p
[21:23:09] <spagewmf>	 bd808: yeah, something must have a handle open. I stopped hhvm, removed the 0-length hhvm.hhbc, and still 100% full even though du -k reports nothing over 4kB.  I'll reboot the instance3
[21:23:23] <bd808>	 bigger hammer theory
[21:24:17] <spagewmf>	 how can you find who has a handle to an invisible deleted file?  lsof magic?
[21:24:38] <spagewmf>	 yup, php does!
[21:24:48] <bd808>	 jobrunner
[21:24:55] <spagewmf>	 ding ding
[21:25:05] * bd808 shakes fist at that
[21:25:18] <bd808>	 we should run the jobrunner explicitly with php5
[21:25:25] <marxarelli>	 mutante: /srv would be the pragmatic choice, looking at what's available in labs
[21:25:26] <spagewmf>	 how do I halt jobrunner, "unrecognized service"
[21:25:39] <mutante>	 marxarelli: http://www.pathname.com/fhs/pub/fhs-2.3.html#SRVDATAFORSERVICESPROVIDEDBYSYSTEM
[21:25:50] <mutante>	 "data served" shrug :)
[21:26:03] <marxarelli>	 makese sense
[21:26:05] <bd808>	 spagewmf: It should be in /etc/init and controlled by `service jobrunner stop`
[21:26:08] <mutante>	 and "site-specific" 
[21:26:11] <andrewbogott>	 jeremyb: ok, that ought to quiet things down...
[21:26:11] <marxarelli>	 but so does "application cache"
[21:26:19] <andrewbogott>	 ...maybe
[21:26:27] <marxarelli>	 they're both pretty nebulous
[21:26:46] <bd808>	 spagewmf: If not `kill PID` should work too
[21:26:53] <mutante>	 marxarelli: yes, and i always like how the standard has sentences like this "The methodology used to name subdirectories of /srv is unspecified as there is currently no consensus on how this should be done. "
[21:27:20] <spagewmf>	 bd808: thanks, I always mistype `sudo service stop jobrunner` :)
[21:27:24] <andrewbogott>	 jeremyb: is your login working again now?
[21:28:03] <bd808>	 mutante: /srv/<project>/<whatever>; why have global consensus for a location that is entirely site specific?
[21:28:44] <bd808>	 "Here's a place to do whatever you want, but you must follow these rules when doing so"
[21:30:43] <bd808>	 Personally I'd endorse /srv/<project>/<FHS-like path> for higher discoverability and lower wtfs-per-minute
[21:31:25] <bd808>	 So /srv/hhvm/var/run/hhvm.hhbc would make sense to me
[21:31:45] <mutante>	 bd808: well it can be site-specific and still have a consensus like  /srv/org/wikimedia  or not
[21:31:51] <bd808>	 Although it should really be /srv/mediawiki/.... and /srv/jobrunner/...
[21:32:08] <mutante>	 technically it should be /srv/org/wikimedia/mediawiki/ :p /me hides
[21:32:18] <bd808>	 But we just squatted on /srv/mediawiki as $IP for the cluster code
[21:32:35] <jeremyb>	 andrewbogott: no, and still getting SGE fail spam
[21:32:56] <mutante>	 bd808: /srv/<project>/<FHS-like path>   also sounds good
[21:33:15] * marxarelli regrets bringing FHS into the equation
[21:33:16] <andrewbogott>	 jeremyb: hm, I see it too -- I can log in to a box in a different project but not to tools-login
[21:34:00] <jeremyb>	 andrewbogott: [21:18:45] <jeremyb> mutante: can't to tools-login. can login to deployment-bastion and bastion1
[21:34:02] <jeremyb>	 :)
[21:34:07] * jeremyb will bbl again
[21:34:17] <andrewbogott>	 jeremyb: ok, I see another candidate
[21:34:35] <andrewbogott>	 This is that old "two ldap servers break twice as often as one" problem
[21:35:22] <andrewbogott>	 jeremyb: fixed now, yes?
[21:40:42] <andrewbogott>	 !log tools caused a brief auth outage while messing with codfw ldap
[21:40:45] <labs-morebots>	 Logged the message, dummy
[21:49:14] <jeremyb>	 andrewbogott: i got job failed timestamped 21:42:19. so either taking time to recover or still broken
[21:49:22] <jeremyb>	 but i could log in
[21:49:31] <jeremyb>	 i guess you get these mails too
[21:49:43] <andrewbogott>	 yeah… seems like they've eased off but maybe I'm fooling myself
[21:50:01] <jeremyb>	 rebye
[22:22:34] <spagewmf>	 bd808: flow-tests wiki worked after reboot. But now puppet is at 99.5% of CPU, is that expected after a reboot?  strace shows it checking ruby gems and trying to stat git::clone in all kinds of ruby directories.
[22:23:25] <spagewmf>	 never a dull moment :)  BTW I filed bug 70967 about hhvm.hhbc filling tmpfs
[22:24:01] <bd808>	 spagewmf: the puppet started by `labs-vagrant provision` or the puppet agent process that talks to the LAbs puppetmaster?
[22:24:46] <bd808>	 If the former, I haven't see that behavior. If the later, dunno.
[22:26:09] <spagewmf>	 umm, "/usr/bin/ruby /usr/bin/puppet agent".  The labs provision finished fine after your two fixes, then I successfully deleted hhvm.hhbc, then I rebooted, and now it seems OK but that puppet has been running flat out for 10 minutes of CPU time
[22:27:39] <bd808>	 "puppet agent" is the process that talks to the labs puppetmaster. Sounds like it's gone nuts on you.
[22:28:12] <bd808>	 labs-vagrant does a "puppet apply" when you ask it to provision and should not leave a process running
[22:28:58] <bd808>	 I see that puppet agent is eating 99% cpu on my testing server too
[22:29:08] <bd808>	 at least on and off
[22:29:17] <bd808>	 seems to have gone quite for me now
[22:29:20] <spagewmf>	 bd808: it keeps stating the same set of paths for git::clone.rb over and over
[22:30:11] <bd808>	 You can kill with `sudo service puppet stop` and then restart it and see if it heals itself.
[22:30:55] <bd808>	 Oh labs runs it via cron I think
[22:30:59] <bd808>	 hmmm...
[22:31:14] <bd808>	 That probaby means a patch I wrote last night is wrong
[22:31:58] <bd808>	 spagewmf: /etc/cron.d/puppet starts it 3 times an hour
[22:33:27] <spagewmf>	 bd808: killed, I'll watch for it restarting.  Do I file the bug in Wikimedia Labs > Infrastructure ?
[22:34:37] <bd808>	 spagewmf: Maybe? you could try running the command in the crontab interactively to see if it was sunspot or is repeatable
[22:42:37] <spagewmf>	 bd808: the :34 cron job had started and was maxed away; starting by hand is also at 99% CPU.  So repeatable for me :)
[22:43:05] <spagewmf>	 maybe I should take flow-tests.wmflabs behind the barn and put it out of its misery :)
[22:56:08] <wikibugs>	 3Wikimedia Labs / 3Infrastructure: puppet agent on labs-vagrant instance using 99% of CPU, looking for git::clone.rb - 10https://bugzilla.wikimedia.org/70971 (10spage) 3NEW p:3Unprio s:3normal a:3None After successfully provisioning the labs-vagrant instance flow-tests.wmflabs (see bug 70967), I rebo...
[23:06:42] <bd808>	 spagewmf: You could get a bit better info for the puppet agent bug by adding `--debug` when you run it from the cli and tee'ing the output to a log file.
[23:23:22] <wikibugs>	 3Wikimedia Labs / 3Infrastructure: puppet agent on labs-vagrant instance using 99% of CPU, looking for git::clone.rb - 10https://bugzilla.wikimedia.org/70971#c1 (10spage) $ sudo timeout  -k 300 1800 puppet agent --onetime --debug --verbose --no-daemonize --no-splay --show_diff  2>&1 | tee /tmp/puppet.out  sh...
[23:24:25] <spagewmf>	 bd808: it seems to be at "Exec[migrate legacy files](provider=posix)"
[23:25:20] <bd808>	 spagewmf: Damn it. :) That is code for the labs-vagrant role.
[23:25:40] <bd808>	 https://github.com/wikimedia/operations-puppet/blob/23f71c00eb9e5cd84e25dc78006189275eb51d7d/modules/labs_vagrant/manifests/init.pp#L19-L23
[23:27:07] <bd808>	 spagewmf: It is trying to use tar to copy /mnt/vagrant to /srv/vagrant -- https://github.com/wikimedia/operations-puppet/blob/production/modules/labs_vagrant/templates/migrate_legacy.erb
[23:27:52] <bd808>	 Which could take a while I suppose and would eat cpu and disk iops as it did
[23:29:12] <spagewmf>	 bd808 but there's no /mnt/vagrant, `/usr/bin/test -d /mnt/vagrant` returns 1.  Is it 0 or 1 that onlyif tests for?
[23:31:34] <spagewmf>	 the Git::Clone probably relates to all the stat()ing going on for git::clone.rb
[23:42:05] <bd808>	 spagewmf: yes, the onlyif should keep the migration tool from running if the /mnt/vagrant directory is missing. I thought that with --debug puppet would spew enough output that you'd see it start whatever task made it go nuts and eat the cup though.
[23:44:36] <spagewmf>	 alas, no output after "Debug: Executing '/usr/bin/test -d /mnt/vagrant'".  So the migrate_legacy command shouldn't run (the test returns 1), yet the before => Git::Clone['vagrant'] sure looks like what's causing the strace output.  (/me pretends to grok puppet)
[23:45:08] <spagewmf>	 bd808 should I reboot again?  I kinda need to get this server working, one way or another
[23:45:37] <bd808>	 spagewmf: well... the before is a guarantee but not a promise that the next code will be that.
[23:46:18] <bd808>	 and the first thing that git::clone will do is check to see that /srv/vagrant/.git exists and exit I think...
[23:47:12] <bd808>	  /srv/vagrant/.git/config I guess -- https://github.com/wikimedia/operations-puppet/blob/production/modules/git/manifests/clone.pp#L125
[23:48:54] <bd808>	 spagewmf: Rebooting probably won't make things worse. If you have the energy I'd actually recommend building a whole new instance at this point. If it works shoot the old one in the head and forget about it.
[23:49:16] <bd808>	 "vms are cattle not pets"
[23:51:09] <spagewmf>	 bd808: flow-tests has /srv/vagrant/.git/config   So why does it go nuts and look for "git::clone.rb" ?  flow-tests only has /vagrant/puppet/modules/git/manifests/clone.pp, I don't see any git::clone.rb anywhere
[23:53:22] <bd808>	 spagewmf: What roles have you applied to the instance from wikitech? The puppet code that is executing is operations/puppet.git not MWV
[23:55:55] <spagewmf>	 bd808: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000498.eqiad.wmflabs says base, role::labs::instance, sudo::labs_project, role::labs::vagrant
[23:57:38] <bd808>	 spagewmf: That matches my most recent instance too -- https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000004cd.eqiad.wmflabs