[00:11:26] <wikibugs>	 Wikimedia Labs / Infrastructure: Add trebuchet user to wikidev group - https://bugzilla.wikimedia.org/62843 (Bryan Davis) p:Unprio>Normal
[14:22:01] <hedonil>	 Krinkle: would you mind hitting the <merge-button> on my micro pull request  https://github.com/Krinkle/intuition/pull/23
[14:22:38] <hedonil>	 Krinkle: I know it's not well-hung ;-), but without I can't completely switch to Intuition...
[14:25:11] <hedonil>	 scfc_de: hi, I dug deep into lighty, writing a longer note to the wikitech talk page right now
[14:25:45] <hedonil>	 scfc_de: JFI. I'm running a custom lighty on -tomcat right now (with xtools)
[14:26:39] <scfc_de>	 hedonil: As long as you only break your own tools ... :-)  Always good when someone does some actual research.
[14:27:36] <valhallasw>	 hedonil: there already is some stuff on running your own web server. Would be good to extend it indeed :-)
[14:28:18] <hedonil>	 valhallasw: hey! I copied your stuff to do that ;)
[14:28:25] <valhallasw>	 good :-p
[14:28:31] <hedonil>	 hehe
[14:28:41] <valhallasw>	 also, if it's still lighttpd, it's probably OK to run it on the lighttpd webgrid hosts -- maybe check with Coren 
[14:29:34] <hedonil>	 scfc_de: no breaks, Señor! full testing stuff
[14:31:26] <hedonil>	 scfc_de: as said, tweaking some default settings. so far I'm at: -3- processes, -0- workers, -2- fcgi-childs
[14:31:38] <hedonil>	 looks like this: http://tools.wmflabs.org/xtools/server-statistics
[14:32:15] * hedonil  continues writing his thoughts to wikitech
[14:55:07] <Coren>	 hedonil: Moar Data!
[14:55:20] <hedonil>	 Coren: scale to PROD !
[14:55:29] <hedonil>	 :-D
[14:57:41] <wikibugs>	 Wikimedia Labs / tools: Soften qdel behaviour from KILL - https://bugzilla.wikimedia.org/61102#c3 (metatron) Concerning (non) termination of php-cgi processes:  http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModFastCGI  There is an option "kill-signal" in .lighttpd fcgi settings.  "kill-signal...
[15:05:52] <hedonil>	 Krinkle: thanks.
[15:34:41] <wikibugs>	 Wikimedia Labs / tools: Soften qdel behaviour from KILL - https://bugzilla.wikimedia.org/61102#c4 (Tim Landscheidt) The program flow is different at the moment: On qdel, SGE kills the master lighttpd process with SIGKILL.  Thus, lighttpd never has a chance to kill the php-cgi processes.  So kill-signal...
[15:40:20] <valhallasw>	 scfc_de: why are the worker processes not subprocesses of lighttpd? that would get them killed immediately
[15:41:05] <valhallasw>	 (I never get how process dependencies work in unix environments)
[15:43:04] * hedonil  tries to get answers from #lighttpd
[15:47:00] <hedonil>	 scfc_de: Hmm. I think this has crucial relevance. Either lighty is not signaled correctly (from grid or human), or it doesn't signal its subs correctly
[15:47:54] <scfc_de>	 valhallasw: They are?!  Take a look for example on tools-webgrid-01 at tools.disclaim (pid 1909ff.)  But the php-cgi processes are grandchildren of the master lighttpd process.  I don't know if they don't get killed because of that, or lighttpd worker processes just die when their parent dies.
[15:48:16] <valhallasw>	 huh. that's really weird :|
[15:48:37] <scfc_de>	 hedonil: As I wrote on the bug, lighttpd gets SIGKILL.  I can't do anything, because it is dead as soon as that signal is sent.
[15:56:18] <hedonil>	 scfc_de: valhallasw: $ ssh tools-webgrid-tomcat 'pstree -cp 21426'  
[15:56:49] <hedonil>	 shows the current tree of my test-instance, no workers (as some threads suggested)
[15:57:50] <hedonil>	 configure: 3 processes (proc) with (each) 2 fcgi-children
[16:00:24] <scfc_de>	 hedonil: You mean server.max-worker = 0 as per http://redmine.lighttpd.net/projects/1/wiki/Server_max-workerDetails ?
[16:00:35] <hedonil>	 scfc_de: yep
[16:01:26] <Krinkle>	 SSL certificate of noc.wikimedia.org is invalid according to curl on a labs instance
[16:01:59] <valhallasw>	 Krinkle: what's the error? Invalid name or invalid issuer?
[16:02:08] <Krinkle>	 curl: (60) SSL certificate problem, verify that the CA cert is OK. Details:
[16:02:08] <Krinkle>	 error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
[16:02:31] <scfc_de>	 Krinkle: http://www.sslshopper.com/ssl-checker.html#hostname=noc.wikimedia.org suggests that's a problem with noc.
[16:02:47] <scfc_de>	 File a bug and assign it to ... RobH?
[16:03:36] <valhallasw>	 should not be an issue for a somewhat-recent ubuntu, I'd think...
[16:03:43] <wikibugs>	 Wikimedia Labs / tools: WMFLabs: bin/curl says noc.wikimedia.org has invalid certificate - https://bugzilla.wikimedia.org/64483 (Krinkle) NEW p:Unprio s:normal a:Marc A. Pelletier [16:00 UTC] cvn-apache5.eqiad.wmflabs$ bin/update + curl https://noc.wikimedia.org/conf/all.dblist   % Total...
[16:04:56] <wikibugs>	 Wikimedia Labs / tools: WMFLabs: bin/curl says noc.wikimedia.org has invalid certificate - https://bugzilla.wikimedia.org/64483 (Krinkle)
[16:07:50] <scfc_de>	 The intermediate certificate seems to be missing.
[16:33:39] <tonythomas>	 Hi! I couldn't find the default $wgSMTP settings defined in mediawiki/LocalSettings.php or the orig/ in my lab instance provisioned with the default mediawiki role 
[16:33:47] <tonythomas>	 its able to send mails though ! 
[16:41:25] <tonythomas>	 Nemo_bis:  Hi ! you around ?
[16:46:51] <scfc_de>	 tonythomas: Have you tried grepping the directories for that?  IIRC MediaWiki also has eval.php which loads *Settings.php and then gives you a shell or so to evaluate PHP expressions (for example $wgSMTP).
[16:51:27] <tonythomas>	 scfc_de: I find $wgSMTP = false in includes/DefaultSettings.php 
[16:56:43] <tonythomas>	 scfc_de: looks like I will have to add my own SMTP settings to override the PHP mail sending 
[17:00:00] <RPiotrowski>	 @seen petan
[17:00:01] <wm-bot>	 RPiotrowski: Last time I saw petan they were talking in the channel, they are still in the channel #huggle at 4/25/2014 3:40:02 PM (1d1h19m58s ago)
[17:01:45] <RPiotrowski>	 jeremyb online?
[17:24:26] <tonythomas>	 Coren: can you create a new user named 'wiki' in my instance 'box1verp' ? I don't have the privileges, I think 
[17:26:47] <scfc_de>	 tonythomas: Aren't you root on that?  Doesn't "sudo -i" give you a root prompt?
[17:28:22] <tonythomas>	 https://www.irccloud.com/pastebin/YgI8IkN8
[17:28:32] <tonythomas>	 scfc_de: I get these errors 
[17:31:02] <RPiotrowski>	 ummn, irccloud :/
[17:31:58] <tonythomas>	 RPiotrowski:  irccloud :) 
[17:32:22] <RPiotrowski>	 tonythomas: irccloud = no privacy :(
[17:33:02] <tonythomas>	 RPiotrowski: but, bit ease with getting logged in and start. Which one do you prefer ?
[17:33:40] <RPiotrowski>	 pidgin, thunderbird
[17:35:00] <tonythomas>	 right ! anyway, not the issue here though :) Any thoughts on getting the user added ?
[17:44:56] <wikibugs>	 Wikimedia Labs / tools: tools.wmflabs.org inaccessible via labs instances - https://bugzilla.wikimedia.org/54052#c12 (DrTrigon) (In reply to Tim Landscheidt from comment #11) > You can easily work around that by using tools-webproxy as Coren wrote in > comment #1.  I don't understand how this should so...
[17:46:46] <hedonil>	 Coren: here's my customized default (so far)   https://tools.wmflabs.org/paste/view/c54807f0
[17:46:56] <wikibugs>	 Wikimedia Labs / tools: WMFLabs: bin/curl says noc.wikimedia.org has invalid certificate - https://bugzilla.wikimedia.org/64483#c1 (Tim Landscheidt) IMHO this is not a problem with Labs, but noc.wikimedia.org; cf. for example http://www.sslshopper.com/ssl-checker.html#hostname=noc.wikimedia.org.  It sh...
[17:47:20] <hedonil>	 runs xtools since ~10:00 am
[17:47:28] <hedonil>	 without restart
[17:48:07] <hedonil>	 which is possibly record-breaking. hehe
[17:48:24] * hedonil  knocks on wood
[17:49:23] <hedonil>	 but the most important thing is, it respawned 1 died backend automatically (as it should be)  http://tools.wmflabs.org/xtools/server-statistics
[17:50:28] <hedonil>	 (the other died php-cgi  one was killed by hand, to see if it respawns) it did
[17:51:49] <scfc_de>	 tonythomas: The problem is that /home resides on a NFS server and this needs to be made aware of system users as well.  So you were right; Coren needs to add that user, but not on your local instance, but in LDAP, the central directory of users that the NFS server queries.  Alternatively, you could create that user with a local home directory outside of /home.  But Coren's usually pretty fast in adding those users to LDAP.
[17:52:44] <hedonil>	 Coren: and if grid engine sends now SIGTERM to lighttpd, SIGINT is forwarded to php-cgi's and all is terminated
[17:53:31] <scfc_de>	 hedonil: If SGE would send SIGTERM, the php-cgis are terminated in the default configuration as well.
[17:54:06] <scfc_de>	 *default = existing
[17:54:25] <hedonil>	 scfc_de: yes, a little inaccurate. IF SGE /would/ send SIGTERM now...
[18:05:21] <scfc_de>	 hedonil: No, I mean that's not a change compared to the existing configuration.  Or in other words: Where's the beef? :-)  I'd like to avoid voodoo debugging: "Use this config, 'cause it's *better*!"  We need to determine problems, then the causes for that and how to prevent them from happening again.
[18:06:13] <hedonil>	 scfc_de: take a look at the comments ;) it's written there
[18:14:01] <hedonil>	 scfc_de: btw. this kind of collaboration isn't called voodoo - it's called constructive suggestion (a - until now - working one) :P
[18:18:47] <scfc_de>	 hedonil: I know; but for example I'm pretty sure I set PHP_FCGI_MAX_REQUESTS to 500 some time ago.  Re max-procs, are we running out of memory?  server.max-connections is 20 (IIRC) at the moment, is it sensible to multiply it by 50 (!), but add a comment "(default = 1024) (=~max-fds/2) _enough_"?  Re kill-signal I already wrote something about that.  server.max-worker probably makes sense.  Biggest change (and probably previously a
[18:18:47] <scfc_de>	 cause for blockages): Reducing the keep-alive stuff to defaults.
[18:20:03] <hedonil>	 scfc_de: hehe. calm down
[18:21:52] <hedonil>	 scfc_de: you can setup a testinstance and load it with a bot, then you can see and don't have to guess
[18:23:14] <hedonil>	 that's what I'm doing right now (step 2) -- testing the newly acquired knowledge... 
[18:32:11] <wikibugs>	 Wikimedia Labs / Infrastructure: serve a cert chain with dynamic proxy SSL certificate - https://bugzilla.wikimedia.org/60833#c10 (FunPika) PAT>RES/FIX Firefox isn't showing it as invalid now with a fresh profile, and http://www.sslshopper.com/ssl-checker.html#hostname=fastcci1.wmflabs.org is sho...
[18:33:43] <scfc_de>	 hedonil: Okay, I'll drop calling it voodoo debugging and switch to "500 monkeys" instead :-).  There are probably an infinite number of lighttpd configurations that "work".  So trial and error isn't a useful approach.  You pointed out that our current configuration regarding keep-alive looks broken in the sense that users can apparently DOS a tool accidentally.  So my suggestion would be to change that and see if it improves availability.
[18:35:14] <hedonil>	 scfc_de: you seem to be stressed, that's no good
[18:35:27] <hedonil>	 let's get back to normality
[18:36:13] <hedonil>	 there are some more issues eg. xtools -> recurrent overload, or wikihistory -> recurrent http 500
[18:37:52] <hedonil>	 but the number of requests are far from being that much to cause such an outage on resources
[18:38:33] <hedonil>	 if you have eg. only /one/ backend = main process and this one gets clogged, there's nothing left to switch over
[18:39:17] <hedonil>	 so no.1: proc = 1  and idle timeout (maybe) too long
[18:40:17] <hedonil>	 no 2. if signaling is not configured (or doesn't work properly) no respawn of that dead process, which is no 3.
[18:40:41] <scfc_de>	 (1.) But that's not dependent on the number of procs, but on the number of php-cgi processes.
[18:40:59] <hedonil>	 that's your guess
[18:41:31] <hedonil>	 but wikihistory fails permanently, that's a fact, xtools, too, also a fact
[18:43:07] <hedonil>	 scfc_de: and as you may see in the stats page I posted previously, died is counted per backend and not per child process
[18:43:32] <hedonil>	 ...but that's all in the docs, and the links I've sent
[18:52:53] <scfc_de>	 I'm not saying that the current configuration is working and error-free.  But I've been around long enough to be *very* wary when I see somewhere: "Fixed", but no information about how.  I have no clue about whether that ".died" value has any relevance.
[18:53:26] <wikibugs>	 Wikimedia Labs / tools: tools.wmflabs.org inaccessible via labs instances - https://bugzilla.wikimedia.org/54052#c13 (Merlijn van Deen) (In reply to DrTrigon from comment #12) > I don't understand how this should solve the issue? Could you please explain?  Connect to tools-webproxy (i.e. using the inte...
[18:54:36] <hedonil>	 scfc_de: no one said fixed. And if you have no clue rtfm
[19:07:11] <scfc_de>	 hedonil: No problem.  Next time you don't know what arguments pkill expects, I'll leave it at "RTFM".
[19:08:24] <valhallasw>	 chill, both of you :-p
[19:11:37] * hedonil  sends ☯ to scfc_de and notes, that he's a good guy. It's only a hobby, nobody will die, if a php-cgi doesn't work !
[19:11:46] <hedonil>	 scfc_de: let's forget about this
[19:21:09] <rohit-dua>	 how do i commit changes to my project to gerrit, and assign them to NOT review (as in just minor changes.)
[19:21:59] <valhallasw>	 rohit-dua: just +2 the changes immediately after pushing?
[19:22:50] <valhallasw>	 I'm not sure if there is a way to auto-+2 (although you could just push directly, completely skipping gerrit, but that's not something you should do)
[19:25:19] <grrrit-wm>	 (CR) 8ohit.dua: [V: 2] coming_soon [labs/tools/bub] - https://gerrit.wikimedia.org/r/129709 (owner: 8ohit.dua)
[19:27:52] <rohit-dua>	 is it possible to +2 multiple files.. (not 1 by 1)
[19:34:22] <valhallasw>	 rohit-dua: multiple changesets you mean? Well, you can open multiple tabs
[19:34:33] <valhallasw>	 and it should be possible by ssh'ing to gerrit
[20:11:58] <hedonil>	 Coren: and to make the running configs more visible:  https://tools.wmflabs.org/paste/view/4fce5753
[20:13:59] <hedonil>	 all serve mostly php content, less static content
[20:28:42] <wikibugs>	 Wikimedia Labs / tools: tools.wmflabs.org inaccessible via labs instances - https://bugzilla.wikimedia.org/54052#c14 (DrTrigon) subster.py takes user-defined urls and retrieves their content - how can I tell pywikibot to generally use tools-webproxy instead of tools.wmflabs.org?  According to [1] it wor...
[20:37:11] <wikibugs>	 Wikimedia Labs / tools: tools.wmflabs.org inaccessible via labs instances - https://bugzilla.wikimedia.org/54052#c15 (Merlijn van Deen) tools-webproxy is not a proxy server you should use, but it's the internal address for tools.wmflabs.org.   However, this does show there is a clear need for tools.wmf...
[20:41:27] <wikibugs>	 Wikimedia Labs / tools: tools.wmflabs.org inaccessible via labs instances - https://bugzilla.wikimedia.org/54052#c16 (metatron) +1  If OpenStack has (DNS)-related limits here, maybe a hosts-entry can fix this issue. At least on Grid-/Exec-nodes and both Bastions.
[22:56:22] <a930913>	 Is the redis still b0rked?
[23:07:56] <scfc_de>	 a930913: It shouldn't.  What's your experience?
[23:16:57] <a930913>	 scfc_de: Something is broken on the way to some of my stuff which uses redis.
[23:18:08] <a930913>	 And my head aches from a headache, so I'm not in a mood for investigating :(
[23:21:25] <scfc_de>	 The Redis server is running and has 63G free space, so it shouldn't clog up any time soon :-).
[23:23:05] <scfc_de>	 (Though for me headaches usually mean abstaining from anything brainy and instead lying down for a nap.)
[23:24:56] <a930913>	 scfc_de: My thoughts too.