[02:22:24] hi there
[05:21:26] Wikimedia Labs / deployment-prep (beta): Make use of twemproxy - https://bugzilla.wikimedia.org/62836#c3 (Andre Klapper) Reedy: ping comment 2?
[07:58:56] Wikimedia Labs / deployment-prep (beta): Make use of twemproxy - https://bugzilla.wikimedia.org/62836#c4 (Gerrit Notification Bot) Change 129641 had a related patch set uploaded by Reedy: WIP: Initial twemproxy configs for labs https://gerrit.wikimedia.org/r/129641
[08:11:56] Wikimedia Labs / deployment-prep (beta): Make use of twemproxy - https://bugzilla.wikimedia.org/62836#c5 (Sam Reed (reedy)) Need variance for -labs, AND for hhvm... Using /etc/wikimedia-realm For which multiversion/MWRealm.sh exists! Though, I suspect it won't run 2 versions of twemproxy without some...
[08:22:56] Wikimedia Labs / deployment-prep (beta): Make use of twemproxy - https://bugzilla.wikimedia.org/62836#c6 (Gerrit Notification Bot) Change 129644 had a related patch set uploaded by Reedy: Variable twemproxy config location https://gerrit.wikimedia.org/r/129644
[09:10:41] Wikimedia Labs / deployment-prep (beta): Make use of twemproxy - https://bugzilla.wikimedia.org/62836#c7 (Gerrit Notification Bot) Change 129644 merged by Alexandros Kosiaris: Vary twemproxy config location based on getRealmSpecificFilename() https://gerrit.wikimedia.org/r/129644
[09:41:26] Wikimedia Labs / deployment-prep (beta): Make use of twemproxy - https://bugzilla.wikimedia.org/62836#c8 (Gerrit Notification Bot) Change 129641 merged by jenkins-bot: Initial twemproxy configs for labs https://gerrit.wikimedia.org/r/129641
[10:14:47] hashar: should dologmsg work from beta?
[10:15:03] Reedy: no idea
[10:15:09] I don't think it ever got configured
[10:15:30] guess that explains why it seems to hang :)
[10:15:36] the script is present on deployment-bastion
[10:15:50] and it does: echo "$*" | nc -q0 neon.wikimedia.org 9200
[10:15:51] :]
[10:16:02] hahah
[10:16:08] we can either use our own (and then adjust all the callers)
[10:16:25] or adjust the dologmsg script to detect it is on beta and thus use a different host
[10:16:30] but I don't think we have any IRC relay
[10:16:33] replace neon with labs nagios?
[10:16:37] in that nc
[10:16:47] for log to irc?
[10:16:48] I would rather not couple beta with the nagios project :]
[10:17:14] just saying.. if we had a labs host that is like neon.. puppet wise
[10:17:25] that would make some things easier
[10:17:36] and there was some work on that
[10:17:41] potentially we could get whatever puppet class is applied on neon to setup the irc relay
[10:17:45] it shouldn't take too much to hold the relay
[10:17:45] and apply it on the deployment-bastion
[10:18:01] then have dologmsg send to the beta bastion instance whenever it is running on labs
[10:18:29] ah, i see
[10:18:32] is it puppetised? :D
[10:18:38] it's ugly, directly in site.pp
[10:18:40] the message will have to be prefixed with !log deployment-prep and sent to #wikimedia-labs
[10:18:43] $ircecho_logs = { '/var/log/icinga/irc.log' => '#wikimedia-operations' }
[10:18:44] hehe
[10:18:47] time for refactoring!
[10:18:57] though I think Ori wrote some tcpircbot class somewhere
[10:18:59] 1849 include role::echoirc
[10:19:04] there, at least it's a role
[10:19:19] i could likely do that
[10:19:32] go go go!!!
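
The dologmsg discussed above is essentially a one-liner that pipes the message to the IRC relay on neon with nc. The idea floated here — detecting the realm via /etc/wikimedia-realm and targeting a relay on the beta bastion instead — could look roughly like the sketch below; the beta relay host and the reuse of port 9200 are assumptions for illustration, not what was actually deployed.

  #!/bin/bash
  # dologmsg sketch: pick the log relay host based on /etc/wikimedia-realm
  realm=$(cat /etc/wikimedia-realm 2>/dev/null || echo production)
  if [ "$realm" = "labs" ]; then
      relay=deployment-bastion.eqiad.wmflabs   # hypothetical beta relay
  else
      relay=neon.wikimedia.org                 # production relay from the log
  fi
  echo "$*" | nc -q0 "$relay" 9200
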
[10:19:43] yes, that is the thing
[10:19:49] there is also tcpircbot
[10:19:53] and it already has monitoring
[10:20:05] ircecho seems to already be part of deployment-prep
[10:20:23] oh, no, it's global
[10:20:45] it should not set those variables in site.pp .. step 1
[10:21:04] i can move it to the role and add a $realm case
[10:24:54] nothing is ever straightforward!
[10:24:58] please! :)
[10:33:08] don't you like how role::echoirc includes ircecho :)
[10:33:20] brb, had to check one thing for odder, making change now
[10:33:27] echoircing
[10:33:30] ircechoing
[10:35:29] !echo is echo "echoirc includes ircecho and monitors ircecho is running"
[10:35:30] Key was added
[10:36:27] o-O
[10:47:26] Wikimedia Labs / deployment-prep (beta): Make use of twemproxy - https://bugzilla.wikimedia.org/62836#c9 (Gerrit Notification Bot) Change 129663 had a related patch set uploaded by Reedy: Vary twemproxy config location based on getRealmSpecificFilename() (take 2) https://gerrit.wikimedia.org/r/129663
[10:56:11] Reedy: https://gerrit.wikimedia.org/r/#/c/129664/1/manifests/role/echoirc.pp
[10:58:30] Yay
[10:59:43] i just see so many other puppet fails on neon :p
[10:59:54] unrelated.. but the closer you look ...
[11:26:49] Hi
[11:27:20] can anyone check why I (or hashar) am not able to login to the language-browsertest instance?
[11:27:32] instance is running, I also rebooted.
[11:27:58] mutante: any labs knowledge by any chance ? :D
[11:28:04] language-browsertests
[11:28:21] the instance language-browsertests.eqiad.wmflabs does not let us in although other instances work just fine
[11:28:26] might be puppet outdated there though
[11:29:10] hashar, kart_: gimme a minute
[11:29:17] and i'll look
[11:29:21] \O/
[11:29:22] need to add myself
[11:30:28] !log deployment-prep Authentication is broken on the beta cluster. Well at least from commons.wikimedia.beta.wmflabs.org
[11:30:30] Logged the message, Master
[11:31:13] need my phone for 2factor auth. but it's down.. waiting for it to charge 1% :p
[11:31:17] !log deployment-prep commonswiki-75388f96: 0.6183 19.5M SQL ERROR (ignored): Table 'commonswiki.revtag_type' doesn't exist (10.68.16.193)
[11:31:18] Logged the message, Master
[11:34:52] hashar: kart_: "language-browsertests" is shown as type "pmtpa-2" to me
[11:35:15] i'm not sure though if that means it can't be .eqiad.wmflabs
[11:35:21] sounds pmtpa'ish though
[11:35:42] language-mleb.eqiad.wmflabs is the same
[11:35:44] but we can log on it
[11:35:57] I guess they are copy pasted instances that use some old nova profiles
[11:35:59] hmm, indeed, it is just a name for the type of instance
[11:36:14] yes
[11:36:17] https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000265.eqiad.wmflabs
[11:36:24] this is the one you want , right
[11:36:35] virt1006, nods
[11:36:57] yes
[11:37:44] heh, i never saw that one before
[11:37:49] "You cannot complete the action requested as your user account is not in the project language."
[11:38:01] still trying
[11:39:06] !log language added self to project
[11:39:07] Logged the message, Master
[11:39:16] !log language made hashar and dzahn project admins
[11:39:17] Logged the message, Master
[11:45:04] Reedy: still around ?
[11:45:13] Reedy: I cannot login on the beta cluster and I have no idea what is going on
[11:46:13] hashar: i also can't login to that.. and i don't see why yet
[11:46:21] mutante: :-(
[11:48:25] i looked at console log for the instance on virt1006 itself.. but no clues
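
A quick way to triage "instance is running but we can't get in", before escalating, is to check from a bastion whether sshd is even answering and whether the client is actually offering a key — the same ssh -v approach suggested later in this log for slevinski. The hostname below is the one from this conversation; the grep is just a convenience to pull out the interesting debug lines.

  # Is the SSH port reachable at all?
  nc -z -w 5 language-browsertests.eqiad.wmflabs 22 && echo "sshd port open"
  # Does our client offer a key, and what does the server answer?
  ssh -v language-browsertests.eqiad.wmflabs true 2>&1 | grep -Ei 'offering|publickey|denied'
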
[11:51:26] wonders if that load on virt1006 is normal
[11:53:56] kart_: afraid i dunno what it is and i'd have to escalate to Coren and andrewbogott_afk
[11:56:20] hashar: Logon to a wiki on the beta cluster?
[11:56:32] Are sessions in memcached?
[11:56:44] Reedy: can't remember :-/
[11:57:01] $wgObjectCaches['sessions'] = $wgObjectCaches["beta-memcached-{$wmfDatacenter}"];
[11:57:02] Yup
[11:57:17] It's trying to use the twemproxy config
[11:57:17] should I just restart memcached so ?
[11:57:18] :D
[11:57:21] nope
[11:57:22] ahhhh
[11:57:25] wonderful !
[11:57:33] I was hoping we'd get the fix into puppet
[11:57:35] let me fix the config
[12:02:29] mutante: thanks. no issue.
[12:03:09] mutante: ping me anytime, if you need me :)
[12:04:02] Reedy: I am out to get some groceries, will be back soon though
[12:20:41] Reedy: looks like I can login again :]
[12:20:50] yay
[12:21:17] hopefully we get twemproxy back in again
[12:21:18] you are my daily hero!
[12:22:48] Reedy: whenever the puppet change is reviewed yeah
[12:22:58] though we can well cherry pick it on the beta cluster puppet master to try it out
[12:44:48] does anyone here remember having this irc bot called
[12:45:00] labs-storage-wm
[13:05:42] hashar: Still have issues with language-browsertests?
[13:05:51] kart_: ^
[13:06:02] Coren: that was kart_ :-] I am assuming that is still the case :D
[13:06:13] yeah confirmed.
[13:06:18] Permission denied (publickey).
[13:06:28] Oh! I see what the issue is. That's a migrated instance that still has its autofs.
[13:06:29] i added myself to that project earlier and got the same, ack
[13:06:34] I can connect on language-mleb.eqiad.wmflabs
[13:06:40] noticed the "type pmtpa2"
[13:06:41] heh
[13:08:14] mutante: For future ref; if you ever see an instance that shows 'No such files' in a df on every mount, that's the issue. An 'apt-get purge autofs5' fixes it.
[13:08:26] Wikimedia Labs / tools: user_password_expires column is missing - https://bugzilla.wikimedia.org/64369 (Sam Reed (reedy)) p:Unprio>Normal s:normal>minor
[13:10:15] !log language purged autofs from language-browsertests and rebooted
[13:10:17] Logged the message, Master
[13:10:29] kart_: You should be all set on that instance now.
[13:11:59] Coren: good to know! thanks. and you get on the instance itself as root user and that home is local?
[13:12:28] mutante: Exactly.
[13:13:48] Coren: i think my key isn't in root authorized_keys
[13:13:58] i can login as my regular user but not as root
[13:14:50] mutante: It'd be a good thing for it to be; but we don't use our prod keys for this.
[13:16:20] Coren: yes, i'm aware, i keep re- and unloading prod vs. labs key
[13:17:02] mutante: If you want, add your key to files/ssh/root-authorized-keys in labs/private
[13:17:35] Coren: alright, that makes things easier, i kept adding myself to projects just to login to check something etc
[13:18:24] It's easier indeed that way, and also that still works even if the box is mostly hosed.
[13:18:43] very helpful to debug those things.. /me nods
[13:19:51] Coren: and here's the change to remove that unused bot https://gerrit.wikimedia.org/r/#/c/129681/1
[13:21:20] mutante: Merged and pushed.
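
Coren's autofs diagnosis above boils down to a couple of commands on the migrated instance itself. A rough sketch, run as root, assuming the symptom and fix exactly as he describes them ('No such file' errors on every automounted path, purge autofs5, reboot):

  # Symptom check on a migrated instance: automounted paths erroring out in df
  df 2>&1 | grep -q 'No such file' && echo "stale autofs mounts detected"
  # Fix per the discussion above
  apt-get purge -y autofs5
  reboot
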
[13:21:46] tyvm :) if it would have been up, i would have broken it with another change earlier today :)
[13:29:28] (PS1) Dzahn: add myself to root on labs instances [labs/private] - https://gerrit.wikimedia.org/r/129684
[13:31:25] (CR) Dzahn: [C: 2 V: 2] "this is my labs key" [labs/private] - https://gerrit.wikimedia.org/r/129684 (owner: Dzahn)
[13:39:05] Coren: mutante hasharAway: Thanks!
[14:01:58] i just got my own Repository for my GSoC project. how do i commit files to it?
[14:04:47] rohit-dua: you have the git review tool yet?
[14:05:13] rohit-dua: https://www.mediawiki.org/wiki/Git/git-review#Installation
[14:06:09] Coren: any idea what the private instance IP for the shared labs proxy is? I got 208.80.155.156 as the public address
[14:06:22] but could use the internal IP to rewrite some packets coming from labs
[14:06:41] (use case, an instance hitting the DNS entry and being unable to communicate with the NAT public IP)
[14:09:23] mutante: i did install git-review. how to connect to wikimedia (repository) using it. and thank you
[14:11:02] yuvipanda: around still ? :]
[14:11:11] hashar: yessir :D
[14:11:16] hashar: it needs a rebase, jenkins says
[14:11:18] rohit-dua: what is the name of your repository? https://wikitech.wikimedia.org/wiki/Git#Git.2FGerrit_and_the_repositories
[14:11:30] yuvipanda: are you in charge of the dynamic proxy on labs?
[14:11:37] hashar: I wrote it, yeah
[14:11:49] I am trying to reach http://language-browsertests.wmflabs.org/ from another instance
[14:12:17] hashar: ah.
[14:12:19] yuvipanda: it has the public IP 208.80.155.156 I am wondering whether that is the shared proxy :]
[14:12:32] mutante: labs/tools/bub
[14:12:36] hashar: it is
[14:12:46] if it is, I can use iptables to rewrite the destination ip from 208.80.155.156 to whatever the 10.0.0.0 IP is :] but I don't know which IP it is hehe
[14:13:00] hashar: ah. I can manually find that for you now.
[14:13:09] \O/
[14:13:26] hashar: can you merge https://gerrit.wikimedia.org/r/#/c/126000/ for me? :D
[14:13:38] hop I can't merge it
[14:13:41] NOP
[14:13:42] sorry
[14:13:46] hashar: oh. who can?
[14:13:51] aahh and gotta rebase bah
[14:13:51] hashar: any other opsen?
[14:13:54] rohit-dua: git clone https://gerrit.wikimedia.org/r/p/labs/tools/bub
[14:13:59] hashar: or anyone in particular?
[14:14:02] yuvipanda: any opsen. I am not ops! :]
[14:14:12] yuvipanda: The proxy is just an instance in Project-proxy?
[14:14:18] scfc_de: yeah
[14:14:20] I only bother getting them merged once a month or so.
[14:14:35] yuvipanda: 10.68.16.65 dynamic-proxy ? :]
[14:15:06] rohit-dua: and then as your first commit you need to push a .gitreview file
[14:16:00] rohit-dua: you can copy that from other repos.. ..
[14:16:01] hashar: language-browsertests.eqiad.wmflabs
[14:16:07] hashar: is where it points back to
[14:16:12] yuvipanda: yeah yeah
[14:16:28] hashar: 10.68.16.251
[14:16:30] yuvipanda: I am willing to know what the instance being 208.80.155.156 is :D
[14:16:44] hashar: aah!
[14:16:48] cause language-stage.wmflabs.org points to that same public IP
[14:16:54] so I figured out it must be the shared proxy
[14:17:05] hashar: yeah. the internal IP of the dynamic proxy is 10.68.16.65 yes
[14:17:12] I found out 10.68.16.65 by hacking around, was willing to confirm
[14:17:13] great!
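
hashar's workaround for an instance that cannot talk to the shared proxy's NAT'd public address is to rewrite the destination on the client instance with iptables. A sketch of the general idea, using the addresses established above (208.80.155.156 public, 10.68.16.65 internal); the actual rule in the gerrit change linked just below may differ.

  # Run as root on the instance that cannot reach the public IP:
  # rewrite locally generated packets aimed at the proxy's public address
  # so they go to its internal address instead.
  iptables -t nat -A OUTPUT -d 208.80.155.156 -j DNAT --to-destination 10.68.16.65
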
[14:17:18] yuvipanda: you are saving my day :]
[14:17:39] hashar: it is dynamicproxy-gateway.eqiad.wmflabs, so dig should tell you
[14:17:41] after that
[14:18:44] hugeee hack
[14:18:45] https://gerrit.wikimedia.org/r/129687 :D
[14:19:05] hehe :D
[14:19:15] mutante: thank you. and the .gitreview file is already present.
[14:21:17] yuvipanda: I will rebase the other change
[14:21:30] yuvipanda: it is probably very trivial and caused by site.pp that got edited meanwhile
[14:22:16] hashar: it might be jenkins fuckage. I tried git rebase gerrit/production and it said it is up to date
[14:22:27] yeah
[14:22:34] because it is not trivial enough for Gerrit
[14:24:14] rohit-dua: great, then all you need is to add some files, git commit -a, git review
[14:25:08] yuvipanda: just push the rebased change to gerrit :]
[14:25:20] yuvipanda: then you need to ask some opsen to review / merge it
[14:25:55] hashar: yeah, I will do that in a bit!
[14:25:57] hashar: thanks for the patch :D
[14:32:10] yuvipanda: Will you be rewriting https://gerrit.wikimedia.org/r/#/c/125241/ as well to make use of hashar's patch?
[14:32:53] scfc_de: I don't actually know. if I get it all done on contint I don't know if I'll need it on tools
[14:33:29] yuvipanda: k
[14:37:37] mutante: i did git review.. and it gives command failed : .git/hooks/commit-msg: No such file or directory
[14:38:15] scfc_de: yuvipanda: please please only define the android sdk dependencies in a single place :]
[14:38:33] if you get https://gerrit.wikimedia.org/r/#/c/126000/ merged in, that will make it quite easy to maintain them
[14:39:34] rohit-dua: try this: wget -P .git/hooks https://gerrit.wikimedia.org/r/tools/hooks/commit-msg
[14:39:53] rohit-dua: download that file from the above URL and put it into .git/hooks in your home
[14:41:05] hashar: Yeah, sorry if it wasn't clear: If we would need Android SDK dependencies on Tools, your patch would be the way to go.
[14:41:33] yeah, agreed
[14:41:39] scfc_de: make sure to +1 it / comment on it :]
[14:41:39] !hook is .git/hooks/commit-msg: No such file or directory? -> download it from https://gerrit.wikimedia.org/r/tools/hooks/commit-msg
[14:41:40] Key was added
[14:42:02] scfc_de: your concern about requiring ensure => latest was totally valid so I took it into account :]
[14:42:29] ensure => latest on PHP used to be .. ..surprising
[14:42:41] Wikimedia Labs / tools: Provide namespace IDs and names in the databases similar to toolserver.namespace - https://bugzilla.wikimedia.org/48625#c31 (nosy) Did the first steps - most of the scripts already run and the dbs get filled. I still dont know how to test the data in regard of being valid. I'd b...
[14:46:09] scfc_de: hashar mutante https://gist.github.com/yuvipanda/10408391 sets up an android sdk unattended. I do not think ops will be happy with putting that in puppet (retrieving from external google URL). Thoughts on how else to do it?
[14:46:26] even if a package exists, the actual API eventually comes from google itself
[14:49:28] I think there are lower standards for Labs :-).
[14:49:38] I think contint installs git-review via pip as well.
[14:50:32] hashar: Where does package libswt-gtk-3.5-java come from?
[14:52:25] mutante: is the username the same for gerrit review as when we sign in to https://gerrit.wikimedia.org, because i get auth failed for the same username/password
[14:53:34] rohit-dua: Do you mean the URL à la ssh://scfc@gerrit.wikimedia.org:29418/operations/puppet.git? You need to use your shell username for that.
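
The gerrit-over-SSH setup being walked through for rohit-dua condenses to a few commands. A sketch run inside the clone of labs/tools/bub; the shell username is a placeholder (it is whatever wikitech's Special:Preferences lists, as explained further down), and the "gerrit" remote mirrors mutante's suggestion rather than the variant rohit-dua ended up using.

  USERNAME="your-shell-username"   # placeholder: your wikitech shell name
  git remote add gerrit "ssh://$USERNAME@gerrit.wikimedia.org:29418/labs/tools/bub.git"
  # fetch the Change-Id hook that git review complained about
  wget -P .git/hooks https://gerrit.wikimedia.org/r/tools/hooks/commit-msg
  chmod +x .git/hooks/commit-msg
  # then the usual cycle
  git commit -a && git review
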
[14:56:34] !log language set user email of Selenium_user to MY email to reset the password to the one used on the beta cluster
[14:56:36] Logged the message, Master
[15:01:22] scfc_de: can we puppetize it by just having puppet run the file? :D
[15:04:08] yuvipanda: You would need something meaningful as "onlyif =>" :-).
[15:04:36] rohit-dua: yes, same user name, but you should not be asked for a password, it should use ssh keys instead
[15:05:08] rohit-dua: upload your ssh key here https://gerrit.wikimedia.org/r/#/settings/ssh-keys
[15:08:58] mutante: thank you, but i already did upload my ssh public key. it asks for a password, and doesn't accept the one i use for logging into the site via browser..
[15:10:47] rohit-dua: take a look at the file .git/config in your local clone. does it have a [remote "gerrit"]?
[15:11:42] mutante: it has [remote "origin"]
[15:13:17] rohit-dua: try adding this to the file, below the existing lines
[15:13:20] http://paste.debian.net/95694/
[15:13:25] then try using git review again
[15:13:39] replace "rohitdua" if it's not your actual username
[15:15:10] Doesn't git review treat gerrit and origin as one since some time ago?
[15:16:21] scfc_de: i think the case is this:
[15:16:26] cloned from https
[15:16:35] so the url has https, not ssh
[15:16:48] and when he tries to push with git review it tries https
[15:17:05] unless you have a separate remote or change the existing one, to be ssh
[15:17:06] mutante: Ah, sure.
[15:17:14] and then it asks him for the "http password"
[15:17:24] which he _could_ probably set in gerrit and use to push
[15:17:39] but i would only say that if he was blocked by firewall
[15:17:47] to reach the high ssh port
[15:17:59] and otherwise recommend to use the keys
[15:18:50] but maybe "git review -s" would have created that remote ? i forget
[15:37:26] Wikimedia Labs / tools: Provide documentation: How to use PGAdmin as a frontend - https://bugzilla.wikimedia.org/63380#c1 (Alexandros Kosiaris) NEW>RES/FIX Done. Please have a look at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Configuring_PGAdmin_for_OSM_access
[15:37:57] mutante: sorry for the late reply. but still no luck after adding that to the file. it asks for the password.. :-(
[15:43:22] rohit-dua: sorry, i have a meeting, i need to run.
[15:43:29] can somebody help rohit-dua ?
[15:43:55] rohit-dua: one last thing.. did you ever run "git review -s" in the beginning to do the initial setup?
[15:44:08] gotta talk to you later.. waves
[15:48:56] Wikimedia Labs / tools: Provide namespace IDs and names in the databases similar to toolserver.namespace - https://bugzilla.wikimedia.org/48625#c32 (Marc A. Pelletier) Give me a ring once you are satisfied with the result, I can rename the database to something more mnemonic for you. (Also, unless you...
[16:11:04] how to find the shell-name.. idk if the username is my shellname
[16:18:27] rohit-dua: If you go to wikitech, it's listed on Special:Preferences.
[16:18:58] scfc_de: thank you and is the password same for shell and http?
[16:20:38] For shell there are no passwords. What error message are you getting for what? Did you follow mutante's advice to run "git review -s"?
[16:30:38] scfc_de: yes i did run "git review -s" but no luck. I did everything from the beginning as in https://wikitech.wikimedia.org/wiki/Git#Git.2FGerrit_and_the_repositories.
[16:30:54] scfc_de: error: fatal: Authentication failed for 'https://gerrit.wikimedia.org/r/p/labs/tools/bub/'
[16:31:09] it asks for username and password
[16:31:14] on git review
[16:34:07] (PS1) 8ohit.dua: coming_soon [labs/tools/bub] - https://gerrit.wikimedia.org/r/129709
[16:36:26] scfc_de: update: IT worked! i changed the url for origin in the config file from https to ssh and it worked.
[16:56:40] scfc_de: onlyif is easy enough. It puts the things in a certain path
[17:06:48] rohit-dua: Great!
[17:09:45] yuvipanda: Yeah, but (for perfection :-)) you'll want to make sure that not only a fragment of the installation succeeded. So you'll need to have a "marker" that indicates the installation went through.
[18:52:46] I can ssh to bastion just fine, but I can not access my instance: Permission denied (publickey). I haven't tried since the move to eqiad. The instance is active and I did use the -A to ssh to bastion.
[19:00:24] slevinski: try ssh -v -v
[19:00:38] and see if it works correctly (e.g. if it tries to use the forwarded key)
[19:06:02] no luck. Same error.
[19:06:28] slevinski: yes, but you should now get debug information, which can help solve the issue...
[19:12:09] slevinski: what is the instance and project name?
[19:12:43] project signwriting, instance signwriting-icon-server
[19:13:18] ok… I can (for starters) confirm that that instance is up and running, and my key works there.
[19:15:07] slevinski: try now, while I'm watching the log?
[19:18:33] slevinski: I see attempts to log in as root; is that you?
[19:19:06] should be. Do I need to add the -l option?
[19:19:40] are you root on the box that you're connecting from?
[19:19:43] Or are you using windows?
[19:20:15] Anyway, yes, your username on that box is 'slevinski' so you need to log in as that user, definitely.
[19:20:38] whoami reports user slevinski on bastion. I'm using an iMac. Adding -l slevinski fails as well.
[19:21:55] what does ssh-add -l say?
[19:22:59] The agent has no identities
[19:24:41] ok, that's at least one problem :) Your key isn't getting forwarded to bastion.
[19:25:15] I'd recommend using proxycommand rather than going via bastion. https://wikitech.wikimedia.org/wiki/Access#Accessing_instances_with_ProxyCommand_ssh_option_.28recommended.29
[19:27:14] Thanks for looking. I'll reconfigure and try again.
[20:13:57] Hi. I am user sh3nhu, I am trying to access my instance for the first time using putty and I am not having luck. I am following instructions at https://wikitech.wikimedia.org/wiki/User:Wikinaut/Help:Access_to_instances_with_PuTTY_and_WinSCP#How_to_set_up_PuTTY_for_proxying_through_bastion.wmflabs.org_to_your_instance with no luck
[20:14:27] I don't understand how my key will be used, the instructions have no mention of 'key'
[20:15:51] Can anybody help?
[20:41:04] Well, the instructions presume you have a key pair already setup. Do you know how to use SSH in general?
[20:41:32] The second image also shows where to put your private key
[20:42:00] * Damianz hugs his ssh-agent and native ssh clients while remembering the time he had to use putty
[20:43:01] Damianz: The second image doesn't tell us to fill that out
[20:43:24] We have a key
[20:43:39] on our linux machine, it is stored in our .ssh directory
[20:44:21] Technically it doesn't tell you to fill anything out
[20:44:27] You'll need the private key accessible to your putty as well.
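
The ProxyCommand setup recommended to slevinski above goes into ~/.ssh/config on the local machine, so the key stays local and no agent forwarding is needed. A minimal sketch, assuming the bastion hostname bastion.wmflabs.org and an OpenSSH new enough to have -W; replace slevinski with your own shell username.

  # Append a ProxyCommand stanza for labs instances to ~/.ssh/config
  printf '%s\n' \
    'Host *.eqiad.wmflabs' \
    '    User slevinski' \
    '    ProxyCommand ssh -a -W %h:%p slevinski@bastion.wmflabs.org' \
    >> ~/.ssh/config
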
[20:46:56] Wikimedia Labs / tools: xmllint program available from tools-login but not when running a job - https://bugzilla.wikimedia.org/62944#c3 (Gerrit Notification Bot) Change 120187 merged by coren: Tools: Install package libxml2-utils for xmllint https://gerrit.wikimedia.org/r/120187
[20:47:39] is that id_rsa ?
[20:49:21] Coren: scfc_de: stumbling over this lighty issue "sockets disabled, connection limit reached"
[20:49:53] just saw that in our default server.max-keep-alive-idle = 60
[20:50:02] Guestuser: Yes, but IIRC putty doesn't understand openssh keys by default. PuTTYgen, however, gives you an opportunity to convert it.
[20:50:30] the recommended defaults in redmine look like: server.max-keep-alive-idle = 5
[20:51:18] which means that file descriptors are held 60 seconds idle ?
[20:51:30] hedonil: 5s seems a bit on the short side; but I'm pretty sure that if you reach your connection limit it's not because of the idle keepalive; which only works during the actual client request. Or at least, so it should.
[20:51:56] hedonil: No, that's only the maximum if the client keeps it open.
[20:52:09] Coren: hmm. read all the threads on redmine & co now
[20:52:21] Normal clients close the socket once they're done connecting.
[20:52:26] s/connecting/loading/
[20:53:09] http://redmine.lighttpd.net/projects/1/wiki/Server_max-keep-alive-idleDetails
[20:53:50] if you have many requests, they can't be dropped quickly enough afai understand that
[20:55:28] Coren: but I'll admit it seems to be a complex problem with no obvious solution
[20:55:33] putty wants a .ppk file. I just have my id_rsa file
[20:56:03] Guestuser: Check puttygen, it has an option/button to convert one into the other
[20:57:00] using puttygen, I want to load my id_rsa file, but it doesn't see it!
[20:57:47] Guestuser: If memory serves (I'm not a Windows user) you have a spot where you can /paste/ your key.
[20:58:42] hedonil: Mind you, I could reduce the timeout -- at worst it's just less efficient.
[20:59:13] Coren: a testserver would be great for tweaking ;)
[20:59:25] hedonil: But if you're hitting the connection limit because you have too many actual clients, that won't change squat.
[20:59:54] Coren: yes. many factors to take into account
[21:00:04] it says host does not exist.
[21:00:09] I know it exists
[21:00:30] hedonil: I can override how many server threads your tool gets at need.
[21:00:31] but as other users are suffering from this problem too, there is a silent call for a resolution ;)
[21:01:23] Coren: I don't think (know) if there's a problem with the number of processes
[21:01:57] Coren: I tried this line from redmine, to check current open file descriptors
[21:01:59] # cat /proc/`ps ax | grep lighttpd | grep -v grep | awk -F " " '{print $1}'`/limits | grep "Max open files"
[21:02:23] hedonil: My default values are, by design, fairly conservative and mostly adequate for tools that get a limited amount of traffic (several/min, not several/sec)
[21:02:25] but this line seems to need some adaptations
[21:02:54] Coren: scale near prod :-)
[21:03:08] Coren: your words?
[21:03:13] haha
[21:03:44] hedonil: I'm pretty sure those aren't my words. :-) But like I said, I can tune up the config on individual tools at need; I already did it for a couple that see a lot of use.
[21:04:10] hedonil: It's just the defaults that are conservative; given that anyone can just make a tool.
[21:04:19] Coren: I see
[21:04:58] Guestuser: I'm not sure I can guess how puttygen could give you that error. Sorry I can't help much more, but I haven't used a Windows box (nor putty) in very many years.
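
PuTTY needs the key in its own .ppk format, which is what the conversation above is circling around: an existing OpenSSH id_rsa can be converted in the PuTTYgen GUI (Load the key, then "Save private key"), or on a Linux box with the command-line puttygen from the putty-tools package — a sketch:

  # Convert an OpenSSH private key into PuTTY's .ppk format (needs putty-tools)
  puttygen ~/.ssh/id_rsa -o ~/labs-key.ppk
  # then point PuTTY at labs-key.ppk under Connection -> SSH -> Auth
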
[21:06:25] Coren: can you tell me what's wrong with this cat /proc line?
[21:07:07] jfi it should output something like Max open files 2048 2048 files
[21:08:14] Coren: puttygen wasn't giving us that error. it was putty. We were able to generate the key with puttygen. but now when we try to use putty to get into our instance, it just hangs.
[21:10:25] hedonil: You're getting more than one pid in that `` part, so you're trying to cat something like "/proc/2131 2140 4208 4210 17264 30162 32255/limits"
[21:10:43] Coren: ah. ok
[21:11:40] hedonil: You want something like: for pid in $(ps ax | grep lighttpd | grep -v grep | awk -F " " '{print $1}'); do grep "Max open files" /proc/$pid/limits; done
[21:12:11] Coren: as said, the forum threads list similar problems, but nearly all state that they are far from hitting the limits
[21:12:26] Coren: just want to dig deeper
[21:12:50] Max open files? Of course, that'd be almost impossible to hit. The problem isn't max open files, it's the max number of simultaneous clients lighttpd will /accept/
[21:14:00] Which is "server.max-connections * server.max-worker"
[21:14:42] Ah, no, just server.max-connections; sorry. I forgot lighttpd has global max-connections not per worker. Same idea.
[21:20:07] <^d> MrZ-man: Hey, you about?
[21:20:42] !ping
[21:20:43] !pong
[21:20:44] ok
[21:21:02] * ^d needlessly pings YuviPanda
[21:21:25] * YuviPanda protests outside ^d's house
[21:21:43] <^d> In the rain? Sucks to be you :p
[21:23:55] ^d: :P
[21:30:10] If I understand the max-keep-alive business correctly, in the default configuration five clients that are trying to keep a connection alive can block a tool?
[21:30:35] Coren: I'm tracing php now, checking if this doesn't free up the resources and keeps them piling up
[21:31:09] Coren: scfc_de: if this is the problem, adding more workers/processes doesn't solve it
[21:31:43] docs are talking about server.max-keep-alive-idle = 5, on high load lowered to 4 or 2
[21:31:56] Wikimedia Labs / tools: xmllint program available from tools-login but not when running a job - https://bugzilla.wikimedia.org/62944 (Tim Landscheidt) PAT>RES/FIX
[21:32:04] or even 0 (insane mode)
[21:32:18] but 60 seconds seems high
[21:32:53] *five clients every minute; so no intentional DOS attack.
[21:35:32] scfc_de: if this really means 5 clients per minute max, you could also write a letter :-D
[21:35:59] * hedonil doesn't really know either
[21:36:41] ^d: I'm here now
[21:37:06] <^d> Howdy! I just had a user ping me on my talkpage about the popularpages tool, and how he could get something similar for eswiki.
[21:37:12] scfc_de: speaking of "remaining php processes after lighty stop"
[21:37:13] <^d> Mind if I send him your way since it's your tool?
[21:37:43] sure, other language support is planned, but not currently being worked on
[21:37:46] scfc_de: I tried to terminate them with pkill, which is by default SIGTERM afair
[21:38:05] hedonil: Oh, that's a different issue if you mean that "webservice stop" or "webservice restart" leave those behind blocking requests, cf. ...
[21:38:06] scfc_de: only pkill -9 did the job
[21:38:15] scfc_de: yep
[21:38:25] <^d> MrZ-man: Cool, I'll leave him a note saying that and pointing him to your talkpage.
[21:38:30] <^d> Thanks!
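
Since Coren narrows the bound down to server.max-connections (simultaneous accepted clients) rather than file descriptors, a rough way to see how close a tool's lighttpd is to that limit is to count its established TCP connections on the webgrid node. A sketch, run as the tool account; the lsof invocation is standard, but treating the count as "current clients" is the assumption here.

  # Established client connections held by this tool's lighttpd right now
  lsof -a -u "$(whoami)" -c lighttpd -iTCP -sTCP:ESTABLISHED 2>/dev/null | tail -n +2 | wc -l
  # Compare against server.max-connections in the effective lighttpd config;
  # reaching it is what produces "sockets disabled, connection limit reached".
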
[21:38:35] scfc_de: I read your answer on bugzilla
[21:38:48] hedonil: https://bugzilla.wikimedia.org/show_bug.cgi?id=61102
[21:39:01] scfc_de: yep, this one
[21:39:04] hedonil: But for me, kill -HUP $(pid of php-cgi) always worked.
[21:39:36] I have a clean up script on tools-login, I'll run it.
[21:39:39] scfc_de: ok, let's try the next "stucking" ones
[21:40:10] scfc_de: I'll give you a ping if I have another one
[21:40:34] Killed a few. Perhaps I'll set it up as a cron job until we fix that bug.
[21:41:29] scfc_de: sounds like a good idea
[21:41:48] hedonil: And done (every five minutes).
[21:43:58] scfc_de: as an aside: low accessed tools like my webtest have no problem with terminating php-cgi with webservice stop
[21:44:23] while highly accessed tools have (sometimes)
[21:45:14] hedonil: Are you sure? In my testing, "webservice stop" always left php-cgis behind. Only exception perhaps if it didn't start any php-cgis in the first place?
[21:45:20] which keeps me wondering, if they hold persistent resources/connections and don't free up
[21:45:54] * hedonil checks again
[21:46:24] The problem AFAICS is that lighttpd doesn't connect to the existing php-cgi processes (don't know if that would even be possible), but doesn't create new ones either because there are already five processes running.
[21:48:07] scfc_de: ok. I'm on tools-webgrid-01 monitoring $ top -u tools.newwebtest
[21:48:30] and will issue webservice stop in a few sec
[21:49:11] called stop
[21:50:03] scfc_de: you're right, now the php-cgi's keep on living
[21:50:25] scfc_de: but now, with short delay they are gone
[21:51:39] scfc_de: so, after ~5sec delay they terminated properly
[21:53:21] That could be my cron job :-). Let me deactivate that for the moment.
[21:53:41] haha
[21:53:42] Okay, try again, please.
[21:54:03] started
[21:54:28] stopping now
[21:56:05] scfc_de: they keep alive
[21:56:56] Okay, then I'll reenable my cron job :-).
[22:00:04] scfc_de: and another (new) variant, I pkill -9 'd them on tools-webgrid-02 - but they keep living !
[22:01:04] scfc pid 8598, 8600, 8601, 8602, 8603
[22:01:43] ?
[22:02:25] hedonil: One moment.
[22:03:16] * hedonil wonders who lost his magic pkill or hedonil ...
[22:05:32] hedonil: Are you trying to kill them as user ... tools.something?
[22:05:53] scfc_de: yep. as tools account, as always
[22:07:55] hedonil: Just "kill -KILL 8598" as tools.newwebtest on tools-webgrid-02, and the process's gone?!
[22:08:26] scfc_de: yep 8598 was killed
[22:09:37] scfc_de: pkill -9 doesn't work any longer for me
[22:09:51] kill -KILL gives me bash: kill: (8600) - Operation not permitted
[22:10:41] maybe I'm no admin ;)
[22:11:18] Are you sure you are user tools.newwebtest when executing pkill?
[22:11:47] scfc_de: shame on me
[22:12:06] scfc_de: I'm another tool ...
[22:12:32] wrong console
[22:13:42] No problem; I've reenabled my cron job.
[22:14:13] scfc_de: sent 8600 a soft pkill (doesn't work, we know that)
[22:15:19] scfc_de: sent pkill -9 (as tools.newwebtest!) didn't help either
[22:15:42] scfc_de: and now your bot nuked them all
[22:16:22] this is strange
[22:17:35] As it worked for me as tools.newwebtest, I suspect the problem lies with you :-). What do you mean by "sent"?
[22:17:56] $ pkill -9
[22:19:18] hedonil: Eh, that would look up processes with as their *name*? Just use kill?
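
The manual cleanup here (and scfc_de's five-minute cron job) comes down to finding php-cgi workers still owned by the tool account after a "webservice stop" and signalling them — SIGHUP first, which scfc_de says always worked for him, then SIGKILL as the fallback this log shows was sometimes needed. A sketch, run as the tool account on the webgrid host:

  # Kill leftover FastCGI workers for the current tool account
  tool=$(whoami)                 # e.g. tools.newwebtest
  pkill -HUP -U "$tool" php-cgi  # polite attempt
  sleep 5
  pgrep -U "$tool" php-cgi >/dev/null && pkill -KILL -U "$tool" php-cgi
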
[22:19:37] scfc_de: but it may be my fault though
[22:20:04] scfc_de: in my script I have: $ ssh tools-webgrid-01 'pkill -9 -U tools.newwebtest php-cgi'
[22:20:24] hedonil: maybe pkill needs something to "grep" ;)
[22:20:51] scfc_de: and as I issued kill, I was another tool
[22:21:17] ... so seems illuminated ;)
[22:21:36] at least the "kill php-cgi" issue
[22:29:10] scfc_de: I won't bother you, but now I have a running webservice in qstat, and /no/ processes
[22:30:33] cursed
[22:34:28] scfc_de: webservice restart; qstat shows running lighty, no processes, no webservice
[22:35:57] Coren: scfc_de. any suggestions?
[22:36:50] ok. back now
[22:39:58] hedonil: That /should/ not be possible; the shepherd should have noticed its process dying.
[22:40:52] Coren: in fact, the latter issue was /completely/ my fault
[22:41:15] * hedonil apologizes
[23:05:02] Coren: besides this silly restart thingy, just for a second look at the defaults config:
[23:05:06] http://redmine.lighttpd.net/projects/1/wiki/Docs_Performance
[23:05:12] vs.
[23:05:25] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Default_configuration
[23:05:50] there is also a default with 60 seconds
[23:06:01] but this is: server.max-read-idle = 60
[23:08:06] You're missing the point of my default config which is "modest use of resources" rather than "performance". :-) But yeah, I'll probably tweak the idle timeout sometime next week to see if it helps with the more heavily loaded tools.
[23:10:37] Coren: :-)
[23:28:27] I think keeping a server potentially around for 60 s for each keep-alive client doesn't necessarily count as "modest use of resources" :-).
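
If the idle timeout does turn out to matter, the knob quoted from the redmine performance page could be overridden per tool rather than globally. A sketch, assuming the per-tool ~/.lighttpd.conf override described on the Help page linked above is honoured; the value 5 is the redmine recommendation quoted in this discussion, not something that was actually deployed here.

  # Drop idle keep-alive connections after 5 s instead of the 60 s default
  printf '%s\n' 'server.max-keep-alive-idle = 5' >> ~/.lighttpd.conf
  webservice restart
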