[07:18:07] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923189 (doctaxon) I am still missing the execution of some diverse cronjobs last days, and I am not very lucky about it, because fellow Wikipedians rely on me and the working of my bots to set time as in the crontab stated.
[07:52:27] MediaWiki-extensions-OpenStackManager: Pop-up notification when deleting an instance contains literal "$2" - https://phabricator.wikimedia.org/T123162#1923199 (Krenair) My guess is this is from the JS code, and the second parameter is being passed as undefined.
[08:02:24] MediaWiki-extensions-OpenStackManager: Pop-up notification when deleting an instance contains literal "$2" - https://phabricator.wikimedia.org/T123162#1923200 (Krenair) Looks like it's expecting the data for `.novainstanceaction` elements to contain an `id` key, but it does not.
[08:12:14] MediaWiki-extensions-OpenStackManager: Pop-up notification when deleting an instance contains literal "$2" - https://phabricator.wikimedia.org/T123162#1923203 (Krenair) That just comes from `OpenStackNovaInstance::getInstanceId`, which looks up the instance's `OS-EXT-SRV-ATTR:instance_name` property. But som...
[08:12:52] Labs: OS-EXT-SRV-ATTR:instance_name not set for some instances - https://phabricator.wikimedia.org/T123162#1923204 (Krenair)
[09:52:47] PAWS: Add a way to expose static files to the internet from PAWS - https://phabricator.wikimedia.org/T119859#1923226 (yuvipanda) Yeah, all of that is done now! So all non-hidden (aka starts with . or in a dir that starts with .) are public by default now. This needs to be prominently displayed somewhere.
[09:52:52] PAWS: Add a way to expose static files to the internet from PAWS - https://phabricator.wikimedia.org/T119859#1923227 (yuvipanda) Open>Resolved a:yuvipanda
[10:13:56] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923231 (scfc) @doctaxon: You have been a Wikipedian now for more than ten years (congratulations and thanks!), and I have seen you around the Toolserver and #Tool-Labs for probably half that time at the very least. With that ex...
[10:31:52] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923240 (doctaxon) But there is nothing more I can say about it. /data/project/taxonbot * The cronjob beta-newday.tcl should start 0:00 UTC today, but it did not without any error report in beta-newday.out * The cronjob lkday....
[10:42:35] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923245 (doctaxon)
[10:49:24] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923254 (zhuyifei1999) Not sure if it's tracked or not, but there was a similar outage yesterday (Jan 9), between 17:25:05 UTC and 17:50:07 UTC: ``` tools-exec-1206: Sat Jan 9 17:25:05 UTC 2016 error: commlib error: got select er...
[10:51:41] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923255 (zhuyifei1999) The previous one is on Jan 7 when this bug was reported: ``` tools-exec-1210: Thu Jan 7 09:40:08 UTC 2016 error: commlib error: got select error (Connection refused) error: unable to send message to qmaster...
[10:59:58] Labs, Tool-Labs: Inconsistent locale settings of different grid compute nodes. - https://phabricator.wikimedia.org/T121505#1923256 (zhuyifei1999) Open>Resolved a:zhuyifei1999 Not sure why & how, but after the outage on {T122638} (and a [[https://wikitech.wikimedia.org/w/index.php?title=Nova_Resou...
[11:00:13] Labs, Tool-Labs: Inconsistent locale settings of different grid compute nodes. - https://phabricator.wikimedia.org/T121505#1923261 (zhuyifei1999) a:zhuyifei1999>None
[11:46:52] Labs, Tool-Labs, grrrit-wm: Rogue grrrit-wm running... somewhere - https://phabricator.wikimedia.org/T123167#1923281 (Peachey88) and now there is none.
[12:30:21] Labs, Tool-Labs, grrrit-wm: Rogue grrrit-wm running... somewhere - https://phabricator.wikimedia.org/T123167#1923316 (valhallasw) That's unfortunate. I tried to re-start the k8s one using ``` yuvipanda@tools-k8s-master-01:~$ ./kubectl --user=lolrrit-wm --namespace=lolrrit-wm create -f /home/yuvipanda...
[12:46:07] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923318 (valhallasw) >>! In T123034#1923240, @doctaxon wrote: > But there is nothing more I can say about it. > > /data/project/taxonbot > > * The cronjob beta-newday.tcl should start 0:00 UTC today, but it did not without any...
[12:48:49] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923319 (doctaxon) Maybe crontab is instable? running the cronjob at 10.01.2016 12:39:35 UTC but the cronjob should run 10.01.2016 00:00:00 UTC crontab: JSUB_OPTIONS=-once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g 0 0 *...
[12:50:58] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923320 (valhallasw) No, that was me running it manually.
[12:57:09] Labs, Tool-Labs: SGE master down - https://phabricator.wikimedia.org/T123034#1923321 (valhallasw) >>! In T123034#1923254, @zhuyifei1999 wrote: > Not sure if it's tracked or not, but there was a similar outage yesterday (Jan 9), between 17:25:05 UTC and 17:50:07 UTC: Thanks, that's super useful informatio...
[12:58:29] Labs, Tool-Labs: tools.taxonbot cronjob not firing - https://phabricator.wikimedia.org/T123186#1923322 (valhallasw) NEW
[13:04:11] valhallasw`cloud: should i add my exact same problem to that ticket?
[13:04:31] gifti: cronjob not firing? Yes, please.
[13:05:32] gifti: please add as much information as possible (toolname, which specific cronjob, etc)
[13:05:41] i know ^^
[13:05:52] thanks!
[13:08:14] Labs, Tool-Labs: tools.taxonbot cronjob not firing - https://phabricator.wikimedia.org/T123186#1923332 (valhallasw) I'm wondering if this could just be a load issue. It's a 1-cpu host, and this is the load average in the last 24 hours: {F3216475} i.e. 'always above one, with peaks to 9 as midnight'.
[13:11:31] valhallasw`cloud: what do you mean by "which files should contain information about these jobs"?
[13:11:51] gifti: which .out and .err files
[13:11:58] ah
[13:15:53] Labs, Tool-Labs: tools.taxonbot cronjob not firing - https://phabricator.wikimedia.org/T123186#1923335 (Giftpflanze) I have had the same problem (tools.giftbot): 2016-01-03 23:00 UTC: 0 22,23 * * 0 [ $(date +\%-H) = 0 ] && jsub -once -j y -quiet -v LC...
[13:33:03] Labs, Tool-Labs: missing database on replica server - https://phabricator.wikimedia.org/T105713#1923336 (Superyetkin) Beat? Could you please recreate the structure?
[13:36:39] Labs, Tool-Labs: missing database on replica server - https://phabricator.wikimedia.org/T105713#1923339 (valhallasw) 'beat me to it' = 'someone already created the table'. As far as I can see, the table is already there, at least on c2.labsdb. ``` MariaDB [(none)]> use s51698__yetkin; MariaDB [s51698__ye...
[13:41:04] Labs, Tool-Labs: missing database on replica server - https://phabricator.wikimedia.org/T105713#1923351 (Superyetkin) The same needs to be done for //s51698__yetkin2.wanted_items//.
[13:50:28] Labs, Tool-Labs: missing database on replica server - https://phabricator.wikimedia.org/T105713#1923368 (valhallasw) Assuming the same table structure as for s51698__yetkin: ``` tools.superyetkin@tools-bastion-01:~$ mysql --defaults-file=replica.my.cnf -h s2.labsdb Welcome to the MariaDB monitor. Command...
[15:46:05] Hi i want to install hunspell on my tool. how can i do it?
[15:46:38] Perhaps your account does not have write access to this directory? If the installation directory is a system-owned directory, you may need to sign in as the administrator or "root" account.
[15:47:10] hello
[15:47:19] let me check
[15:47:52] you mean python-hunspell?
[15:48:16] or python3-hunspell
[15:48:16] yes
[15:48:23] 2
[15:48:28] python2.7
[15:51:57] hm, I can't find the gerrit repository we use for puppet so that I could add it, we need to wait for someone, it doesn't seem to be anywhere in docs
[15:52:20] Results (Found 7): gerrit, whitespace, git-puppet, gerritsearch, ryanland, gitweb, hook,
[15:52:20] @search gerrit
[15:52:26] !git-puppet
[15:52:26] git clone ssh://gerrit.wikimedia.org:29418/operations/puppet.git
[15:52:32] hmm here we go
[15:58:18] https://gerrit.wikimedia.org/r/#/c/263229/
[16:01:03] thanks.
[16:16:34] petan: Tim Landscheidt asked for reference
[16:16:46] petan: Tim Landscheidt asked for reference
[16:16:47] what
[16:17:10] https://gerrit.wikimedia.org/r/#/c/263229/2/modules/toollabs/manifests/exec_environ.pp
[16:17:20] here Tim Landscheidt asked for reference
[16:25:05] reza1615: what's your phabricator username?
[16:25:42] Labs, Tool-Labs: Install python-hunspell (and dictionaries?) - https://phabricator.wikimedia.org/T123192#1923416 (valhallasw) NEW
[16:34:09] YuviPanda: do you have time to look at grrrit-wm? I can't get it back online
[17:58:37] Labs, Tool-Labs: Bigbrother trying to restart existing job - https://phabricator.wikimedia.org/T123197#1923487 (valhallasw) NEW
[18:02:47] Labs, Tool-Labs: Bigbrother trying to restart existing job - https://phabricator.wikimedia.org/T123197#1923495 (valhallasw) /data/project/admin/bigbrother.log doesn't show anything recent. Bigbrother log is full of ``` error: commlib error: got select error (Connection refused) error: unable to send mes...
[18:03:00] Labs, Tool-Labs: Toolsbeta bigbrother failing to start toolhistory - https://phabricator.wikimedia.org/T123197#1923496 (valhallasw)
[18:10:01] Labs, Tool-Labs: tool-labs error pages HTTP/400 for POSTs - https://phabricator.wikimedia.org/T123136#1923500 (valhallasw) Yeah, that's definitely a reasonable option, although the 'eat your own dog food' approach of the front page also has its charm. As for this specific issue, we could killing the nginx...
[18:31:15] Labs, Tool-Labs: Toolsbeta bigbrother failing to start toolhistory - https://phabricator.wikimedia.org/T123197#1923516 (scfc) I'm currently setting up the grid in Toolsbeta, so that I can work on the proxy, so this might produce a few more mails.
[19:26:33] Labs, Tool-Labs: Toolsbeta bigbrother failing to start toolhistory - https://phabricator.wikimedia.org/T123197#1923533 (valhallasw) OK! I was mostly worried this was an issue with the real bigbrother, but that one seems to be fairly content at the moment :-)
[19:42:50] valhallasw`cloud: When someone says Kubernetes, is that equivalent to Amsterdam or Ashburn?
[19:43:01] Like it is just the name of the hosting facility?
[19:43:46] Oh, no, cluster is being used differently... http://kubernetes.io/
[19:52:09] Leah: no, it's roughly comparable to SGE
[19:52:56] but then actively maintained, and using a more modern idea of how one wants to schedule tasks
[19:54:08] Hmmm, all right.
[19:54:19] Thanks for helping out with grrrit-wm. Hopefully we can get it fixed soon.
[19:54:36] Leah: I'll try to get it running in a screen temporarily
[19:54:50] that *should* still be possible
[19:55:22] (I had not really considered that option)
[19:55:45] screen? I think you mean s5n. :P
[19:55:55] Err, s4n.
[19:56:04] Bad joke is bad.
[19:56:50] or just on SGE
[19:58:17] Labs, Tool-Labs, grrrit-wm: Rogue grrrit-wm running... somewhere - https://phabricator.wikimedia.org/T123167#1923539 (valhallasw) Now running on SGE again.
[19:59:00] now let me get back to cleaning my inbox
[20:22:08] Labs, Tool-Labs: exec hosts have apache2 running - https://phabricator.wikimedia.org/T105059#1923578 (valhallasw) ``` valhallasw@tools-exec-1201:~$ apt-cache rdepends --installed apache2 --recurse ``` would give all local packages that depend on apache2 (recursively): ``` apache2 Reverse Depends: a...
[21:16:37] Hello. Connections to my web tools are timing out recently, including simple views with minimal DB access. I see this error in the error.log: "(server.c.1444) [note] sockets disabled, connection limit reached". The connection limit seems to be configured elsewhere, so I can't change it. Any ideas on how to address this?
[21:17:16] The affected tools are at https://tools.wmflabs.org/meta/ (though they're currently okay for a while since I restarted the web service).
[21:19:13] Pathoschild: unfortunately, no. Lighttpd somehow doesn't close closed connections correctly. Someone wrote a workaround, though -- let me check if I can find the bug.
[21:19:36] Pathoschild: https://phabricator.wikimedia.org/T104799
[21:21:51] Thanks. So the only solution for now is occasional restarts if the issue comes back? :|
[21:22:31] Pathoschild: yes, unfortunately :/
[21:22:46] Pathoschild: basically, no-one has any idea what's going on
[21:23:35] Alright, thanks. :)
[21:23:45] Labs, Tool-Labs: lighttpd does not correctly close connections (CLOSE_WAIT) - https://phabricator.wikimedia.org/T104799#1923666 (Pathoschild) This happened for tools.meta today.
[21:36:54] Pathoschild: could it be there's a php script in the background that keeps running?
[21:40:01] Labs, Tool-Labs: lighttpd does not correctly close connections (CLOSE_WAIT) - https://phabricator.wikimedia.org/T104799#1923667 (valhallasw) The seven-year-old issue http://redmine.lighttpd.net/issues/1829 might actually be the same one. I think I ignored it last time (because it's 7 years old), but we ar...
[21:45:08] valhallasw`cloud: there shouldn't be; everything in the 'meta' project just fetches data and renders a web view.
[21:45:26] has it happened before?
[21:45:43] (if it has, meta might be a good project to try some debugging steps)
[22:08:12] Labs, Tool-Labs: Remove overly-large log files - https://phabricator.wikimedia.org/T122508#1923676 (valhallasw) error.log files larger than 100MB: ``` 103236 /data/project/icommons/error.log 109640 /data/project/freefiles/error.log 120448 /data/project/checkwiki/error.log 121680 /data/project/magnust...
[22:08:23] valhallasw`cloud: I've had similar symptoms a few times before, but it's the first time I've looked into it so I'm not sure if the cause was the same.
[22:09:00] Pathoschild: ok, thanks!
[22:21:39] Labs, Tool-Labs: lighttpd does not correctly close connections (CLOSE_WAIT) - https://phabricator.wikimedia.org/T104799#1923681 (valhallasw) fwiw, this seems to be a fairly regular occurrence for some tools. I grepped all error.logs for "connection limit reached", and these tools seem to be regulars: ```...