[05:11:13] Any idea why I'd get HTTP 403 on a single *.js file through tools-static? This file returns 403 ( https://tools-static.wmflabs.org/meta/scripts/pathoschild.regexeditor.js ), but other files work fine. If I rename the file, it works fine under the new name but continues to 403 under the old name (even though it's not there anymore).
[05:12:40] It has the same permissions as the other files. I recently `git pull`'d it, but don't see how that would break it.
[08:40:34] Quarry: Every second attempt to use Quarry to do an SQL query fails - https://phabricator.wikimedia.org/T109014#1538688 (yuvipanda) Open>Resolved a:yuvipanda Oops, I had pooled in another runner host and forgot to do the manual provisioning of /etc/hosts (boo!). Fixed now. Thanks for reporting it!
[13:29:25] anyone tell me how to find a submitted and running jsub when "job" shows nothing?
[13:30:04] sDrewth: qstat
[13:30:44] returns nothing
[13:31:02] then there is no job running on the grid
[13:31:27] the python job is running somewhere as the log file is still populating
[13:31:45] which user?
[13:31:46] touch.py to a big ns
[13:31:51] wikisource-bot
[13:31:55] and which logfile?
[13:32:20] wikisource-bot/enWS_104purge
[13:33:03] hrm.
[13:33:35] unless the logging is very slow and retrospective to the task
[13:35:18] yep, there's a process running on tools-exec-1214
[13:35:32] (found using for i in {01..15}; do ssh tools-exec-12$i ps -futools.wikisource-bot; done )
[13:35:54] would you be able to kill it? it started erroring after 100 or so lines
[13:36:02] yes
[13:36:07] thx
[13:36:10] sDrewth: killed
[13:36:24] I'm not sure how it escaped SGE's grasp though
[13:36:33] could this be due to the reboots?
[13:36:48] it did the same yesterday when I was running these in series
[13:37:49] gifti: maybe, but I'm not sure how.
The grid master will be rebooted today and hasn't been offline
[13:38:03] but there are some ways to get out from under the master's grasp
[13:38:10] which might have happened accidentally here
[13:38:20] sDrewth: how did you start this task?
[13:38:29] oh!
[13:38:33] sDrewth: don't use -daemonize :-)
[13:38:46] ah
[13:39:02] not needed when using the grid?
[13:39:23] actively harmful when using the grid
[13:39:35] (and an SGE bug that SGE allows it)
[13:39:40] k
[13:40:13] sDrewth: basically, -daemonize causes the task to completely disappear from SGE's radar, even though it's still running
[13:41:03] okay, then what is the easy means to get it to log with pywikibot?
[13:41:22] so sexy that it did it all so easily
[13:41:38] duh -log:
[13:42:11] sDrewth: it should already log in /logs/scriptname.log
[13:42:17] but for a separate log, yes, -log:
[13:43:14] Labs, Tool-Labs: SGE loses daemonized processes - https://phabricator.wikimedia.org/T109071#1539595 (valhallasw) NEW
[13:47:27] cannot find a good place on wikitech to say that
[13:47:50] sDrewth: the 'don't use -daemonize'?
[13:47:55] mm. maybe the pywikibot on tool labs page?
[13:48:55] unfortunately the docs are somewhat hard to search :/
[13:49:08] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Developing#Setup_pywikibot_on_Labs_.28locally.29
[13:49:22] under "setup job submission"?
[13:52:05] done
[14:10:44] Labs, Tool-Labs: SGE loses daemonized processes - https://phabricator.wikimedia.org/T109071#1539629 (scfc) This reminds me of the `php-cgi` processes that `lighttpd` jobs leave behind, which I always assume to be related to our custom `jobkill` script (even though that doesn't seem to be configured for th...
[14:14:58] Labs, Tool-Labs: SGE loses daemonized processes - https://phabricator.wikimedia.org/T109071#1539646 (valhallasw) I'll try to do that. The issue was originally reported on IRC by sDrewth, so I'll have to fiddle around to reproduce it as well.
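Editor's note on the -daemonize discussion above: the sketch below is one way to see why a daemonized task can outlive the job that launched it. It is not pywikibot's daemonize.py and it says nothing about how SGE tracks jobs internally; it only assumes the usual Unix double-fork pattern, and the name `spawn_detached` is hypothetical. The caller waits for its direct child, which exits almost immediately, while a grandchild carries on in a new session.

```python
import os


def spawn_detached():
    """Double-fork and return (pid, session id) of the detached grandchild."""
    read_fd, write_fd = os.pipe()
    first_child = os.fork()
    if first_child == 0:
        os.close(read_fd)
        os.setsid()              # start a new session, separate from the caller's
        if os.fork() > 0:
            os._exit(0)          # intermediate process exits, orphaning the grandchild
        try:
            # Grandchild: report where it ended up. A real daemon would keep
            # working here; this sketch exits straight away.
            os.write(write_fd, f"{os.getpid()} {os.getsid(0)}".encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    os.waitpid(first_child, 0)   # the only process the caller launched has now exited
    with os.fdopen(read_fd) as pipe:
        pid, sid = map(int, pipe.read().split())
    return pid, sid


if __name__ == "__main__":
    pid, sid = spawn_detached()
    print(f"caller session {os.getsid(0)}, detached pid {pid} in session {sid}")
```

Once the direct child has exited, anything that only watches that child would see the work as finished, which matches the symptom reported here: qstat shows nothing while the log file keeps growing.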
[14:16:34] valhallasw`cloud: do you want the complete command line that I was using?
[14:16:46] sDrewth: that might be helpful, yes
[14:16:55] sDrewth: luckily touch.py is basically a no-op command already
[14:17:09] jsub python /shared/pywikipedia/core/scripts/touch.py -lang:cy -family:wikisource -namespace:104 -start:\! -daemonze:cyWS_104purge
[14:17:25] sDrewth: thanks
[14:17:27] sure, which was why I wasn't particularly worried, and why I felt competent to do
[14:17:39] ack
[14:17:58] spelling of daemonize which was correct
[14:18:10] jsub python /shared/pywikipedia/core/scripts/touch.py -lang:cy -family:wikisource -namespace:104 -start:\! -daemonize:cyWS_104purge
[14:30:26] Labs, Tool-Labs: SGE loses daemonized processes - https://phabricator.wikimedia.org/T109071#1539666 (valhallasw) The following reproduces it. wget https://raw.githubusercontent.com/wikimedia/pywikibot-core/0592360e6b58456d25c082e91fabe91590f0b85d/pywikibot/daemonize.py test.py: ``` #!/usr/bin/env python...
[14:43:23] !log tools killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
[14:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[15:04:42] Labs, Tool-Labs: Check for error log ownership before starting webservice job - https://phabricator.wikimedia.org/T99576#1539738 (scfc) a:scfc
[15:08:10] hmm, beta labs is giving me: (Cannot access the database: Can't connect to MySQL server on '10.68.17.94' (4) (10.68.17.94))
[15:20:16] !log tools Adding back to the grid engine queue: tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
[15:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[15:27:51] Coren: unless you’re dying to do reboots, I’m going to reschedule a few for early next week.
[15:29:59] http://botbot.wmflabs.org has been giving a 502 for a while. i rarely use it but it's nice to have
[15:37:39] ok, that’s it, today’s reboot is concluded.
[15:38:39] niedzielski: please contact the maintainers of the 'bots' project (mostly petan|lyon, I think).
[15:39:24] valhallasw`cloud: thanks! is this the appropriate channel for that discussion?
[15:40:16] niedzielski: in principle, yes, but in practice, this channel is mostly tool labs-specific discussion plus labs-wide admin things
[15:40:53] Labs: Labs proxy 502 should indicate where to ask for support - https://phabricator.wikimedia.org/T109078#1539867 (valhallasw) NEW
[15:41:24] niedzielski: this is the appropriate channel if the people you need to talk to are here :)
[15:41:40] valhallasw`cloud andrewbogott: cool, thanks!
[15:44:17] petan|lyon: hey! i noticed that botbot.wmflabs.org has been giving 502 for a while. i think the past couple days or maybe even a week-ish? i don't regularly use the tool but it is nice to have. if it's possible to fix the issue as a very low priority task, i'd appreciate it
[15:47:15] Labs, Tool-Labs: webservice2 not starting - https://phabricator.wikimedia.org/T87641#1539888 (Ricordisamoa)
[15:47:26] Labs, Tool-Labs: wikidata-todo webservice not working - https://phabricator.wikimedia.org/T74532#1539890 (Ricordisamoa)
[15:57:43] James_F: the bot hit the rate limiter despite it not being a real save
[15:58:08] sDrewth: No, I mean, give it +bot on the wikis individually.
[15:58:28] I cheated and ran it from my steward account
[15:58:31] sDrewth: This is why letting wikis opt-out of global governance is a stupid idea.
[15:58:35] sDrewth: Hmm. :-)
[15:58:41] it was a null edit so not hitting RC
[15:58:58] Indeed.
[15:59:07] James_F: I think that we just need to have a WS bot group
[15:59:22] sDrewth: Maybe.
[15:59:47] well, that will allow the conversation to take place for rights in Phabricator, to be assigned
[16:00:34] what is ws?
[16:00:53] wikisource
[16:01:01] oh
[16:10:22] sDrewth: it /is/ a real save
[16:10:33] but a null edit
[16:10:51] (it doesn't use purge, because no-one cared to rewrite the bot once purge was introduced)
[16:14:50] purge didn't work anyway
[16:15:15] and many of these null edits are showing as real edits, shit
[16:15:34] huh
[17:12:44] Labs: Labs proxy 502 should indicate where to ask for support - https://phabricator.wikimedia.org/T109078#1540287 (scfc) As the whole proxy setup is very complex: You mean the situation where the target instance of a "web proxy" (https://wikitech.wikimedia.org/wiki/Special:NovaProxy), i. e. from the proxy's p...
[17:21:18] !log disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
[17:21:18] disabling is not a valid project.
[17:21:27] !log tools disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
[17:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[17:46:48] Labs: Labs proxy 502 should indicate where to ask for support - https://phabricator.wikimedia.org/T109078#1540476 (valhallasw) Yes, exactly.
Adding a clearer error message for errors on the backend server sounds like it's going to cause more issues than it solves.
[19:05:25] hello, I've got my Ruby tool ready to go. I've followed the instructions at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Other_web_servers but am still unable to browse to the tool
[19:06:22] I ran `jstart -q webgrid-generic ./httpserver.sh` and with `qstat` I see a running job 'httpserver'
[19:06:44] however `ps -ef | grep unicorn` does not return anything, telling me unicorn did not start
[19:07:54] you have to run that on the actual exec host
[19:08:22] which tool is this?
[19:08:36] musikbot?
[19:08:43] musikanimal
[19:09:00] musikbot is just a cronjob, no frontend
[19:10:11] in httpserver.sh I made sure to export the PATH so that it knows where the Ruby gems are, just like I did with the sh file used to boot up MusikBot
[19:10:31] the job seems to stay in the queued state
[19:11:22] maybe this needs to run on trusty and it's not?
[19:12:51] could very well be. Also check the memory requirements.
[19:13:00] is there a way to tell jstart to use trusty?
[19:13:58] I tried `jstart -q webgrid-generic ./httpserver.sh release=trusty` and didn't get an error
[19:14:01] -l release=trusty
[19:14:02] I think
[19:15:05] yeah that's what I have in the crontab for MusikBot
[19:17:13] I have to go again
[19:18:05] ok thank you!
[19:29:39] MusikAnimal: the portgrabber thing can be tricky. I think environment variables (export PATH=… etc) will be lost by "exec portgrabber". You may need to write a second script which portgrabber executes. But there should be errors in ~/httpserver.err which say what happened
[19:30:03] no errors were logged
[19:30:41] here's my httpserver.sh: http://pastebin.com/7kNfwqi6
[19:31:22] I actually get "Four hundred and four!"
at https://tools.wmflabs.org/musikanimal/ so the httpserver is running, I guess
[19:31:26] but Unicorn is not
[19:32:48] I could make a shell script that portgrabber could run, but not sure how to make it accept the port that it would pass into the shell script
[19:33:02] according to the docs, it appends the port at the end
[19:34:04] I'm not too great with bash, but sounds like maybe I need a "function" that accepts a param, portgrabber would run that function
[19:34:50] and the function would export the PATH, set the local ruby, etc, then do `unicorn -p $1 unicorn.rb` where $1 would hopefully be the port portgrabber passed in
[19:35:33] you need $2
[19:35:50] or, wait a second
[19:36:07] yeah it's $1, misread my script
[19:36:54] you could also use the undocumented and "unsupported" webservice-new from https://phabricator.wikimedia.org/T97230
[19:37:27] it doesn't restart the job if it crashes, but else it works fine for me and gets rid of the portgrabber thing
[19:38:03] webservice-new generic start somescript.sh
[19:39:03] somescript.sh will get the $PORT variable, so unicorn -p $PORT unicorn.rb should work with webservice-new (without all the portgrabber exec)
[19:41:58] hmm interesting
[19:42:43] though I'd prefer to go with the documented route if possible. I guess I could make a cronjob to check the HTTP status of the site and restart it with webservice-new as needed
[19:42:59] since it doesn't automatically restart, as you say
[20:07:37] Labs, Tool-Labs: SGE loses daemonized processes - https://phabricator.wikimedia.org/T109071#1541000 (coren) Open>Invalid a:coren That's... not a bug. This program daemonizes, and the resulting stray process is the undesirable, but perfectly expected result. The grid expect that the job is compl...
[20:22:27] Coren: 'people accidentally having jobs running on the grid with no clear way to stop them' sounds like a system issue to me
[20:22:58] Coren: why is it letting jobs daemonize in the first place?
or at least killing everything that remains when it considers the job 'done'
[20:23:29] That's not how things work. You can't prevent daemonization without preventing legitimate forks()
[20:25:04] In that case, I'd say SGE should kill remaining processes when the invoked process exits
[20:26:51] but whatever the solution, the current result 'unsupervised jobs running on an exec node' is undesirable
[20:30:42] Did Wikitech/labsconsole run LQT at some point?
[20:31:30] valhallasw`cloud: Hm. That's reasonable. But it's not possible to locate stray tasks at exit from the shepherd.
[20:31:48] Because parentage is lost on exit() of the parent.
[20:32:12] Best we can do is look for stray processes by UID and hunt-and-seek suspect ones.
[20:45:04] Maybe wrap the called script with another script that does the killing?
[20:45:34] Oh, but the parent is still gone then
[20:45:35] Hm
[20:59:30] I think it's easier to educate and inform the users than it is to try to fix what is, ultimately, a relatively infrequent issue.
[22:01:13] Labs, Tool-Labs, Patch-For-Review: Install flake8 on Tool-Labs - https://phabricator.wikimedia.org/T90447#1541446 (valhallasw) Open>Resolved a:valhallasw
[22:01:15] Labs, Tool-Labs, Tracking: Packages to be added to toollabs puppet - https://phabricator.wikimedia.org/T55704#1541448 (valhallasw)
[23:27:57] hello?
[23:42:32] Labs, Labs-Sprint-105, Labs-Sprint-108, Labs-Sprint-109, Patch-For-Review: Archive NFS data for projects that no longer have NFS - https://phabricator.wikimedia.org/T104857#1541862 (Andrew) Open>Resolved The archive-project-volumes script works again, and archives are now in /srv/others/orph...
[23:53:17] Labs, Labs-Infrastructure: Rename (remove?) shell user "80686" - https://phabricator.wikimedia.org/T63967#1541882 (scfc) (`toolwatcher` has moved to `tools-submit`, so the warning now appears there.)
So the conclusion is to remove the user: ``` scfc@tools-bastion-01:~$ ldaplist -l passwd 80686 dn: uid=...
[23:54:10] Labs, Labs-Infrastructure: Remove shell user "80686" - https://phabricator.wikimedia.org/T63967#1541887 (scfc)
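Editor's note on the 20:32:12 remark about looking for stray processes by UID: a rough sketch of that idea for a single Linux host follows. It is not the procedure used in the log (that was the ssh/ps loop at 13:35:32) and it is not part of any Tool Labs tooling. The function name `find_orphans` and its `proc_root` parameter are hypothetical, and the heuristic assumes orphaned processes are reparented to PID 1, which also matches legitimately long-running daemons, so the result is only a list of candidates to inspect.

```python
import os


def find_orphans(uid, proc_root="/proc"):
    """Return sorted PIDs owned by `uid` whose parent is PID 1."""
    orphans = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue                      # skip non-process entries such as "self"
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                fields = dict(line.split(":", 1) for line in fh if ":" in line)
        except OSError:
            continue                      # process went away while we were scanning
        real_uid = int(fields["Uid"].split()[0])   # first value on the line is the real UID
        parent_pid = int(fields["PPid"])
        if real_uid == uid and parent_pid == 1:
            orphans.append(int(entry))
    return sorted(orphans)
```

Calling `find_orphans(os.getuid())` on an exec host would list the current user's candidates. Whether any of them is a leftover from a finished grid job still has to be judged by hand, for example by comparing against qstat output.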