[00:29:32] 10Tool-Labs, 3Labs-Q4-Sprint-3, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make proxylistener not need to keep open socket connections open - https://phabricator.wikimedia.org/T96059#1208507 (10yuvipanda) Hmm, maybe tomorrow I'll just merge this and fiddle around until it all works. [00:30:00] 10Tool-Labs, 3Labs-Q4-Sprint-3, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make proxylistener not need to keep open socket connections open - https://phabricator.wikimedia.org/T96059#1208508 (10yuvipanda) (toolsbeta doesn't seem fully set up yet, another thing we got to do at some point) [03:41:32] 6Labs, 10Tool-Labs, 7Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1208649 (10MZMcBride) Thank you for filing this task, Yuvi. [03:51:04] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1208660 (10MZMcBride) Thanks for filing this task, Maarten. [04:13:06] YuviPanda: so...I was converting the `checker` tool to use uwsgi-python, and now it somehow has two webservices running [07:26:44] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:26:48] PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:26:52] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:31:07] 10Tool-Labs: Fix puppet cron - https://phabricator.wikimedia.org/T96122#1208818 (10valhallasw) [07:37:41] PROBLEM - Puppet failure on tools-exec-23 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0] [07:37:57] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:37:58] PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [07:38:09] 
PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [07:38:29] PROBLEM - Puppet failure on tools-redis-slave is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [07:38:35] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [07:43:08] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [07:46:43] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [07:46:47] RECOVERY - Puppet failure on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [0.0] [07:46:51] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0] [07:47:42] RECOVERY - Puppet failure on tools-exec-23 is OK: OK: Less than 1.00% above the threshold [0.0] [07:48:00] RECOVERY - Puppet failure on tools-exec-catscan is OK: OK: Less than 1.00% above the threshold [0.0] [07:48:01] RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0] [07:48:24] RECOVERY - Puppet failure on tools-redis-slave is OK: OK: Less than 1.00% above the threshold [0.0] [07:48:35] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [09:14:55] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Lodaviz was created, changed by Lodaviz link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fLodaviz edit summary: Created page with "{{Tools Access Request |Justification=I intend to play with pywikibot. 
|Completed=false |User Name=Lodaviz }}" [09:27:10] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [10:17:11] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [11:37:39] PROBLEM - Puppet failure on tools-exec-23 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:37:45] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:37:53] PROBLEM - Puppet failure on tools-exec-09 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:09] PROBLEM - Puppet failure on tools-exec-05 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:13] PROBLEM - Puppet failure on tools-exec-03 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:21] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:23] PROBLEM - Puppet failure on tools-redis is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:24] PROBLEM - Puppet failure on tools-exec-20 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:25] PROBLEM - Puppet failure on tools-webgrid-08 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:31] PROBLEM - Puppet failure on tools-exec-24 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:32] PROBLEM - Puppet failure on tools-exec-22 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:40] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:42] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:51] PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: 
CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:53] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:57] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:57] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:38:57] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:39:33] PROBLEM - Puppet failure on tools-exec-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:39:37] PROBLEM - Puppet failure on tools-exec-06 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:39:37] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:39:41] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:43:27] PROBLEM - Puppet failure on tools-redis-slave is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:43:33] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:44:11] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [11:48:22] RECOVERY - Puppet failure on tools-redis is OK: OK: Less than 1.00% above the threshold [0.0] [11:54:42] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [11:57:52] RECOVERY - Puppet failure on tools-exec-09 is OK: OK: Less than 1.00% above the threshold [0.0] [11:58:10] RECOVERY - Puppet failure on tools-exec-05 is OK: OK: Less than 1.00% above the threshold [0.0] [12:02:52] Change on 12wikitech.wikimedia.org a page Nova 
Resource:Tools/Access Request/Abshirdheere was modified, changed by Abshirdheere link https://wikitech.wikimedia.org/w/index.php?diff=153929 edit summary: [12:08:41] RECOVERY - Puppet failure on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [12:08:41] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0] [12:08:47] 10Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1209035 (10scfc) ``` From: root@tools.wmflabs.org (Cron Daemon) Subject: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) To: root@tools.wmflabs.org Date:... [12:08:49] RECOVERY - Puppet failure on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [0.0] [12:08:51] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0] [12:08:55] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [12:08:56] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:09:09] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [12:09:33] RECOVERY - Puppet failure on tools-exec-01 is OK: OK: Less than 1.00% above the threshold [0.0] [12:09:40] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [12:12:46] RECOVERY - Puppet failure on tools-exec-10 is OK: OK: Less than 1.00% above the threshold [0.0] [12:13:20] RECOVERY - Puppet failure on tools-exec-03 is OK: OK: Less than 1.00% above the threshold [0.0] [12:13:21] RECOVERY - Puppet failure on tools-webproxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:13:21] RECOVERY - Puppet failure on tools-exec-20 is OK: OK: Less than 1.00% above the threshold [0.0] [12:13:28] RECOVERY - Puppet failure on tools-webgrid-08 is OK: OK: Less than 1.00% above the threshold [0.0] [12:13:28] RECOVERY - 
Puppet failure on tools-redis-slave is OK: OK: Less than 1.00% above the threshold [0.0] [12:13:30] RECOVERY - Puppet failure on tools-exec-24 is OK: OK: Less than 1.00% above the threshold [0.0] [12:13:34] RECOVERY - Puppet failure on tools-exec-22 is OK: OK: Less than 1.00% above the threshold [0.0] [12:14:36] RECOVERY - Puppet failure on tools-exec-06 is OK: OK: Less than 1.00% above the threshold [0.0] [12:14:44] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [12:18:06] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Lodaviz was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=153933 edit summary: [12:18:55] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Abshirdheere was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=153936 edit summary: [12:24:49] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0] [12:25:07] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [12:25:28] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [12:25:34] PROBLEM - Puppet failure on tools-exec-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:25:42] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [12:25:52] PROBLEM - Puppet failure on tools-webgrid-03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [12:25:58] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:26:24] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:27:40] RECOVERY - Puppet 
failure on tools-exec-23 is OK: OK: Less than 1.00% above the threshold [0.0] [12:27:51] 10Tool-Labs: Fix puppet cron - https://phabricator.wikimedia.org/T96122#1209043 (10scfc) Looking randomly at `/var/log/syslog` on `tools-bastion-01` says inter alia: ``` Apr 15 08:45:54 tools-bastion-01 puppet-agent[11346]: Could not send report: Timeout::Error ``` ``` Apr 15 11:15:18 tools-bastion-01 puppet-a... [12:27:57] PROBLEM - Puppet failure on tools-services-01 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [12:28:47] PROBLEM - Puppet failure on tools-services-02 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [12:28:47] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [12:29:07] PROBLEM - Puppet failure on tools-exec-05 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [0.0] [12:29:21] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [0.0] [12:29:21] PROBLEM - Puppet failure on tools-exec-20 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0] [12:33:34] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [12:33:58] RECOVERY - Puppet failure on tools-exec-wmt is OK: OK: Less than 1.00% above the threshold [0.0] [12:34:38] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0] [12:39:23] 6Labs, 7Puppet: Puppet logs should be timestamped in a human-readable way - https://phabricator.wikimedia.org/T88108#1209056 (10scfc) [12:41:02] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0] [12:47:04] Wikimedia Labs | Status: Tools down | https://www.mediawiki.org/wiki/Wikimedia_Labs | Channel logs: https://bit.ly/11GZvbS | Open bugs: http://bit.ly/1l2wFhO | Admin log: http://bit.ly/ROfuY5. 
[12:47:44] Clarify since accounts.wmflabs is still up. [12:47:55] RECOVERY - Puppet failure on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [12:48:32] 10Tool-Labs: Fix puppet cron - https://phabricator.wikimedia.org/T96122#1209077 (10valhallasw) There are three things, I think: 1) LDAP is broken, 2) the cron emails are not very informative, and 3) we get 300-ish emails about the same issue 2) should be easy to solve (tail the log file on error and cron will... [12:49:19] RECOVERY - Puppet failure on tools-webproxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:49:51] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0] [12:50:07] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [12:50:30] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [12:50:34] RECOVERY - Puppet failure on tools-exec-01 is OK: OK: Less than 1.00% above the threshold [0.0] [12:50:44] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [12:50:54] RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0] [12:51:04] RECOVERY - Puppet failure on tools-exec-catscan is OK: OK: Less than 1.00% above the threshold [0.0] [12:51:28] RECOVERY - Puppet failure on tools-exec-cyberbot is OK: OK: Less than 1.00% above the threshold [0.0] [12:53:44] RECOVERY - Puppet failure on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:53:44] RECOVERY - Puppet failure on tools-exec-10 is OK: OK: Less than 1.00% above the threshold [0.0] [12:54:10] RECOVERY - Puppet failure on tools-exec-05 is OK: OK: Less than 1.00% above the threshold [0.0] [12:54:26] RECOVERY - Puppet failure on tools-exec-20 is OK: OK: Less than 1.00% above the threshold [0.0] [12:56:35] Good job getting it back up quickly.
[13:06:11] Hm, I wonder if things might be improved for puppet if we change its invocation to wait a bit and try again in the case of early failure. [13:10:17] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [13:20:15] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:29:30] 10Tool-Labs: URGENT: http://wdq.wmflabs.org/api not accessible from Tools Labs - https://phabricator.wikimedia.org/T96136#1209153 (10Magnus) 3NEW [13:31:33] 10Tool-Labs: URGENT: http://wdq.wmflabs.org/api not accessible from Tools Labs - https://phabricator.wikimedia.org/T96136#1209160 (10Magnus) 5Open>3Resolved a:3Magnus Seems to have fixed itself... [13:33:06] 10Tool-Labs: URGENT: http://wdq.wmflabs.org/api not accessible from Tools Labs - https://phabricator.wikimedia.org/T96136#1209180 (10Andrew) There was a brief DNS outage -- that's probably what you saw. It's resolved now and a patch is in the works to fix the ultimate cause. Sorry for the interruption! https:... [13:49:53] andrewbogott: Anything I should know about that dns burp? (I.e. quick fixes if it recurs, etc) [13:50:20] Coren: it’s the same issue as always — you have to restart pdns after you restart opendj [13:50:56] alex was debugging the periodic keystone failure on virt1000 and restarted ldap as part of the troubleshooting [13:52:17] that bug is like a bear trap set in the middle of our infrastructure :( [14:04:37] 6Labs: labvirt1001 network config problem - https://phabricator.wikimedia.org/T96097#1209218 (10Andrew) My test vm in labvirt1001 is now working properly. Thank you! [15:18:00] 6Labs, 6operations: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#1209414 (10coren) NFS indeed does not allow us to know which enduser is responsible for any specific traffic, as an unavoidable consequence of the levels of abstraction t...
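[editor's note] The wait-and-retry idea for puppet floated at 13:06 amounts to a generic retry wrapper around the agent run. A hedged sketch follows; `flaky_puppet_run` is a made-up stand-in for the real `puppet agent` invocation, which is not modeled here, and the attempt count and delay are arbitrary.

```python
import time

def retry(run, attempts=3, delay=30):
    """Call run() up to `attempts` times, sleeping `delay` seconds between
    tries. run() must return True on success. This only sketches the
    wait-a-bit-and-try-again behaviour suggested in the channel, not the
    actual cron/puppet wiring on Tool Labs."""
    for attempt in range(attempts):
        if run():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before retrying (e.g. after an LDAP blip)
    return False

# Demo: a stub that fails once (simulating a transient early failure),
# then succeeds on the retry.
calls = {"n": 0}
def flaky_puppet_run():
    calls["n"] += 1
    return calls["n"] >= 2

print(retry(flaky_puppet_run, attempts=3, delay=0))  # True
```

A real deployment would shell out to the puppet agent and inspect its exit code; the point of the sketch is only the retry ordering.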
[15:52:14] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-3, 3ToolLabs-Goals-Q4: Do a rolling restart of Tool Labs precise instances - https://phabricator.wikimedia.org/T95557#1209485 (10coren) [15:54:54] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-3, 3ToolLabs-Goals-Q4: Do a rolling restart of Tool Labs precise instances - https://phabricator.wikimedia.org/T95557#1209488 (10coren) The plan here is to evacuate two nodes, reboot them, and repool them. Once that is done, new jobs will be disabled on remaining... [16:15:18] 6Labs, 6operations, 10ops-eqiad: labvirt100x boxes 'no carrier' on eth1 - https://phabricator.wikimedia.org/T95973#1209547 (10Cmjohnson) 5Open>3Resolved This should be resolved now thanks to Faidon's fix. [16:31:45] YuviPanda, andrewbogott_afk Hi [16:32:07] andrewbogott_afk: Just got ceph integration with vcenter. [16:32:18] working... [16:44:28] 10Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1209657 (10Teslaton) 8+ hr downtime right now... Isn't a monitoring-triggered autorestart a possible workaround for these lengthy ill states? It should be quite straightforward to set up. [17:14:15] 10Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1209739 (10yuvipanda) Interesting, the webservice itself was up but for some reason the proxy thought it wasn't... Our monitoring / restarting thing looks for the job being up rather than the http endpoint being hit, so that w... [17:22:40] 10Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1209774 (10yuvipanda) (i've restarted it for now) [17:52:29] hey Coren! around? [18:15:56] YuviPanda: I am. [18:16:14] YuviPanda: Sorry I didn't see your ping earlier; was doing the vpt rounds. [18:16:21] Coren: so I’m going to start pushing https://gerrit.wikimedia.org/r/#/c/204193/2 through, might have to make a few additional patches. that ok?
[18:16:50] shouldn’t cause any proxy issues but new webservices starts might not ‘stick’ for a minute or two, but we can restart them all a lot more easily now if needed [18:17:22] YuviPanda: My only "real" concern is what happens if proxyreleaser fails to run properly for any reason (like dead DNS, etc) [18:17:38] YuviPanda: So we should also have a reaping thing at regular interval to reclaim. [18:17:55] hmm [18:18:20] so have a thing that 1. gets list of active proxies, 2. looks at current local list, 3. reaps any that don’t match? [18:18:50] Sounds about right. Doesn't need to be frequent, but at least once every so often just to make sure we don't "leak" ports. [18:19:02] yah, fair enough [18:19:13] but I think this can proceed as is, and then I’ll write that? [18:19:48] YuviPanda: I'm about to start doing https://phabricator.wikimedia.org/T95557 Shouldn't be much visible impact but it does mean that some of the exec nodes will be depooled so queues might grow a bit. [18:20:08] Coren: oh, are you going to do that right now? [18:20:44] YuviPanda: Sure, doesn't have to be right now if you are reasonably satisfied the epilog script is fairly reliable. [18:21:07] YuviPanda: That's the plan, though I'm not touching the webproxies at first so as to not interfere with your port stuff. [18:21:07] it triggered for both stops and OOMs yeah [18:21:17] Coren: the webproxies are all trusty :D [18:21:25] s/webproxies/web nodes/ [18:21:28] ah right [18:21:29] cool [18:21:59] I tried doing the unmount-remount via salt. A grand total of 2 instances had it work. :-) [18:22:02] Coren: hmm, this might still put pressure on the master if we do both of these at the same time… [18:22:38] YuviPanda: I very much doubt the master will suffer for having to reschedule a couple jobs, but I can wait until you're done to start if you prefer. 
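[editor's note] The periodic reaper Coren and Yuvi sketch out above (1. get the list of active proxies, 2. look at the current local list, 3. reap any that don't match) boils down to a set difference. A minimal illustration with made-up route names; the real registrations live in redis and are not modeled here.

```python
def stale_routes(registered, active):
    """Return routes registered with the proxy but no longer backed by a
    running webservice, i.e. the ones a periodic reaper should reclaim so
    ports don't "leak".

    `registered` and `active` are iterables of route identifiers; the
    shape is a hypothetical simplification of the redis-backed data.
    """
    return sorted(set(registered) - set(active))

# Example: three routes registered, only two webservices actually running.
registered = ["checker", "kmlexport", "wdq"]
active = ["checker", "wdq"]
print(stale_routes(registered, active))  # ['kmlexport']
```

As noted in the discussion, this doesn't need to run frequently; an occasional sweep is enough to catch the cases where proxyreleaser failed (dead DNS and the like).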
[18:22:51] Coren: ah, if you think it’ll hold up fine I’ll just go ahead [18:23:24] Coren: do !log so I know what’s happening :) [18:23:30] * Coren nods. [18:23:51] will the jobs on the giftbot queue survive? [18:24:16] gifti: I wasn't planning on doing your node at all and leaving you to it. [18:24:25] gifti: Since I can't migrate your jobs to another node. [18:24:28] oh, ok [18:24:55] then i will do it on the 19th (when it is clear) [18:25:28] gifti: Sounds okay to me. [18:25:46] am i even able to do that? [18:25:50] !log tools disabled -exec-01 and -exec-02 to new jobs. [18:25:57] Logged the message, Master [18:26:03] gifti: No, but whenever you are ready you just poke one of us. [18:26:10] ok, great [18:31:12] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:31:26] * YuviPanda looks [18:32:16] ah that’s allright [18:32:17] just a race [18:34:44] Hm. There are nonrestartable jobs on the hosts that have been running for a _long_ time. I may have no choice but to qdel them. :-( [18:35:05] * Coren wishes continuous jobs were always properly put on the continuous queue. [18:35:12] shouldn’t all jobs be restartable? [18:35:40] On the continuous queue, yes - but the tasks queue is meant for things that may not be idempotent that have to run once. [18:35:51] So they are not restartable. [18:35:58] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [18:36:13] (Think: some bot that is doing template substitution; having it run twice may be disastrous) [18:36:56] Most task jobs are short-lived, and many have already drained from the nodes as expected. [18:37:52] But there are tasks that have been there for some time. Hm. 
Some of them are probably wedged anyways - I doubt a non-continuous task is really running since 07/01/2014 18:10:20 [18:38:30] Some continuous bots are genuinely happily running for a year (with some restarts) though. [18:39:42] !log tools disabled puppet on running webproxy, tools-webproxy-01 [18:39:46] Logged the message, Master [18:40:17] !log tools tools-exec-01 drained of jobs; rebooting [18:40:21] Logged the message, Master [18:43:01] !log tools tools-exec-01 back sans idmap, returning to pool [18:43:06] Logged the message, Master [18:45:49] YuviPanda: BTW, something I never thought of checking, your manifest knows to not try to restart jobs that are already queued, right? [18:46:13] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [18:47:00] Coren: for webservices? interesting - it attempts to start them if they aren’t ‘running’, by calling webservice start with appropriate command [18:47:45] YuviPanda: It probably should avoid trying to start jobs that are being queued already - otherwise jobs that are rescheduled to move by the master might end up being doubly started. [18:48:09] Coren: but webservice itself should take care of that maybe? [18:48:17] I’ll take a look after doing this migration [18:48:19] good catch [18:49:02] Well, webservice might refuse to start the job again if it sees it queued but then the manifest will get this as 'webservice failed' rather than 'don't bother starting it, it's on its way up already' [18:49:34] not sure about that. webservice also looks for state running :P [18:49:59] !log tools dequeuing tools-exec-03 whilst waiting for -02 to drain.
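[editor's note] The double-start concern discussed above (a job rescheduled by the master sits in the queue, so a naive "start if not running" check fires again) can be expressed as a small guard that treats "queued" as "already on its way up". A hedged sketch; the gridengine state strings used here ('r', 'qw', 't') are assumptions for illustration, not taken from the actual webservice or manifest code.

```python
def should_start(job_state):
    """Return True when a webservice job should be (re)started.

    gridengine reports job states such as 'r' (running), 'qw' (queued,
    waiting) and 't' (transferring to an execution node). A job in any of
    those states is up or on its way up, so starting it again would risk
    the double start discussed above. job_state is None when no job exists.
    """
    return job_state not in ("r", "qw", "t")

print(should_start(None))  # no job at all, so start one: True
print(should_start("qw"))  # already queued, leave it alone: False
```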
[18:50:06] Logged the message, Master [19:00:57] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:17:13] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [19:20:11] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [19:26:44] !log tools -exec-03 drained, rebooting [19:26:49] Logged the message, Master [19:27:10] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [19:28:38] !log tools -exec-03 rebooted, requeing [19:28:42] Logged the message, Master [19:29:33] !log tools -exec-02 drained, rebooting [19:29:38] Logged the message, Master [19:30:08] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:50] !log tools draining -exec-04 [19:30:55] Logged the message, Master [19:32:13] !log tools repool -exec-02 [19:32:17] Logged the message, Master [19:33:38] hi, I'm trying to send mail to myself using eranroz at tools.wmflabs.org and I don't get mail. I can send myself mail using wikitech. What can be the problem? [19:34:29] There are a number of possible issues; could you open a phabricator ticket for it please? [19:35:25] Yes, though I'm actually more interested to know that if I send some other tool owner mail he gets it :) [19:40:29] eranroz: I'll be glad to look at the logs to figure out what happened in a little while, I'm in the middle of something atm. [19:40:32] 10Tool-Labs: Email forwarding problem in wmflabs - https://phabricator.wikimedia.org/T96184#1210225 (10eranroz) 3NEW a:3coren [19:40:57] ok np. It isn't urgent. thank you :) [19:41:51] !log tools disabling new jobs on remaining (exec) precise instances [19:41:56] Logged the message, Master [19:52:54] !log tools -exec-04 drained, rebooting.
[19:52:58] Logged the message, Master [19:56:04] !log tools -exec-04 repooled [19:56:08] Logged the message, Master [20:00:21] Coren: Are there logs of people who restarted a tool? [20:00:42] sjoerddebruin: which tool? [20:01:08] sjoerddebruin: Not really, as a rule, though normally only labs admins and the tool owner can do so. [20:01:29] T13|away: We (nlwiki) have an RC-bot, JelteBot. But the owner is not active, but the bot sometimes disappears and comes back after a few weeks. [20:01:52] So, I want to know if the owner (Jelte) restarted it or a labs admin did it [20:02:05] Possibly neither [20:02:06] sjoerddebruin: Lemme see if the owner has logged in at all. [20:02:56] It's possible the tool went down and came back up by a full labs restart as a byproduct [20:03:06] Yeah, that kind of stuff. [20:03:18] sjoerddebruin: I can tell you the maintainer hasn't been around since Apr 1 at least. Did the tool restart recently? [20:04:00] It disappeared on 16 Feb and returned on 31 March. [20:05:01] Also gone on 17 Jan, 28 Jan back. [20:05:32] Speaking of abandoned tools... Coren is there someone in the office that can access the consensus for that RfC? [20:05:38] I see no activity for jeltebot's maintainer since 2014-07-03 [20:05:38] The problem is that the owner doesn't respond on talk page messages or e-mails. [20:05:58] Coren, T13|away: can I copy your messages to another IRC channel? [20:06:23] T13|away: Maybe poke James Alexander? I think he's the best one to do so. [20:06:27] sjoerddebruin: Sure. [20:07:21] sjoerddebruin: As far as I can tell, the bot becomes unresponsive but because it has a start stanza in cron it will restart automatically whenever there is a labs outage. [20:07:39] sjoerddebruin: So there is a bug in the bot that is "accidentally" fixed when there is a problem. :-) [20:07:56] Hence the byproduct restart I mentioned [20:10:34] !log tools -exec-05 drained, rebooting.
[20:10:40] Logged the message, Master [20:10:46] Coren: hmm, so [20:10:50] stracing shows me [20:10:50] that [20:10:54] my client is hung on [20:10:54] recvfrom(3, [20:10:56] and server on [20:11:02] select(5, [4], [], [], {0, 500000}) = 0 (Timeout) [20:11:28] YuviPanda: Hm. Buffering is biting you. Your "ok" is buffered but not flushed. [20:11:45] shouldn’t the ‘close’ flush it? [20:11:54] * Coren /hates/ stdio-style implicit buffering. [20:12:10] Hm. It should. [20:12:14] But that may be on the client side. [20:12:39] I.e.: the "registered" is buffered but unsent, and the client just sits there waiting for a reply to what it hasn't yet sent. [20:13:06] sendto(3, "register\n.*\nhttp://tools-webgrid"..., 64, 0, NULL, 0) = 64 [20:13:27] No trailing newline? Aren't you using readline()? [20:13:43] Ah, truncated string. [20:13:44] Coren: there is, it’s just not showing it with strace [20:13:46] yeah [20:13:54] Hm. [20:14:23] Still try an explicit flush before the shutdown() [20:14:42] I don't know where that library does buffering and what makes it flush. [20:14:57] Coren: oh, no it doesn’t buffer on the client side [20:14:58] at all [20:15:02] It might be silly enough to try to flush on the close() -- *after* the shutdown(). :-) [20:15:42] !log tools repool -exec-05 [20:15:47] Logged the message, Master [20:17:12] Coren: Where can I find that RFC? [20:17:38] https://meta.wikimedia.org/wiki/Requests_for_comment/Abandoned_Labs_tools [20:17:43] thanks [20:22:28] Anyway, thanks for your help. Now I need to ask myself why I waited so long with this.... [20:35:33] lol [20:41:43] !log tools -exec-06 drained, rebooting [20:41:48] Logged the message, Master [20:43:48] !log tools -exec-06 requeued [20:43:52] Logged the message, Master [20:46:35] YuviPanda: Any luck? 
[20:46:42] Coren: yeah things seem ok :) [20:46:50] well registration at least [20:46:52] trying unregister now [20:47:24] !log tools -exec-07 drained, rebooting [20:47:28] Logged the message, Master [20:48:44] Coren: I’m going to set the epilog script now [20:49:03] !log tools -exec-07 repooled. [20:49:07] Logged the message, Master [20:54:58] !log tools -exec-10 drained, rebooting [20:55:02] Logged the message, Master [20:57:04] 6Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Disable idmap entirely on Labs Precise instances - https://phabricator.wikimedia.org/T95555#1210368 (10Ricordisamoa) [20:59:52] Coren: hmm, qacct -j seems to hang forever and then tell me job doesn’t exist... [21:03:32] YuviPanda: The log is getting pretty long and qacct actually does a linear search for the job. [21:03:42] and gives up at some point? [21:03:59] No, it should never give up. Are you /sure/ the job has been accounted? [21:04:33] oh wait [21:04:38] what does ‘accounted’ mean? [21:04:59] That the job has in fact ended, and that the master has had the time to collect the numbers (10-20s) [21:05:14] qacct too fast doesn't always work. :-) [21:05:15] ah, right. That doesn’t do what I think it does so ignore me [21:05:50] You might just be wanting 'qstat -j ' :-) [21:06:11] !log tools -exec-10 repooled [21:06:16] Logged the message, Master [21:09:26] !log tools -exec-09 drained, rebooting [21:11:37] !log tools -exec-09 repooled [21:17:39] YuviPanda: I need to go eat. ATM, -exec-08, -11, -13 and -14 are disabled and I'm waiting for them to finish draining. The other ones are all rebooted and enabled. Except for the slightly lower capacity, there should be no issue. 
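[editor's note] The buffering problem diagnosed above (the 'registered' reply sits in a userspace stdio-style buffer while the client blocks in recvfrom) and the proposed fix (an explicit flush before shutdown()) can be reproduced in miniature. socketpair() stands in for the real proxylistener connection; this illustrates only the flush/shutdown ordering, not the proxylistener code itself.

```python
import socket

# Two connected sockets standing in for proxylistener and its client.
server, client = socket.socketpair()

f = server.makefile("w")          # buffered text wrapper around the socket
f.write("registered\n")           # sits in the userspace buffer; at this
                                  # point the peer would block in recv()
f.flush()                         # explicit flush *before* shutdown, as
                                  # suggested in the discussion above
server.shutdown(socket.SHUT_WR)   # signal EOF only after the reply is out

reply = client.makefile("r").readline()
print(reply.strip())  # registered
```

Reversing the last two steps is exactly the failure mode described: the client waits for a reply that was written but never put on the wire.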
[21:17:48] Coren: \o/ cool [21:18:01] I should eat too, but let me make sure unregistration is totally working before I do that :) [21:18:58] YuviPanda: If you feel bored and notice one of them completely drained all that is needed at that point is a reboot followed by a qmod -e '*@tools-exec-XX' to reenable it. [21:19:06] cool! [21:19:08] But otherwise don't bother, I'll do it after dinner. [21:19:13] will do if I finish it off with this :) [21:25:55] Hi ops :) Help! It seems I need permissions to read logs at /var/logs/upstart on deployment-eventlogging02.eqiad.wmflabs. [21:26:30] madhuvishy: you should have sudo already I think [21:26:47] YuviPanda: says Permission denied. [21:26:53] for sudo? [21:26:57] no [21:27:14] https://www.irccloud.com/pastebin/ZotWUtfk [21:27:38] madhuvishy: use with sudo :) [21:28:11] YuviPanda: ah. sigh why did i not get that first. [21:28:17] okay that works [21:28:32] madhuvishy: :) [21:35:26] YuviPanda: the redis csv is done, but I haven't looked at it yet; it's ~450MB anyway [21:38:33] valhallasw`cloud: ah cool :) [21:40:09] !log tools -exec-11 drained, rebooting [21:40:46] Coren: btw, the epilog script works fine :) [21:41:15] Joy! [21:41:18] * Coren goes back to food [21:44:01] less mailspam! \o/ ;-) [22:00:15] !log tools -exec-08 and -exec-13 drained, rebooting [22:01:45] !log tools -exec-08 and -exec-13 repooled [22:07:48] 10Tool-Labs, 3Labs-Q4-Sprint-3, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make proxylistener not need to keep open socket connections open - https://phabricator.wikimedia.org/T96059#1210694 (10yuvipanda) \o/ I'll also add a monitoring script that checks periodically to ensure that everything is working ok. It... [22:08:23] YuviPanda: Only -14 left; ima give the task running there a few hours to wrap up - not much harm in leaving one of 13 nodes disabled for a while.
[22:09:17] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-3, 3ToolLabs-Goals-Q4: Do a rolling restart of Tool Labs precise instances - https://phabricator.wikimedia.org/T95557#1210698 (10coren) Most exec nodes done; webgrid hosts to come tomorrow. [22:09:49] Coren: \o/ cool [22:10:41] Ima check in on it later tonight. [22:10:43] o/ for now [22:11:02] Coren: \o/ night [22:13:23] 10Tool-Labs, 3Labs-Q4-Sprint-3, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make proxylistener not need to keep open socket connections open - https://phabricator.wikimedia.org/T96059#1210706 (10yuvipanda) This also means we can now balance our proxies by simply replicating redis info from one to the other :) [22:40:12] 10Tool-Labs: Email forwarding problem in wmflabs - https://phabricator.wikimedia.org/T96184#1210805 (10scfc) Do you have message IDs or times of mails that you sent? AFAICS, you are using Gmail for receiving mail. Did you send those tests with Gmail as well? In that case Gmail does hide them from you (cf. htt...