[03:14:34] (PS1) Tim Landscheidt: Add list-user-databases command [labs/toollabs] - https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231)
[03:26:21] Labs, CirrusSearch, Datasets-Archiving, Discovery, Labs-Infrastructure: Make available an XL labs instance with ~350GB available disk space. - https://phabricator.wikimedia.org/T108767#1588334 (Deskana) Open>Resolved a:Deskana As per T108766, https://suggesty.wmflabs.org has the enwiki...
[03:26:33] Labs, CirrusSearch, Datasets-Archiving, Discovery, Labs-Infrastructure: Make available an XL labs instance with ~350GB available disk space. - https://phabricator.wikimedia.org/T108767#1588341 (Deskana) p:Triage>Normal
[05:41:10] Labs, Tool-Labs, Upstream: Unable to explain queries on replicated databases - https://phabricator.wikimedia.org/T50875#1588398 (Ricordisamoa)
[05:45:38] Labs, Wikimedia-Extension-setup, wikitech.wikimedia.org, I18n, Patch-For-Review: Install Translate extension on wikitech - https://phabricator.wikimedia.org/T100313#1588400 (Nemo_bis) p:Triage>Normal
[07:20:19] Labs: access.log files are not being updated - https://phabricator.wikimedia.org/T110861#1588473 (Emijrp) NEW
[07:26:05] Labs, Tool-Labs, Upstream: Unable to explain queries on replicated databases - https://phabricator.wikimedia.org/T50875#1588480 (valhallasw)
[07:26:40] Labs, CirrusSearch, Datasets-Archiving, Discovery, Labs-Infrastructure: Make available an XL labs instance with ~350GB available disk space. - https://phabricator.wikimedia.org/T108767#1588485 (Nemo_bis) Resolved>Invalid Apparently there are [[https://wikitech.wikimedia.org/wiki/Nova_Reso...
[07:31:51] !log tools removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)
[07:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[07:32:52] Labs, Tool-Labs: access.log files are not being updated - https://phabricator.wikimedia.org/T110861#1588489 (valhallasw)
[07:38:59] Labs, Tool-Labs: access.log files are not being updated - https://phabricator.wikimedia.org/T110861#1588493 (valhallasw) Possibly. The error.log file descriptors are still functional, but access.log is not. ``` tools.gerrit-reviewer-bot@tools-webgrid-lighttpd-1402:~$ ls -l /proc/8835/fd (...) l-wx------...
[07:39:10] YuviPanda: ^ if you have time to do that...
[07:56:09] Labs, Tool-Labs: nginx puppet manifest requires nfs so error page cannot be updated over puppet - https://phabricator.wikimedia.org/T110836#1588496 (valhallasw) *nod*. In general, it's OK if parts of the puppet manifest fail, but in this case, it prevented `/etc/nginx/sites-enabled/proxy` from being updat...
[08:01:30] valhallasw`cloud: yeah I can take care of that shortly
[08:39:01] Labs, operations: labstore1002 not mounting all LVs after reboot - https://phabricator.wikimedia.org/T110832#1588588 (fgiunchedi) actionables: * `start-nfs` doesn't seem to have launched or checked `sync-exports` so bindmounts weren't present when nfs was first started * I couldn't find an equivalent `stop...
[12:20:11] hi, can somone help me with bigbrother?
[12:20:48] Labs, Tool-Labs, operations: labstore1003 alerting because of network saturation - https://phabricator.wikimedia.org/T110881#1588993 (ArielGlenn) NEW
[12:26:12] Steinsplitter: depends on the question
[12:27:05] valhallasw`cloud: i am not sure how this works (looks like som sort of deamontools). how to check if a specific job is running?
[12:27:48] qstat?
[12:28:28] maybe you should start with what you're trying to achieve...
[12:28:51] checking if a job is running with .bigbrother
[12:29:08] i can't find a documenatation how the bigbrother stuff works :/
[12:29:33] Steinsplitter: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Grid#Bigbrother
[12:30:07] oh, thanks :) <3
[13:18:07] Labs, Labs-Infrastructure, Labs-Sprint-111: Update remaining virt nodes to OpenStack Juno - https://phabricator.wikimedia.org/T110886#1589076 (Andrew) NEW a:Andrew
[13:18:29] Labs, Labs-Infrastructure, Labs-Sprint-111: Update Labs to OpenStack Juno - https://phabricator.wikimedia.org/T110047#1589084 (Andrew) duplicate>Open
[13:18:29] Labs, Labs-Infrastructure: Update Labs to OpenStack Kilo - https://phabricator.wikimedia.org/T110045#1589085 (Andrew)
[13:18:59] Labs, Labs-Infrastructure, Labs-Sprint-111, labs-sprint-112: Update remaining virt nodes to OpenStack Juno - https://phabricator.wikimedia.org/T110886#1589076 (Andrew)
[13:27:11] Labs, wikitech.wikimedia.org: Keystone tokens truncated when wikitech stores them - https://phabricator.wikimedia.org/T92014#1589103 (Andrew) Open>Resolved Pretty sure it worked, or at least helped.
[13:44:28] having more labs wierdness ... i can ssh into cirrus-browsser-bot.search.eqiad.wmflabs, but not estest1001.search.eqiad.wmflabs or estest1002.search.eqiad.wmflabs. ssh should be the same throughout the projet :S
[13:44:47] perhapsr recovering from nfs ... but the instances shouldn't have had nfs
[13:47:31] hi yalls
[13:47:36] i can't see any instances in labs console
[13:49:18] andrewbogott: ^
[13:49:20] ottomata: I have the same ishiew... Also can't ssh into my instance
[13:49:31] * AndyRussG waves
[13:49:37] \me waves back :)
[13:49:44] * ebernhardson uses the wrong /
[13:49:46] AndyRussG: didn’t we fix this issue last week?
[13:50:00] ottomata: what project are you looking at?
[13:50:00] andrewbogott: yeah it was fixed!
[13:50:09] now it's back...
[13:50:21] Before I didn't have any trouble ssh-ing in, now it says pubkey denied
[13:50:35] i'm still geting permission denied (public key) at estest100{1,2}.search.eqiad.wmflabs
[13:50:43] but oddly i can ssh into another machine in same project
[13:51:19] um, ok, one at a time...
[13:51:33] andrewbogott: services and analytics
[13:51:51] i also can't log into an instance that I created on friday
[13:53:32] ebernhardson: try estest1001 now please?
[13:53:55] andrewbogott: still perm denied
[13:54:18] it says 'input_userauth_request: invalid user ebernhardson'
[13:54:25] I’ll look further in a moment...
[13:54:39] how odd, thanks
[13:55:45] ottomata: I expect this is https://phabricator.wikimedia.org/T110887. Can you try logging out of wikitech and back in and see if I fixed it?
[13:55:48] well, ‘fixed'
[13:56:34] ebernhardson: which instance /can/ you log into?
[13:56:50] andrewbogott: cirrus-browser-bot.search.eqiad.wmflabs (the only other instance in search)
[13:57:06] ok
[13:57:16] err, it did this morning when i first tried
[13:57:21] now it says 'Connection closed by UNKNOWN'
[13:57:24] andrewbogott: ^
[13:57:30] try one more time please?
[13:57:46] andrewbogott: done
[13:57:50] cool, andrewbogott i can seen instances
[13:57:57] but can't log into my instance I created on friday
[13:57:59] oh, um, 1001 I mean.
[13:57:59] see*
[13:58:06] ottomata: that’s surely unrelated, but I’ll look
[13:58:10] k
[13:58:16] kafka-event-bus.services.eqiad.wmflabs
[13:59:17] ottomata: puppet is broken on that instance, which probably doesn’t help
[13:59:27] Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[openjdk-7-jdk] is already declared; cannot redeclare at /etc/puppet/manifests/role/zookeeper.pp:32 on node kafka-event-bus.services.eqiad.wmflabs
[14:00:38] hmmm, i had thought i had let it spawn up before I checked those boxes
[14:00:42] ok andrewbogott i'll delete it and create a new one
[14:00:57] hm, or maybe I can just fix.
[14:04:28] logging into cirrus-browser-bot now working, but still no on estest100{1,2}. fwiw Special:NovaInstance is also empty when 'search' project set as filter. login/logout dance doesn't make a difference. I can see instances in other non-search projects. so odd...
[14:05:55] having david try add/remove me from the project ...worked the other week to fix his wierd account
[14:16:04] ebernhardson: if I understand correctly, you should now be able to log into estest1001 but not estest1002. Can you confirm?
[14:16:29] andrewbogott: 1001 confirmed, sec
[14:16:41] andrewbogott: 1002 still denied, what was it?
[14:17:14] I still don’t know for sure. nslcd thinks that ldap isnt’ there (which is needed for account info.) When I restart it it comes up fine, no complaints.
[14:17:25] ho wodd
[14:17:45] I wonder if you created those instances right at the moment that I was upgrading the ldap cert. Created on Friday?
[14:18:06] oh, wait, that doesn’t explain it, I can reproduce with new instances today
[14:18:11] andrewbogott: nope, created these on the 20th
[14:18:25] andrewbogott: they are a self-hosted puppet master if that breaks things
[14:18:48] estest1001 is the master, 1002 talks to it
[14:20:14] ebernhardson: anyway, I bumped 1002 as well so you should be able to get on with things
[14:20:47] andrewbogott: looks to be working as well now. thanks a bunch!
[14:20:54] Labs: Logins fail on new instances - https://phabricator.wikimedia.org/T110891#1589216 (Andrew) NEW a:Andrew
[14:21:17] AndyRussG: are you still blocked by something?
[14:26:51] Labs: Logins fail on new instances - https://phabricator.wikimedia.org/T110891#1589250 (Andrew) Aug 31 14:23:53 nscld-test-1001 systemd[1]: Startup finished in 6.061s (kernel) + 1min 24.383s (userspace) = 1min 30.444s. Aug 31 14:24:34 nscld-test-1001 nslcd[1833]: [1d5ae9] ldap_start_tl...
[14:31:45] andrewbogott: yeah... still can't ssh in. I also still can't see instances or proxies (the latter is a recurrence) but that's not a blocker now, only the ssh access. Thx!!
[14:31:58] AndyRussG: can’t ssh in where?
[14:36:39] andrewbogott: central-notice-performance.fundraising.eqiad.wmflabs
[14:37:13] wikitech & labs user: andyrussg
[14:37:28] AndyRussG: try now?
[14:38:05] andrewbogott: "connection closed by UNKNOWN"
[14:38:31] “Failed publickey for andyrussg from 10.68.17.232 port 33599 ssh2: DSA 06:c0:f0:aa:e9:5d:25:ea:1b:f8:6b:79:21:ec:1a:12"
[14:38:39] I’ll look more after I sort out this other thing...
[14:39:07] K
[14:39:19] I didn't change any setup on my side or anything
[14:41:36] Are you shut out of all instances or just that one?
[14:41:40] e.g. can you reach the bastion?
[14:42:29] andrewbogott: deployment-eventlogging02.eqiad.wmflabs is working, lemme try more stuff
[14:42:48] How new is central-notice-performance?
[14:43:40] Labs, Tool-Labs: access.log files are not being updated - https://phabricator.wikimedia.org/T110861#1589311 (scfc) And looking at your work in the past months, you probably already have a command that lists all web service jobs started before `$time`? :-)
[14:43:42] andrewbogott: ssh bastion1.eqiad.wmflabs is all good
[14:44:29] andrewbogott: central-notice-performance is pretty new, I'm pretty sure it was just last week that I created it. But it was indeed working, I have it set up for what I need, just I need to tweak it here and there from time to time as I do different tests
[14:45:00] what project is it in?
[14:46:08] andrewbogott: fundraising
[14:58:08] Labs, Labs-Other-Projects, operations: labstore1003 alerting because of network saturation - https://phabricator.wikimedia.org/T110881#1589345 (scfc)
[15:00:24] Labs, Labs-Other-Projects, operations: labstore1003 alerting because of network saturation - https://phabricator.wikimedia.org/T110881#1589354 (scfc) (The host `dumps-3.dumps.eqiad.wmflabs` is part of the [[https://wikitech.wikimedia.org/wiki/Nova_Resource:Dumps|Dumps project]] and not related to #Tool...
[15:01:34] Labs, wikitech.wikimedia.org: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1589363 (Krenair) a:Andrew
[15:57:16] (PS1) Krinkle: Send performance/* repo activity to #wikimedia-perf [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/235028
[15:58:16] YuviPanda: Can you help me deploy ^ ?
[16:07:47] (PS1) Tim Landscheidt: Query proxy for list of active entries [labs/toollabs] - https://gerrit.wikimedia.org/r/235030 (https://phabricator.wikimedia.org/T93197)
[16:09:33] (CR) Tim Landscheidt: [C: 2 V: 2] "I tested this live on tools.wmflabs.org." [labs/toollabs] - https://gerrit.wikimedia.org/r/235030 (https://phabricator.wikimedia.org/T93197) (owner: Tim Landscheidt)
[16:13:38] Labs, Tool-Labs, Labs-Q4-Sprint-3, Patch-For-Review, ToolLabs-Goals-Q4: Get rid of portgranter - https://phabricator.wikimedia.org/T93046#1589581 (scfc)
[16:13:40] Labs, Tool-Labs, Patch-For-Review: Make list.php not rely on portgranter - https://phabricator.wikimedia.org/T93197#1589580 (scfc) Open>Resolved
[16:33:06] Labs, operations, wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1589617 (jcrespo) mw1125 + mw1142 were depooled by @fgiunchedi just some minutes ago with the same kind of error: ``` Memcached error for key "enwiki:messages:en:status" on server "/...
[16:46:12] Labs, operations, wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1589655 (fgiunchedi) on the nutcracker side, the logs were being spammed by errors: (mw1142) ``` [2015-08-31 15:02:00.428] nc_response.c:159 filter stray rsp 1553464878 len 41 on s 87...
[16:52:01] !topic Wikimedia Labs | Status: new instances need a push to get off the ground, ask andrewbogott or yuvipanda for help | Channel is logged: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/
[16:59:26] Hi! I'm trying to run a query on enwiki db that fetches pages with most revisions. There's a single join between the revision and page table. I've let it run over an hour a few times now but it doesn't seem to work. Is there a hard time limit after which it stops executing?
[16:59:35] It runs fine on smaller wikis.
[17:00:12] Niharika, that query will be huge
[17:00:27] why not use a dump?
[17:06:20] (PS1) Rush: ssh: Add Chase (Ops) to root-authorized-keys [labs/private] - https://gerrit.wikimedia.org/r/235040
[17:08:11] Niharika: so what I've usually found is to 1. make that query in quarry (quarry.wmflabs.org), 2. find someone in the research team to look at it and tell you how to optimize it :)
[17:09:27] (CR) Yuvipanda: [C: 2 V: 2] ssh: Add Chase (Ops) to root-authorized-keys [labs/private] - https://gerrit.wikimedia.org/r/235040 (owner: Rush)
[17:09:30] chasemp: done!
[17:09:36] tx
[17:20:04] ebernhardson: Were you able to log into 1001 until recently, or did it never work?
[17:22:50] andrewbogott: today was first issue, i last logged in on thursday and logged in many times over last week
[17:23:02] ebernhardson: cool, good to know
[17:25:14] ebernhardson: can we reboot one of your vm's that has been affected and see if it happens again?
[17:26:20] chasemp: sure, not doing anything on them at this moment
[17:32:03] Labs, Labs-Sprint-107, Labs-Sprint-108, Labs-Sprint-109, and 2 others: Evaluate kubernetes for use on Tool Labs - https://phabricator.wikimedia.org/T107993#1509843 (yuvipanda)
[17:32:19] Labs, Labs-Sprint-107, Labs-Sprint-108, labs-sprint-112: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1589804 (yuvipanda)
[17:34:07] Who handles hiera stuff in OpenStackManager? https://translatewiki.net/wiki/Thread:Support/About_MediaWiki:Right-editallhiera/en
[17:34:32] Labs, Labs-sprint-112, Patch-For-Review: Logins fail on new instances - https://phabricator.wikimedia.org/T110891#1589822 (Andrew)
[17:36:16] andrewbogott: ^ a question you likely know the answer to?
[17:36:43] I’m not sure I understand the question...
[17:36:50] But also I’m in back-to-back meetings, sorry
[17:44:00] Labs, wikitech.wikimedia.org, Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1589851 (Andrew)
[17:44:27] Labs, Labs-Infrastructure, Labs-sprint-112: Update Labs to OpenStack Kilo - https://phabricator.wikimedia.org/T110045#1589853 (Andrew)
[17:47:31] Labs, Labs-sprint-112: Update openstack docs for new command-line format - https://phabricator.wikimedia.org/T110912#1589864 (Andrew) NEW a:Andrew
[17:50:22] Labs, Incident-20150617-LabsNFSOutage, Labs-Sprint-102, Labs-Sprint-103: Labs: rewrite remaining labstore* scripts - https://phabricator.wikimedia.org/T102520#1589893 (yuvipanda) maintain-replicas is gone, but sync-exports isnt'.
[17:50:32] Labs, Incident-20150617-LabsNFSOutage, Labs-Sprint-102, Labs-Sprint-103, Labs-sprint-112: Labs: rewrite remaining labstore* scripts - https://phabricator.wikimedia.org/T102520#1589894 (yuvipanda)
[18:48:54] YuviPanda: Sorry, I vanished after asking that...the query is being made by a bot, so I'm not sure how I can use quarry. And yeah, I'll poke someone is research if I can't figure it out in a day. :)
[18:49:15] Niharika: or pastebin or anything. (2) is more important
[18:49:34] Right.
[18:51:25] I'm curious how quarry runs it faster though. :)
[19:16:22] Labs, operations, wikitech.wikimedia.org, Patch-For-Review: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1590336 (Krenair) Open>Resolved a:Krenair
[19:19:38] Labs, CirrusSearch, Discovery, wikitech.wikimedia.org, Wikimedia-log-errors: Wikitech CirrusSearch jobs throwing exceptions on silver - https://phabricator.wikimedia.org/T110635#1590354 (JanZerebecki) Will probably be fixed by: https://gerrit.wikimedia.org/r/#/c/235048
[19:31:42] !log tools https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues
[19:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[19:32:10] valhallasw`cloud: whoops, sorry :'(
[19:32:15] thanks for taking care of that
[19:35:25] !log tools one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?
[19:35:28] ^ YuviPanda
[19:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[19:35:51] not sure why uwsgi is more problematic? maybe webgrid-generic hosts are overloaded?
[19:36:03] valhallasw`cloud: hmm, I added a new node...
[19:36:10] !log tools last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs
[19:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[19:36:26] still, it's just those, which is a bit odd
[19:36:32] hmm
[19:36:38] the monitor should restart them anyway
[19:36:39] no, sorry, that's not true
[19:36:52] I restarted a lot of uwsgi jobs because the hosts are alphabetic of course
[19:36:58] I should randomize which jobs I kill :P
[19:37:01] ok, first shower
[19:37:11] ah :)
[19:43:36] Labs, Labs-Infrastructure, Continuous-Integration-Isolation: Labs project admin can not create per project image on Horizon - https://phabricator.wikimedia.org/T110936#1590423 (hashar) NEW
[19:58:28] Labs, operations, Labs-sprint-112: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1590473 (yuvipanda)
[20:20:36] Labs, Tool-Labs, Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1590522 (scfc) NEW
[20:21:35] Labs, Tool-Labs, Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1590536 (scfc)
[20:21:37] Labs, Tool-Labs, Patch-For-Review: Create a utility that dumps all databases of a user - https://phabricator.wikimedia.org/T91231#1590535 (scfc)
[20:21:39] !log tools now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues
[20:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[20:23:20] !log tools doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh*
[20:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[20:25:26] !log tools ca 500 jobs @ 5s/job = approx 40 minutes
[20:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[20:25:51] Labs, Tool-Labs: access.log files are not being updated - https://phabricator.wikimedia.org/T110861#1590559 (valhallasw) Sort of. ``` qstat -f -xml | grep 'tools-webgrid' | sed -e 's/.*@//' | sed -e 's/<.*//' > webgrid_hosts qhost -j -h `cat webgrid_hosts` |sed -e 's/^\s*//' | cut -d ' ' -f 1|egrep ^[0-...
[20:29:51] !log tools |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.
[20:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[20:40:10] YuviPanda: the slower & semi-randomized rescheduling seems to work better
[20:40:24] so it's indeed that SGE can't cope with large sets of rescheduled jobs
[20:40:25] gah.
[20:40:39] Not too surprised...
[20:40:46] What do you mean by can't cope BTW?
[20:40:52] Does it just not schedule them?
[20:40:55] YuviPanda: 'jobs randomly die'
[20:41:03] Ah
[20:41:09] I don't know what happens exactly, but it shows code 25 = rescheduling, then code 100 = died
[20:41:18] Bio
[20:41:20] Boo
[20:41:50] but maybe it's actually the post-job script that takes too long or something like that
[20:42:58] although it happened with non-webgrid jobs earlier, so.. dunno.
[20:43:21] Hmm
[20:47:27] YuviPanda: there are a few /other/ jobs continuously dieing though, so I'm guessing those people have a cronjob to restart the jobs every few minutes...
[20:48:17] Webwatcherm
[20:48:18] ?
[20:48:43] (I'm eating out ATM)
[20:48:44] no, just */5 * * * * jsub
[21:02:16] Ugh
[21:06:25] YuviPanda: ugggh!
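The throttled rescheduling discussed above (one `qmod -rj` per job, a fixed pause between jobs, and a shuffled order so that consecutive restarts are less likely to hit the same host) can be sketched in Python. This is a hypothetical illustration, not the script used in the channel: the function names, the seed and the job IDs in the demo are made up, and only `qmod -rj` and the "500 jobs at 5 s" figures come from the log. The demo prints the commands instead of running them.

```python
import random
import time
from typing import Callable, Iterable, List, Sequence


def estimate_minutes(job_count: int, interval_s: float) -> float:
    """Minutes needed to handle job_count jobs at one job per interval_s seconds."""
    return job_count * interval_s / 60.0


def plan_reschedule(job_ids: Iterable[int], seed: int = 0) -> List[List[str]]:
    """Build one `qmod -rj <id>` command per job, in shuffled order.

    The seed is an illustrative assumption; it only makes the order repeatable.
    """
    ids = list(job_ids)
    random.Random(seed).shuffle(ids)
    return [["qmod", "-rj", str(job_id)] for job_id in ids]


def run_plan(
    commands: Sequence[List[str]],
    interval_s: float,
    execute: Callable[[List[str]], None],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Pass each command to execute(), pausing interval_s between commands."""
    for index, command in enumerate(commands):
        if index:  # no pause before the first command
            sleep(interval_s)
        execute(command)


if __name__ == "__main__":
    # Dry run with made-up job IDs: print the commands, skip the real pauses.
    demo_plan = plan_reschedule(range(1423000, 1423005))
    run_plan(demo_plan, interval_s=5, execute=print, sleep=lambda _s: None)
    print("500 jobs at 5 s each: %.1f minutes" % estimate_minutes(500, 5))
```

To run it for real, `execute` could be something like `subprocess.check_call`, on a host where `qmod` is available. The estimate for 500 jobs at 5 s each comes to roughly 41.7 minutes, in line with the "approx 40 minutes" logged above.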
[21:06:31] lots of job failed errors
[21:06:38] ut file:08/31/2015 21:05:59 [52425:24887]: can't stat() "/data/project/serviceawards/error.log" as stdout_pa
[21:06:57] Ugh. But NFS is mounted..
[21:07:07] I'll go back home now I'll be there shortly.
[21:07:21] maybe not on tools-webgrid-lighttpd-1210.eqiad.wmflabs
[21:07:34] no, seems to work
[21:07:35] weird
[21:10:44] !log tools some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again
[21:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:11:33] YuviPanda: are you sure the manifest watcher is running?
[21:12:08] I don't know. Puppet should have bought it up
[21:12:15] I'm still walking home. Have no laptop atm
[21:12:21] ok
[21:15:05] YuviPanda: oh goddamn it
[21:15:09] lots of webservices did not come back up
[21:15:29] !log tools several webservices seem to actually have not gotten back online?! what on earth is going on.
[21:15:33] valhallasw`cloud: check the tools-services hosts
[21:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:15:46] To see if they have the web servicemonitor running?
[21:15:52] (Eta 3mins)
[21:17:04] tools.a+ 10242 0.5 0.4 39772 9988 ? S 21:16 0:00 /usr/bin/python /usr/local/bin/webservice --release precise lighttpd restart
[21:17:54] nothing else, though
[21:18:14] !log tools running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running
[21:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:19:49] !log tools seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
[21:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:20:12] !log tools restarted webservicemonitor
[21:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:21:16] !log tools webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest)
[21:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:21:47] valhallasw`cloud: am at laptop now
[21:21:51] sshing in
[21:22:05] YuviPanda: I don't know webservicemonitor, but it's definitely not doing its job at the moment. Not sure why.
[21:22:40] valhallasw`cloud: where are the logs again?
[21:22:51] YuviPanda: /var/log/upstart/webservicemonitor.log
[21:23:31] it did restart gerrit-reviewer-bot's webservice at some point
[21:23:35] valhallasw`cloud: I think I fixed it
[21:23:39] let's see
[21:23:40] so I dunno, maybe it's just backgroud noise?
[21:23:59] valhallasw`cloud: crosswatch was using webservice-new which put in a type: generic which is what that error was from
[21:24:22] subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
[21:24:23] ^ still happening
[21:24:30] actually no
[21:24:39] subprocess.TimeoutExpired: Command '['/usr/bin/sudo', '-i', '-u', 'tools.audetools', '/usr/local/bin/webservice', '--release', 'precise', 'lighttpd', 'restart']' timed out after 15 seconds
[21:24:44] that's unrelated to the generic stuff
[21:24:55] in fact the generic stuff isn't causing it to 'fail'
[21:24:56] oh, ok
[21:24:58] it just goes on
[21:25:13] and valhallasw`cloud for example, PermissionError: [Errno 13] Permission denied: '/data/project/dawikitool/service.log'
[21:25:33] yeah that's just weird
[21:25:54] but it does seem the last few are just noise
[21:26:04] Labs, Labs-sprint-112: Restore some files from /home/gwicke - https://phabricator.wikimedia.org/T110698#1590717 (yuvipanda)
[21:26:38] valhallasw`cloud: I removed javatest one too
[21:26:59] we should maybe have monitoring for continuously failing webservice jobs?
[21:27:47] valhallasw`cloud: or throttling, I guess.
[21:27:50] yes, we should...
[21:28:01] seems to be just mjbmr and audetools
[21:29:43] YuviPanda: but, er, the result of this restart round is even more disturbing. Jobs disappearing is a Bad Thing (TM)
[21:29:48] not even a status 100 exit
[21:30:16] valhallasw`cloud: so all webservices are back up?
[21:31:05] also, I don't trust GridEngine one bit, but I"m also helpless in the sense I don't really know what to do about that.
[21:31:07] YuviPanda: yeah, javatest and mjmbr were already broken much earlier. Audetools seems to be a result of the rescheduling
[21:31:13] right
[21:31:17] so that's ok, I guess
[21:38:34] YuviPanda: it restarted 350 jobs...
[21:38:41] Labs, Tool-Labs: continuous jobs killed during restart despite rescheduling - https://phabricator.wikimedia.org/T109362#1590763 (valhallasw) With the rescheduling of all webservices today, we got a few more data points. A few jobs died during rescheduling, according to the accounting log: ``` 100 2015-08...
[21:38:49] valhallasw`cloud: how many were there to begin with?
[21:38:54] valhallasw`cloud: sometimes it might restart them in two runs
[21:38:56] 550
[21:39:06] YuviPanda: I didn't kill the jobs, I rescheduled them
[21:39:18] ...
[21:39:18] so they shouldn't have to be restarted by webservicewatcher
[21:39:22] ah I see
[21:39:26] so SGE lost 350 jobs
[21:39:32] yep.
[21:39:33] of the 550.
[21:39:40] wonderful
[21:39:43] I'm going to cry in a corner now.
[21:40:11] so the 'does a pretty good job at rescheduling' can be removed from the comparison, I think.
[21:42:08] yeah...
[21:42:09] sigh
[21:43:23] valhallasw`cloud: re-adjusted
[21:43:36] thanks
[21:43:38] * aude sad :(
[21:45:16] valhallasw`cloud: I've been hitting the sample php app with httperf
[21:45:27] valhallasw`cloud: so at some point it stops responding because it's getting too much traffic, but when it stops it just recovers
[21:45:33] that's good
[21:45:34] so I guess that/'s good
[21:45:38] have you tried rescheduling it? :-p
[21:45:39] I'm going to integrate a health check into it
[21:45:49] valhallasw`cloud: I'm going to randomly reboot instances next
[21:45:51] and see what happens
[21:45:55] :-)
[21:45:59] after scheduling a hundred of these ofc
[21:46:10] valhallasw`cloud: I think with a health check it should get restarted when enough processes deadlock
[21:46:24] (Remember 1% of these requests go into an infinte loop)
[21:46:45] anyway, I gotta go now...
[21:46:46] *nod*
[21:46:48] valhallasw`cloud: <3 thank you
[21:46:58] yw
[21:47:55] valhallasw`cloud: any other tests we should be doing, btw?
[21:48:04] I don't know
[21:48:11] * valhallasw`cloud is going to bed
[21:48:13] ok
[21:48:16] valhallasw`cloud: night!
[21:48:17] i should too
[22:50:07] Damianz: Redis relay has been borked for a while :(
[22:52:58] Labs, Tool-Labs: tools-webgrid-generic-1405 is unaccessible by ssh - https://phabricator.wikimedia.org/T110965#1590987 (scfc) NEW
[22:55:36] andrewbogott: ^
[22:55:46] Can you give that instance a kick?
[22:58:55] Labs, Tool-Labs: tools-webgrid-generic-1405 is unaccessible by ssh - https://phabricator.wikimedia.org/T110965#1591054 (Andrew) ...is that better?
[23:01:07] YuviPanda, i want to bulk-edit just under 200 phab tasks, to add a project. I've noticed that wikibugs tends to get kicked for flooding when that kind of thing happens. Is there a recommended solution? Or should I just wait until late at night to do so?
[23:03:59] Labs, Tool-Labs: tools-webgrid-generic-1405 is unaccessible by ssh - https://phabricator.wikimedia.org/T110965#1591080 (scfc) Open>Resolved a:Andrew
[23:04:16] Labs, Tool-Labs: tools-webgrid-generic-1405 is unaccessible by ssh - https://phabricator.wikimedia.org/T110965#1590987 (scfc) Yes, I can now access the host from outside.
[23:29:32] Labs, Tool-Labs: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591172 (scfc) p:Lowest>High We have run into this problem a couple of times lately, and I'd like to get this out of the way rather sooner than later.
[23:29:41] Labs, Tool-Labs: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591174 (scfc) a:scfc
[23:46:37] Labs, Labs-sprint-112, Patch-For-Review: Logins fail on new instances - https://phabricator.wikimedia.org/T110891#1591242 (Andrew) https://gerrit.wikimedia.org/r/235142 looks to have fixed Jessie. I've also forced a restart of nslcd everywhere that salt can reach.
[23:57:22] Labs, Tool-Labs: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591260 (scfc) For `tools-webgrid-generic-1404`: ``` qmod -d webgrid-generic\@tools-webgrid-generic-1404.eqiad.wmflabs qconf -mq webgrid-generic qmod -rj 1766173 499843 499859 qconf -de tools-webgrid-g...