[00:12:20] 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591286 (10scfc) Plan for after the change gets merged: 1. On `tools-master`, `sudo service gridengine-master restart`. This should be safe and not cause any loss of data. 2. On `to... [00:18:46] (03PS1) 10Krinkle: app: Fix edit summary bug - content after section should not be dropped [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/235158 [00:19:24] (03PS2) 10Krinkle: app: Fix edit summary bug - content after section should not be dropped [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/235158 [00:19:36] (03CR) 10Krinkle: [C: 032] app: Fix edit summary bug - content after section should not be dropped [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/235158 (owner: 10Krinkle) [00:20:31] 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591296 (10scfc) Preparation work: Disabled all queues on `tools-exec-1218`, `tools-exec-1401`, `tools-webgrid-generic-1401`, `tools-webgrid-lighttpd-1201` and `tools-webgrid-lighttpd... [00:20:41] (03CR) 10Krinkle: [V: 032] "TODO: Set up lint test." [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/235158 (owner: 10Krinkle) [00:24:47] 6Labs, 3Labs-sprint-112, 5Patch-For-Review: Logins fail on new instances - https://phabricator.wikimedia.org/T110891#1591302 (10Andrew) 5Open>3Resolved yep, fixed on Trusty as well. New images are live now. [00:24:51] (03CR) 10Jforrester: [C: 032] Send performance/* repo activity to #wikimedia-perf [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235028 (owner: 10Krinkle) [00:26:53] (03Merged) 10jenkins-bot: Send performance/* repo activity to #wikimedia-perf [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235028 (owner: 10Krinkle) [00:28:49] Hmm. [00:34:12] I can submit jobs [00:34:16] Is it okay? 
[00:34:18] *can't [00:35:32] Coren, valhallasw`cloud [00:45:33] 6Labs, 10wikitech.wikimedia.org, 7Database: SemanticMediaWiki tries to create temporary tables, but can't as wikiuser is restricted - https://phabricator.wikimedia.org/T110981#1591340 (10Krenair) 3NEW [00:47:12] Krenair: SMW moved to github, the project was archived. [00:47:25] * Krenair grumbles [00:48:09] (03PS1) 10Jforrester: Follow-up ad0675b8: Use performance.* as the regex instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235161 [00:50:47] (03CR) 10Krinkle: [C: 031] Follow-up ad0675b8: Use performance.* as the regex instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235161 (owner: 10Jforrester) [00:50:57] (03CR) 10Jforrester: [C: 032] Follow-up ad0675b8: Use performance.* as the regex instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235161 (owner: 10Jforrester) [00:51:00] (03Merged) 10jenkins-bot: Follow-up ad0675b8: Use performance.* as the regex instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235161 (owner: 10Jforrester) [01:00:46] tools-bastion-02 aka tools-dev is in a strange state, seems to be firewalled from the rest of the network. 
qstat says: [01:00:49] error: denied: host "tools-bastion-02.eqiad.wmflabs" is neither submit nor admin host [01:01:08] and "telnet tools-redis 6137" (which works on tools-login): [01:01:15] telnet: Unable to connect to remote host: Connection refused [01:12:52] !log tools.lolrrit-wm Re-restarting grrrit-wm rolled back to 2f5de55ff75c3c268decfda7442dcdd62df0a42d [01:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [01:18:08] 6Labs, 10Tool-Labs: tools-bastion-02 (aka tools-dev) can't submit grid jobs - https://phabricator.wikimedia.org/T110982#1591388 (10Sitic) 3NEW [01:18:38] ^^ never mind that telnet tools-redis thing, that was just a bad port in my bash history [01:31:39] 6Labs, 10Tool-Labs: tools-bastion-02 (aka tools-dev) can't submit grid jobs - https://phabricator.wikimedia.org/T110982#1591405 (10scfc) 5Open>3Resolved a:3scfc Sorry, as part of T109485 I had removed the host as a submit host without realizing that people actually do work there :-). I have re-added it. [01:34:05] (03PS1) 10BBlack: empty uni.wm.o key for compiler testing [labs/private] - 10https://gerrit.wikimedia.org/r/235172 [01:34:07] 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591414 (10scfc) I have readded `tools-bastion-02` as a submit host because people are actually using it (cf. T110982). So the actual switch of the host name would be done between 1.... [01:34:17] (03CR) 10BBlack: [C: 032 V: 032] empty uni.wm.o key for compiler testing [labs/private] - 10https://gerrit.wikimedia.org/r/235172 (owner: 10BBlack) [03:02:41] 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591475 (10scfc) After the number of pending jobs grew, I undid the preparation work by: ``` scfc@tools-bastion-01:~$ for host in tools-exec-1218 tools-exec-1401 tools-webgrid-generi... 
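Editor's note: the "neither submit nor admin host" error above means the bastion had been dropped from gridengine's submit-host list (`qconf -ss` lists submit hosts; `qconf -as <host>` re-adds one on the master, which is effectively the fix scfc applied). A runnable sketch of the check follows; no gridengine master is assumed here, so the host list is simulated and the real commands are left in comments.

```shell
# Simulated submit-host check; on a real grid the list would come from
# `qconf -ss`, and the admin-side fix is `qconf -as <host>` on the master.
submit_hosts='tools-login.eqiad.wmflabs'            # illustrative list
host='tools-bastion-02.eqiad.wmflabs'               # host from the log

if ! printf '%s\n' "$submit_hosts" | grep -qx "$host"; then
    # This is the condition behind the qstat error seen on tools-dev:
    echo "error: denied: host \"$host\" is neither submit nor admin host"
fi
```

The check is a plain membership test: gridengine refuses `qsub`/`qstat` submissions from any host not in that list, regardless of the host's other configuration.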
[03:03:18] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org: MWEchoNotificationEmailBundleJob causes exceptions due to delays not being supported by non-redis job queues - https://phabricator.wikimedia.org/T110985#1591476 (10Krenair) 3NEW a:3Krenair [03:09:41] 6Labs, 10CirrusSearch, 6Discovery, 10wikitech.wikimedia.org, and 2 others: Wikitech CirrusSearch jobs throwing exceptions on silver - https://phabricator.wikimedia.org/T110635#1591488 (10Krenair) a:3JanZerebecki [03:09:55] 6Labs, 10CirrusSearch, 6Discovery, 10wikitech.wikimedia.org, and 2 others: Wikitech CirrusSearch jobs throwing exceptions on silver - https://phabricator.wikimedia.org/T110635#1591489 (10Krenair) 5Open>3Resolved Looks like that did the trick. [03:23:03] 6Labs, 6operations, 10wikitech.wikimedia.org: Determine whether wikitech should really depend on production search cluster - https://phabricator.wikimedia.org/T110987#1591503 (10Krenair) 3NEW [05:46:32] hey laboraticians, on a labs instance without any evil NFS mounts, is any partition backed up? [05:47:14] developer-doc-devhub.developer-doc.eqiad.wmflabs has /dev/vda1 mounted on /, is that backed up? [06:01:55] YuviPanda: hi. I newly maintain "enwp10" on toolabs. Unfortunately it does not run anymore https://tools.wmflabs.org/enwp10/cgi-bin/list2.fcgi?run=yes&projecta=Skepticism&importance=Unknown-Class&quality=Unassessed-Class [06:01:59] spagewmf: no, i don't think SSSI [06:02:19] YuviPanda: wanted to restart the webservice... but this fails [06:02:23] webservice start [06:02:23] Starting web service..............................Timeout: could not start job in 30stools.enwp10@tools-bastion-01:~$ [06:03:12] YuviPanda: I'm a newby on toolabs and it looks like something fundamental does not work, but I don't know what. Any idea? [06:03:32] Kelson42: there were earlier issues with the grid -- looking now. Yuvi is probably still asleep (european time zone) [06:04:22] valhallasw`cloud: ok, thx for the feedback. 
hopefully he can answer later, when he is awake ;) [06:06:19] !log tools investigating SGE issues reported on irc/email [06:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [06:06:47] !log tools test job does not get submitted because all queues are overloaded?! [06:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [06:07:59] !log tools e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?! [06:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [06:09:25] uhh what on earth? [06:10:28] my giftbot queue is also overloaded, it seems [06:17:32] !log tools going to restart sge_qmaster, hoping this solves the issue :/ [06:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [06:18:44] 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1591712 (10valhallasw) p:5Triage>3Unbreak! [06:22:17] fixed my issue [06:23:52] !log tools seems to have worked. SGE :( [06:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [06:23:59] 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1591727 (10valhallasw) p:5Unbreak!>3High [06:26:04] 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1591729 (10valhallasw) Restarting gridengine master seems to have helped -- but what happened to get the queues in this state?! More issues from the earlier NFS outage?... 
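Editor's note: the logged figure `np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4)` looks like it shouldn't add up (0.07 + 0.50 × 14 = 7.07), but it does once the per-job load adjustment is normalized by `nproc`, which the parenthetical elides. A sketch of the arithmetic, with the 1.75 threshold taken from the log and the /nproc normalization inferred from the quoted numbers:

```shell
# Reproduce gridengine's adjusted load from the values quoted in the log:
#   raw np_load_avg = 0.07, job_load_adjustments = 0.50 per recent job,
#   14 recently started jobs, nproc = 4.
awk 'BEGIN {
    adjusted = 0.07 + 0.50 * 14 / 4      # 0.07 + 1.75 = 1.82
    printf "np_load_avg=%.2f\n", adjusted
    # The queue is dropped because 1.82 >= the 1.75 threshold,
    # even though the real load average was only ~0.3.
}'
# prints: np_load_avg=1.82
```

This also matches the second sample later in the log (0.06 + 0.50 × 14 / 4 = 1.81), which supports the reading that the phantom "14 recently started jobs" term, not actual load, pushed every queue over the threshold.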
[07:29:09] Kelson42: webservice start should be OK now [09:30:12] 6Labs, 10Labs-Infrastructure, 5Continuous-Integration-Isolation: Include Base::Standard-packages in labs images - https://phabricator.wikimedia.org/T94995#1592058 (10hashar) 5Open>3declined a:3hashar From T110735 , we now only apply a subset of `operations/puppet` since lot of parts are not easily appl... [09:35:45] 6Labs, 10Labs-Infrastructure, 5Continuous-Integration-Isolation: Investigate non blocking fs resizing when instance is booted - https://phabricator.wikimedia.org/T104974#1592076 (10hashar) 5Open>3Resolved a:3hashar I have filled this tasks for instances booted from #labs images. The dib images using Je... [10:10:29] I'm getting a puppet failure like this when applying a couple of classes to an existing and working instance (with self hosted puppet master) [10:10:32] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: No matching value for selector param '(undefined)' at /etc/puppet/modules/ldap/manifests/role/config.pp:8 on node test-cassandra5.monitoring.eqiad.wmflabs [10:10:36] seen this before? [10:10:50] godog: new instance? [10:11:12] grrrit-wm: if so, restart nslcd and try again [10:11:57] YuviPanda: yeah new instance [10:12:37] no change after 'service nslcd restart' [10:13:28] what I'm puzzled about is that I can go from an instance with a working puppet + self-hosted master to a non working one with the above by applying two classes [10:14:07] thanks anyway [10:14:45] :( [10:14:55] andrewbogott: has been investigating these a lot more than I have, he might have more info [10:16:30] ack, is it tracked in phab btw? 
[10:16:54] godog: think so, let me find a link [10:17:36] godog: hmm, https://phabricator.wikimedia.org/T110891 says it was fixed [10:18:07] godog: I can take a look if you'd like [10:18:50] godog: hmm, looks like $site is undefined for some reason [10:19:03] YuviPanda: yeah, sure test-cassandra5.eqiad.wmflabs [10:21:14] godog: hmm, so [10:21:16] eth0 is inet addr:127.0.0.2 Bcast:0.0.0.0 Mask:255.255.255.255 [10:21:25] godog: which is the underlying problem, I think [10:21:52] no idea how that happened [10:21:55] it is supposed to get 10.xxx [10:22:53] ah yeah that's from the classes I've applied, could be that [10:23:48] YuviPanda: for some reason I'm being asked a sudo password and logging in as root doesn't seem to work, can you try removing those two? [10:24:51] YuviPanda: nevermind I got it [10:25:32] godog: ok! [10:25:44] godog: so the classes you applied set eth0 IP? [10:25:53] godog: so $::site is determined from the IP range [10:26:03] godog: so not using a 10.68 range will mess up $::site causing problems [10:27:46] ah I see [10:27:57] godog: this is in manifests/realm.pp [10:28:08] the LDAP thing was a red herring, sorry about that [10:28:13] but yeah, realm.pp has the IP ranges [10:29:11] sigh, I also misread the error message, grepping puppet for '(undefined)' would have found it [10:29:26] sorry I'm really frustrated by puppet's failure modes [10:29:46] '(undefined)' is maybe not the best there, it should perhaps be undef [10:29:58] but I am worried about changing it, what if there was an actual reason it was '(undefined)' rather than undef [10:30:52] possibly yeah, setting site based on the ip address doesn't seem ideal, I would have expected to be set when provisioning a machine and puppet reads it [10:34:09] godog: yeah... it could maybe come from the fqdn [10:36:52] heh, from experience making decisions based on the hostname is as fragile unfortunately [10:37:22] I was thinking more of a regular file e.g. 
/etc/wikimedia/site [10:37:40] where would that be populated? [10:37:42] by the image? [10:37:52] when first provisioning the image yeah [10:38:49] IMO in general puppet shouldn't decide/detect over things that are not supposed to change over the lifespan of a machine [10:39:24] (rambling, not an actual plan/idea) [10:40:11] yeah, I agree [10:40:16] site, realm... [10:40:17] project [10:41:12] addshore: is the category aspect on in the beta cluster? [10:41:24] nope [10:41:33] k [10:42:10] spent over an hour now trying to reproduce this 1 thing [10:42:22] but everything is happening as I would expect [10:42:52] * sDrewth shrugs, if it helps I was using touch.py [10:43:03] :D [10:43:16] yeh, well, that's the next step, until now I was just trying to reproduce it manually [10:43:23] guess now I will try with touch.py itself [10:43:45] if that doesn't work I can only guess there is something else interacting with it all! [10:44:21] there was a lot going on at the time with category stuff [11:35:30] sDrewth: were you touching all of the pages <<< valhallasw`cloud [11:35:44] or just touching 1 and then saw all of the category changes? [11:36:04] I was touching all pages [11:36:09] hmmm, okay [11:36:25] we had an issue where they were not assigned to commons image [11:36:36] and you weren't touching the File itself, but all of the Side: pages? [11:36:40] we still have an issue [11:36:43] correct [11:36:59] hot assigned to commons image? [11:37:01] *not [11:37:27] file usge [11:37:29] usage [11:37:41] ahhh, so commons didn't show the page as using the file? 
[11:37:44] https://phabricator.wikimedia.org/T108799 [11:37:49] correct [11:38:31] and I still haven't worked out why the my bot is not showing as a bot on edits either [11:38:56] separate issue however [12:34:26] 6Labs, 6operations, 3Labs-sprint-112, 5Patch-For-Review: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1592521 (10yuvipanda) Ok, so the problem was that the cleanup script wasn't being triggered by any means automatically. Should be fixed now - need to... [12:34:39] 6Labs, 6operations, 3Labs-sprint-112, 5Patch-For-Review: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1592523 (10yuvipanda) a:3yuvipanda [13:08:24] 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1592678 (10scfc) I did disable and later enable queues on some hosts (cf. T109485#1591475). After that, only the queue for `tools-webgrid-lighttpd-1411` was disabled,... [13:20:12] 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1592706 (10valhallasw) I think it's unlikely that was the cause, because *all* queues were in this weird state (qhost didn't report high load). I think it's because of... [14:12:35] 6Labs, 3Labs-sprint-112, 3ToolLabs-Goals-Q4: Fix documentation & puppetization for labs NFS - https://phabricator.wikimedia.org/T88723#1592926 (10mark) [14:52:38] 6Labs, 10Wikimedia-Mailing-lists: expand labs listinfo pages and link them to eachother - https://phabricator.wikimedia.org/T97480#1593106 (10Dzahn) I don't think it's right that this has been moved to "Shell/site". HTML templates allow for a list admin to redesign the listinfo pages as they see fit, only usin... 
[15:01:56] 6Labs, 10Wikimedia-Mailing-lists: expand labs listinfo pages and link them to eachother - https://phabricator.wikimedia.org/T97480#1593147 (10JohnLewis) I think I flipped the workboard quickly while going through. [15:03:03] YuviPanda: could you take a look at ^ / https://phabricator.wikimedia.org/T97480 when you get a few seconds. might be a nice thing to do with the aliases for labs-announce on labs-l if you haven't already :) [15:03:53] JohnFLewis: I have no idea how to do any of that... :( [15:06:10] somewhere in mailman [15:06:11] YuviPanda: it's html code changes on the listinfos (https://lists.wikimedia.org/mailman/edithtml/labs-l/listinfo.html) [15:06:15] :D [15:26:01] (03PS1) 10Jean-Frédéric: Do not add interwiki's on unused images pages to unbreak bot [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235244 (https://phabricator.wikimedia.org/T110829) [15:35:49] (03PS1) 10Jean-Frédéric: Do not add interwiki's on unused images pages to unbreak bot [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235247 (https://phabricator.wikimedia.org/T110829) [15:37:02] (03CR) 10Jean-Frédéric: [C: 032 V: 032] Do not add interwiki's on unused images pages to unbreak bot [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235247 (https://phabricator.wikimedia.org/T110829) (owner: 10Jean-Frédéric) [15:44:32] YuviPanda, valhallasw`cloud, looks like git_pull_cdnjs is failing again. Not for lack of inodes this time [15:44:42] andrewbogott: let me take a look [15:44:51] probably git pull doing weird merges again [15:44:55] easy to reproduce on tools-web-static-01 [15:45:00] also why no shinken-wm here? [15:45:11] because you shut it down at some point [15:45:15] andrewbogott: btw, can you do https://phabricator.wikimedia.org/T110698? [15:45:15] with the NFS failure I think [15:45:28] ah yeah [15:45:39] but puppet was supposed to start it back up! [15:45:50] YuviPanda: yeah, I can… backups are on labstore2001, right? 
Is the filesystem there obvious? [15:46:14] andrewbogott: should be... [15:46:20] andrewbogott: is under /srv/eqiad I think [15:46:24] andrewbogott: yeah, there's a whole bunch of weird merge commits again. I'll git reset --hard for now. [15:46:46] YuviPanda: and what are the actual mechanics of the restore? Just scp back to labstore1002? [15:47:07] andrewbogott: no, scp to localhost and then to the parsoid-spof instance [15:47:11] !log tools git reset --hard cdnjs on tools-web-static-01 [15:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [15:47:20] oh, sure [15:47:29] I don't know why we get merge commits, but I'm guessing upstream is force-pushing things [15:47:37] ok. YuviPanda, I’m sure telling me how to do this is harder than doing it yourself but I nonetheless appreciate knowing how :) [15:47:54] I'll investigate later... [15:48:18] andrewbogott: :D so this is from an 'archived' project - I'm not sure if we ever actually ran that script? [15:48:36] Oh, yeah, I did. Crap, so maybe they aren’t in backups [15:48:41] well, anyway, I’ll find ‘em [15:48:51] andrewbogott: yeah, so it should be in wherever the archived stuff is [15:49:08] andrewbogott: then just scp it to your local machine, then scp it to the parsoid-spof instance (in the VE project) [15:49:20] andrewbogott: they're very well aware it's a SPOF (hence the name...) [15:50:04] YuviPanda: for future reference, all of that stuff is tarballed in /srv/others/orphan-volumes on labstore1002 [15:51:47] andrewbogott: ah, ok! [15:53:08] godog: still having issues with a new instance? [15:53:33] HELLO [15:53:53] andrewbogott: we sorted it out [15:54:07] ok [15:55:19] andrewbogott: yup we did, PEBCAK as it turns out [15:55:20] ah, ok :) [15:55:20] good to hear [15:55:20] glad everything is ok [15:55:21] and sorted out [15:55:21] JJ_ you’re a chatbot, yes? 
[15:55:21] big up [15:55:21] nah [15:55:21] im human [15:55:22] bah, I wish I could remember how to op myself [15:55:22] godog: and $realm calculation being strange [15:55:22] so confused with the internet [15:56:19] are we coding things on PC which is directly affecting physical reality? [15:56:21] YuviPanda: yeah! I found a ticket about that and updated [15:58:39] 6Labs, 3Labs-sprint-112: Restore some files from /home/gwicke - https://phabricator.wikimedia.org/T110698#1593408 (10Andrew) On a good day this is an easy thing to do, but NFS is overextended right now so we need to do some infrastructure work before I can restore your archive. Please ping me in a few days an... [15:59:19] 6Labs, 3Labs-sprint-112: Restore some files from /home/gwicke - https://phabricator.wikimedia.org/T110698#1593415 (10yuvipanda) a:5yuvipanda>3None [16:07:52] (03PS1) 10Jean-Frédéric: Specify site to use when specifying NamespaceFilterPageGenerator [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235260 (https://phabricator.wikimedia.org/T110420) [16:08:22] (03CR) 10Jean-Frédéric: [C: 032 V: 032] Specify site to use when specifying NamespaceFilterPageGenerator [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235260 (https://phabricator.wikimedia.org/T110420) (owner: 10Jean-Frédéric) [16:22:51] RECOVERY - Puppet failure on tools-web-static-01 is OK Less than 1.00% above the threshold [0.0] [16:23:33] 6Labs, 10Tool-Labs: sql script does not accept wildcards as parameter - https://phabricator.wikimedia.org/T75595#1593518 (10scfc) Thanks for the patch. The quoting of `"${*:2}"` works for the case where an SQL statement is supplied as a command line argument, but fails for when `sql` is only used to access th... 
[16:40:22] 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1593560 (10Merl) ``` np_load_avg=1.810000 (= 0.060000 + 0.50 * 14.000000 with nproc=4) >= 1.75 = np_load_avg + job_load_adjustments * [weight number of jobs started in... [16:47:26] andrewbogott / YuviPanda, I get the feeling the cdnjs development model is 'we revert things by force pushing' [16:47:53] I wonder if that's just git being too big for what they do [16:48:07] valhallasw`cloud: I also wonder if we can ditch git::clone and just have a shallow clone that's force reset [16:48:19] maybe the puppet git class needs to have an option of just doing ‘reset —hard origin’ after every checkout [16:48:50] we can svn clone the repo :P [16:48:51] We don’t need to support local patches do we? [16:48:51] andrewbogott: nope [16:48:51] * YuviPanda hits valhallasw`cloud with gridengine :P [16:48:52] yeah, but it's probably better to raise alarms on force pushes [16:48:52] was the problem with the checkout or with the rebase? [16:48:56] but we don't actually care! [16:48:59] if they force push [16:49:00] right? [16:49:03] andrewbogott: https://github.com/cdnjs/cdnjs/issues/5587 [16:49:05] well [16:49:13] force push is a great way to sneakily insert stuff [16:49:27] but we don't actually check what they do [16:49:30] so *shrug* [16:50:03] yeah [16:51:06] andrewbogott: so what happens is: they force push to remove a package X, git merges so we end up with a merged version which includes package X. 
Then at some point they re-add package X and we get a merge conflict [16:51:47] yeah, so reset origin would avoid that [16:52:02] but yeah, I think we should probably make the git::pull thing either pull-without-merge /or/ fetch-reset-hard [16:52:38] so 'git pull --ff-only' vs 'git fetch && git reset --hard origin/master' [16:52:43] * andrewbogott thinks that ‘git pull’ should be be deprecated, anyway [16:53:07] as long as there’s no danger of people being upset about local patches… I like the latter. [16:53:11] We know for sure what we’re getting [16:54:56] I think --ff-only is probably better as default option [16:55:05] guaranteed to not do weird stuff or lose information without raising hell [16:55:39] oh yeah, the git class definitely shouldn’t rebase by default! But in this case it might be appropriate. [16:55:58] ah yeah, for this case the reset option is what we want [17:00:57] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds [17:05:47] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 878928 bytes in 2.319 second response time [17:10:51] valhallasw`cloud: andrewbogott ^ just as a fyi, ^ is just mark rebuilding the RAID array. apparently we were one disk failure away from data loss since the outage [17:11:02] fun fun fun [17:12:21] * valhallasw`cloud cheers mark on [17:12:35] gluster just looks better and better [17:13:14] some more people taking on labs raid/nfs storage would help too :) [17:13:52] * mark wonders why he can't login on tools exec nodes [17:14:07] mark, want me to add you? [17:14:11] yes please [17:14:35] I’m guessing your username on wikitech is ‘mark' [17:14:41] I guess ;) [17:15:00] oh, hm, you’re already project admin [17:15:15] you can log in to other labs boxes but not tools? [17:15:21] Or other tools boxes but not exec nodes? 
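Editor's note: the two update strategies weighed above for git::pull — `git pull --ff-only` versus `git fetch && git reset --hard origin/master` — can be demonstrated against a throwaway repository whose history gets rewritten, as cdnjs's was. Everything below is illustrative (temp paths, branch name `master`; `git init -b` needs git ≥ 2.28):

```shell
set -e
# Identity for the demo commits only.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

tmp=$(mktemp -d)
git init -q -b master "$tmp/upstream"
git -C "$tmp/upstream" commit -q --allow-empty -m 'base'
git -C "$tmp/upstream" commit -q --allow-empty -m 'add package X'
git clone -q "$tmp/upstream" "$tmp/mirror"

# Upstream force-pushes: 'add package X' is rewritten away.
git -C "$tmp/upstream" reset -q --hard HEAD~1
git -C "$tmp/upstream" commit -q --allow-empty -m 'unrelated change'

# Option 1: --ff-only refuses the divergent history instead of silently
# creating the merge commits that bit tools-web-static-01.
git -C "$tmp/mirror" pull -q --ff-only || echo 'ff-only refused to merge'

# Option 2: fetch + hard reset tracks upstream exactly, discarding
# local state (fine here, since no local patches are carried).
git -C "$tmp/mirror" fetch -q origin
git -C "$tmp/mirror" reset -q --hard origin/master
log_out=$(git -C "$tmp/mirror" log --oneline)
echo "$log_out"
```

After the reset, the mirror's log contains only 'unrelated change' and 'base' — the force-pushed-away commit is gone, which is exactly the behavior wanted for a pure mirror of cdnjs.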
[17:15:27] oh [17:15:29] it's totally my fault [17:15:32] sorry, can't trust a manager [17:15:36] ok :) [17:17:03] * andrewbogott -> lunch [17:17:04] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds [17:17:44] (intermittent failures only - wfm ^) [17:18:54] i'd still prefer it if that were not the case :P [17:19:29] grrr, any idea why that’s flapping? Every time I look it’s up but shinken disagrees [17:19:51] andrewbogott: I think it just failes intermittently because of NFS failures and shinken catches it and we don't [17:20:00] just slow [17:20:40] yeah it can just timeout because it attempts to read NFS at exactly the wrong time? [17:20:53] yes [17:20:56] it's also a mess of PHP scripts that nobody wants to look at much :( [17:20:59] i wonder what the test timeout is [17:23:14] it's whatever the default check_http timeout is [17:23:47] YuviPanda: it's also the handler for all 500s of other tools :| [17:26:04] non-tool labs rebuild is done btw [17:28:42] hm, the beta cluster is very slow at the moment, I think [17:29:49] RECOVERY - Puppet failure on tools-web-static-02 is OK Less than 1.00% above the threshold [0.0] [17:44:23] * mark is on a random exec node and sees it do quite an insane nr of NFS requests per second [17:44:23] like 10kreq/s [17:46:10] is it mine by chance? [17:46:14] no idea [17:46:16] tools-exec-1403 [17:46:25] ok, then not, phew :) [17:46:42] but if there's anything anyone can do to reduce load on Tools NFS, i'd recommend doing ti :) [17:47:03] just now or in general? [17:47:10] certainly now [17:47:21] and it's never a great idea to put a lot of load on NFS I'd say, that's something we're discouraging :) [17:47:31] but now because of a raid rebuild [17:55:41] YuviPanda: hi [17:55:42] valhallasw`cloud: thx, now it works [17:56:25] YuviPanda: on toolabs if "webservice start" timeouts... then "webservice start" tells you "Your webservice is already running" [17:56:41] YuviPanda: but this is not the case!? 
server_caps delegreturn getacl setacl fs_locations rel_lkowner [17:58:06] 26 0% 0 0% 0 0% 0 0% 0 0% 2658196872 73% [17:58:47] something is doing a whole lot of locking on nfs [17:59:22] do you have tools to find out more? [18:01:16] Kelson42: which tool? it can take a bit of time before the webservice is available [18:02:36] valhallasw`cloud: enwp10 [18:02:36] valhallasw`cloud: ok, but if "webservice start" tells you "Timeout: could not start job", then the service should not start! [18:02:59] Kelson42: webservice seems up [18:03:02] tools.enwp10@tools-bastion-01:~$ webservice stop [18:03:03] Stopping web service [18:03:03] tools.enwp10@tools-bastion-01:~$ webservice start [18:03:03] Starting web service..............................Timeout: could not start job in 30stools.enwp10@tools-bastion-01:~$ [18:03:03] tools.enwp10@tools-bastion-01:~$ webservice start [18:03:05] Your webservice is already running [18:04:02] valhallasw`cloud: yes, it works but the problem is that the webservice status is not trustable at all [18:04:14] I don't see how that's the case. The timeout message could be a bit clearer, but webservice status just asks qstat whether the webservice is running. [18:04:35] tools.enwp10@tools-bastion-01:~$ webservice status [18:04:35] Your webservice is running [18:04:36] and it is. [18:05:25] valhallasw`cloud: how do you explain in the log I have posted, that the webservice is running, although one line above it stops with a timeout? [18:06:03] because the webservice didn't start in 30 seconds [18:06:24] it did, however, start after a longer period of time [18:06:52] valhallasw`cloud: ok, then indeed the message (the behaviour of "webservice start") is totally misleading [18:08:46] Kelson42: I disagree, but feel free to open a bug on Phabricator. [18:10:53] valhallasw`cloud: Kelson42 I think maybe the web service monitor started it after your start failed [18:10:54] ? 
[18:11:04] See service.log for an entry [18:11:07] no, I think the start is just a bit slow because of NFS [18:12:33] Well, ClueBot gave up :P [18:12:44] YuviPanda: I have a lot of " No running webservice job found, attempting to start it" [18:13:34] mark: do we have an eta for the rebuild? [18:13:42] weeks [18:13:44] YuviPanda: I just want to write a small script restarting automatically the webservice if for some reason it dies... so need to rely on something to run "webservice start".... [18:13:58] i just eased the rebuild speed [18:14:08] so, i'm monitoring nfs load on a random tools exec node [18:14:11] and all it's doing is file locking [18:14:14] Kelson42: please don't do that. We have web service monitor that does that [18:14:38] Kelson42: right now there is NFS slowdowns causing issues probably. [18:14:39] there's nothing in the service.log from today (just from yesterday and before that) [18:15:36] YuviPanda: any solution to autorestart services (avoid me to monitor this manually)? [18:15:39] so it's just slowness in starting the job [18:15:49] Kelson42: webservices /are/ autorestarted [18:16:05] valhallasw`cloud: ok, then its perfect. [18:16:22] YuviPanda: valhallasw`cloud thx again for the support today. Wish you the best to fix last problems with NFS. [18:16:31] you're welcome [18:21:08] wikitech wonky at the moment too? [18:24:05] Can't get any OSM functions to work on wikitech. Already tried the login/logout dance. [18:26:45] ostriches: what do you mean with 'OSM functions'? [18:26:45] ostriches: have you tried 'remove yourself from the project and get re-added'? [18:27:14] valhallasw`cloud: OpenStackManager stuffs. Like adjusting public IPs, instances, etc. [18:27:18] YuviPanda: No. [18:27:41] yeah, that's probably the issue that's solved by removing-and-readding [18:27:46] YuviPanda: btw, can you do that for me for tools? 
[18:27:51] and toolsbeta [18:27:54] valhallasw`cloud: yes, let me do that [18:28:23] YuviPanda: Removed myself from deployment-prep, can you readd? [18:28:28] +admin [18:28:30] ostriches: yes [18:28:52] ostriches: what's your username? [18:28:58] "Chad" [18:29:49] the 'delete' button is dangerously close to 'add member' [18:30:08] lol, yerp [18:31:05] ostriches: done [18:31:13] valhallasw`cloud: doing for you now, on tools [18:32:05] YuviPanda: That did it, thx [18:34:06] valhallasw`cloud: done for tools [18:34:14] valhallasw`cloud: also, heh, I still can't spell or pronounce your name :P [18:34:50] hahaha. Yeah, I don't think anyone non-dutch can spell my name. [18:34:55] Merljin is very common [18:35:16] Is it pronouced like Merlin? [18:35:50] valhallasw`cloud: yeah, I'm sticking to just saying 'valhalla' :P [18:36:49] RichSmith: sort of. It's indeed the same name, but with the dutch ij digraph in there. https://en.wikipedia.org/wiki/IJ_(digraph) has a voice sample [18:47:04] hello!!!! I have not logged in on labs in three months and now .. ahem... i can't seem to be able to ssh [18:47:15] i saw a note about bastion and nfs [18:47:29] but i am not sure what do i need to change on my ssh config [18:50:34] nuria: can you tell me what’s happening when you try to log in? [18:50:46] andrewbogott: hello!, yes [18:51:18] https://www.irccloud.com/pastebin/ncxW1Xgq/ [18:52:42] andrewbogott: i can connect to bastion [18:53:02] (bast1001.wikimedia.org) [18:53:25] I think your labs username is different from your prod username [18:53:34] do you know what the labs name is, or shall I look it up? [18:54:31] valhallasw`cloud: re-added for toolsbeta too, btw [18:54:38] thanks [18:55:06] andrewbogott: both was 'nuria' [18:55:15] andrewbogott: orrr [18:55:22] ok, then you’ll need to ssh nuria@ [18:55:24] andrewbogott: i had /home/nuria [18:55:37] right now I see you trying to log in as ‘nuriaruiz' [18:55:51] which labs does not recognize as a valid user. 
Presumably that’s your name on your local system? [18:56:53] looks like it’s working now [18:56:56] andrewbogott: ah, yes, i guess i forgot i did that before? [18:57:04] andrewbogott: sorry about that! [18:57:08] andrewbogott: and many thanks [18:57:12] 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1594077 (10valhallasw) Thank you for that explanation, that helps a lot! A bit more post-mortem work. From the accounting file, I got a list of finished jobs starting... [18:57:15] no problem :) [18:59:05] 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1594098 (10valhallasw) What's also odd is that the first report on IRC came in at 02:34 CEST = 00:34 UTC, while there doesn't seem to be anything in the accounting log... [19:00:43] andrewbogott: could i possibly have my homedir restored? [19:00:50] andrewbogott: if it is a big deal np [19:01:29] YuviPanda: btw, multichill suggested we could choose to not care about multi-tenancy logstash, and just dump everything readable for all. Most log files already are, after all.... [19:01:53] most log files aren't world-readable, are they? [19:01:55] of course they are. [19:01:58] hahahahaha [19:02:05] * YuviPanda goes to find a wall somewhere to bash his head at [19:02:19] access.log is pretty world-readable by default, I think :-p [19:02:26] yeah, but error.log [19:02:35] I bet a lot of secrets are too [19:03:18] I'm not sure what in error.log should be considered very secret, to be honest [19:03:41] YuviPanda, do you know what’s happening with /home in project analytics? [19:03:56] it’s not marked for archiving but neither is it mounted on this instance... [19:04:07] andrewbogott: yes, there's a bug for it. 
[19:04:48] jesus, phab search sucks [19:05:52] well I can't find it [19:06:32] 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1594154 (10scfc) My logging in T109485 was done fairly soon after the corresponding action, i. e. I //enabled// the queues ~ 3:00Z after seeing the pending queue growin... [19:07:01] ok, fuck you too phabricator [19:07:02] sigh [19:07:44] * valhallasw`cloud hugs YuviPanda [19:08:51] andrewbogott: ok, so what is happening is that NFS homedir mounts were killed during the big NFS outage [19:09:00] andrewbogott: but they needed a way to keep doing backups, so /data/project is mounted [19:09:14] andrewbogott: so nuria your homedir might be backed up on /data/project - take a look and recover what you'd like? [19:10:55] YuviPanda: k, [19:14:22] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594216 (10Andrew) @scfc, is this still happening for you? (If so, that's good, because I don't have any other test cases) [19:14:51] YuviPanda: many thanks [19:26:48] 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1594306 (10hashar) That is an issue in jenkins-debian-glue on Trusty: ``` ... 00:00:01.212 Checking out Revision f275d97d7010b3bb2709d4a5211e2... [19:33:01] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594357 (10scfc) Yes, it is. After deleting all cookies for `wikitech.wikimedia.org` and logging in again, "Tim Landscheidt" still cannot see any instances and "Tim Lan... 
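The post-mortem comments on T110994 above mention pulling a list of finished jobs out of the gridengine accounting file. A minimal sketch of that kind of analysis is below; the field positions follow the documented sge_accounting(5) colon-delimited layout, but the exact indices, the field names chosen, and the sample record format should all be treated as assumptions rather than what valhallasw actually ran.

```python
# Hypothetical sketch: listing finished jobs from an SGE accounting file.
# One colon-delimited record per finished job; indices per sge_accounting(5)
# (0=queue, 1=host, 3=owner, 4=job name, 5=job number, 10=end time) -- these
# positions are assumptions, verify against your gridengine version.

def parse_accounting_line(line):
    """Parse one accounting record into a small dict of interesting fields."""
    fields = line.rstrip("\n").split(":")
    return {
        "queue": fields[0],
        "host": fields[1],
        "owner": fields[3],
        "job_name": fields[4],
        "job_number": int(fields[5]),
        "end_time": int(fields[10]),  # epoch seconds when the job finished
    }

def finished_jobs_since(lines, since_epoch):
    """Return records for jobs that ended at or after the given epoch time."""
    records = (parse_accounting_line(l)
               for l in lines if l and not l.startswith("#"))
    return [r for r in records if r["end_time"] >= since_epoch]
```

On a real Tool Labs master this would read something like `/var/lib/gridengine/default/common/accounting`; here the path and data are illustrative only.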
[19:35:09] 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1594360 (10hashar) I manually triggered the debian-glue job. It uses the `jessie` distribution as provided by the Debian project (no apt.wikim... [19:43:09] valhallasw`cloud: ugh, webservicemonitor failing with qstat not working [19:44:32] YuviPanda: qstat not working? [19:44:41] what's not working....? [19:44:50] valhallasw`cloud: yeah, tools-services-02 stopped being a submit host [19:44:56] >_< [19:44:59] valhallasw`cloud: and since webservicemonitor shells out to qstat that failed [19:45:02] valhallasw`cloud: just re-added it [19:45:03] but [19:45:05] WHAT THE FUCK [19:45:43] :P [19:45:45] scfc did something with submit hosts last night [19:45:50] valhallasw`cloud: so it was missing tools-checker-01 as well [19:45:52] working on the hostname issue [19:47:22] https://phabricator.wikimedia.org/T110982#1591388 [19:48:00] 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1594380 (10hashar) We could get the target distribution from the `debian/changelog` file using: export distribution=$(dpkg-parsechangelog --s... [19:48:07] YuviPanda: ^ [19:48:18] it's listed there [19:48:49] valhallasw`cloud: ah, ok [19:48:50] YuviPanda: ^ [19:50:01] gah, irccloud [19:50:46] 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Change sid pbuilder image name to 'unstable' - https://phabricator.wikimedia.org/T111097#1594382 (10hashar) 3NEW a:3akosiaris [19:50:49] valhallasw`cloud: ah, I see [19:51:07] 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1594391 (10yuvipanda) Re-enabled it for tools-services-02, it was running webservicemonitor. 
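The breakage above is the "shell out to qstat" pattern: webservicemonitor runs `qstat`, and when its host stops being a gridengine submit host the command fails outright. A rough illustration of that pattern is sketched below; it is not webservicemonitor's actual code, and the job-name parsing assumes `qstat`'s default table layout (job id, priority, name, owner, ...) with a two-line header.

```python
# Hypothetical sketch of shelling out to qstat, as webservicemonitor does.
# When this host is not a submit host, qstat exits non-zero -- the failure
# mode described above -- so check the return code instead of blindly
# parsing stdout.
import subprocess

def running_job_names(qstat_output):
    """Extract job names from default qstat table output."""
    names = []
    for line in qstat_output.splitlines()[2:]:  # skip the two header lines
        parts = line.split()
        if len(parts) >= 3:
            names.append(parts[2])  # third column is the job name
    return names

def query_grid():
    """Run qstat; raise a clear error if this host can't reach the master."""
    proc = subprocess.run(["qstat"], capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError("qstat failed (not a submit host?): %s"
                           % proc.stderr.strip())
    return running_job_names(proc.stdout)
```

Keeping the parsing in a pure function like `running_job_names` also makes the monitor logic testable without a grid available.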
[19:51:11] valhallasw`cloud: haha, inorite [19:53:28] 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1594400 (10hashar) p:5Triage>3Low [20:07:14] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 22.22% of data above the critical threshold [0.0] [20:14:26] "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Reading data from Tools failed: NoMethodError: undefined method `[]' for nil:NilClass at /etc/puppet/manifests/realm.pp:65 on node tools-webgrid-lighttpd-1206.tools.eqiad.wmflabs" [20:15:45] andrewbogott: ^ anything to do with the juno upgrade? [20:16:10] valhallasw`cloud: I don’t know! I will look. [20:17:05] ‘realm’ comes from ldap I think [20:17:14] andrewbogott: might be DNS related (I think it's the ipresolve(hiera('labs_recursor'),4) line, but I have an old checkout) [20:17:42] that would explain the ruby error as well [20:18:08] yep, line 65 is " $nameservers = [ ipresolve(hiera('labs_recursor'),4) ]" [20:19:05] is this happening intermittently? Puppet runs fine on tools-webgrid-lighttpd-1206 right now [20:19:43] might have been a hiccup then -- I was responding to the shinken error [20:20:59] there’s a similar thing that happens when instances get low on memory and facter errors out [20:21:11] let me know if you see a pattern of that failure [20:21:21] I was upgrading virt nodes just now, but I don’t know why that would affect dns... [20:21:39] YuviPanda, wasn't Coren supposed to be back by now? [20:21:52] He’s out sick, he’ll be back when he’s back :( [20:21:54] I suppose it could be a memory issue (there's ~380M free, which is not a lot) [20:22:08] Cyberpower678: also, why do you need coren specifically? [20:22:17] Oh dear. Hopefully not these past two weeks he's been gone. 
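The `NoMethodError ... for nil:NilClass` above is the classic shape of a resolver helper returning nil on a transient DNS hiccup and the caller indexing into the result anyway. A sketch of the defensive version of the same idea, in Python rather than Puppet's Ruby, is below; the hostname, retry count, and delay are arbitrary illustrations.

```python
# Defensive DNS lookup sketch: retry transient failures and raise a clear
# error instead of returning None for downstream code to trip over (the
# failure mode seen in realm.pp's ipresolve call above).
import socket
import time

def resolve_ipv4(hostname, attempts=3, delay=0.1):
    """Return the first IPv4 address for hostname, retrying transient errors."""
    last_error = None
    for _ in range(attempts):
        try:
            infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
            return infos[0][4][0]  # first (address, port) tuple's address
        except socket.gaierror as e:
            last_error = e
            time.sleep(delay)
    raise RuntimeError("could not resolve %s: %s" % (hostname, last_error))
```

The point is simply that a lookup used to build config (like `$nameservers`) should fail loudly, not hand nil to the next expression.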
[20:22:19] Cyberpower678: for your exec node question, please just open a task on phab [20:27:46] valhallasw`cloud: BTW, I tried to push a new grrrit-wm config patch last night and it failed miserably so I rolled back. Whenever you have some spare time (;-)) might be worth a poke? [20:28:48] Did the labs API break? [20:28:53] https://tools.wmflabs.org/nagf/?project=integration - 404 Project not found [20:32:08] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0] [20:32:44] {"batchcomplete":"","query":{"novainstances":[]}} [20:32:58] https://wikitech.wikimedia.org/w/api.php?format=json&action=query&list=novainstances&niregion=eqiad&niproject=integration [20:33:00] wtf [20:33:07] niproject=cvn is working [20:33:08] weird [20:33:25] James_F: ehhhh, yes, sure. What went wrong? [20:33:27] hashar: [20:33:50] James_F: did you just fab deploy? [20:35:12] !log tools.lolrrit-wm valhallasw: Deployed c54dcb7da4aa9e56cd2f077171d0fd151d8e463a Follow-up ad0675b8: Use performance.* as the regex instead [20:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [20:35:27] valhallasw`cloud: I merged it, logged in, pulled/reapplied/job restarted and it didn't. [20:35:41] valhallasw`cloud: joined as grrrit-wm1 not grrrit-wm, didn't join some channels, didn't emit events. [20:35:48] ah [20:35:52] valhallasw`cloud: And nothing in the logs. [20:36:05] In fact, exactly what just happened when you re-deployed. :-) [20:36:35] the logs are in logs/, that seems updated [20:37:03] but it doesn't get to the ssh state [20:37:08] Well, they didn't explain why it hadn't worked [20:37:17] 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1594599 (10scfc) Ah, https://wikitech.wikimedia.org/wiki/Hiera:Tools customizes `role::labs::tools::services::active_host`. I had only looked at `hieradata/`, sorry. 
[20:37:38] ugh, ##wmt again [20:37:42] d /win 2 [20:38:26] although I can join [20:38:27] :/ [20:39:37] valhallasw`cloud: kill it [20:40:06] 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-111, 3Labs-sprint-112, 5Patch-For-Review: Update remaining virt nodes to OpenStack Juno - https://phabricator.wikimedia.org/T110886#1594605 (10Andrew) 5Open>3Resolved All labvirt100x hosts running Juno now, and the scheduler is back to normal. [20:40:06] 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-111: Update Labs to OpenStack Juno - https://phabricator.wikimedia.org/T110047#1594607 (10Andrew) [20:40:47] !log tools.lolrrit-wm valhallasw: Deployed 7cff198c3ac504e8b38520997bdd72d5cbc8481c remove ##wmt, cannot join [20:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [20:41:21] 6Labs, 10Labs-Infrastructure, 3Labs-sprint-112: Update Labs to OpenStack Kilo - https://phabricator.wikimedia.org/T110045#1594610 (10Andrew) Scheduled for Wednesday 2015-09-09 16:00 UTC Of course I need to write the related puppet config patches in the meantime. [20:41:24] aaah stupid grrit [20:41:29] yes, working now! [20:42:04] 6Labs, 10Labs-Infrastructure: Keystone/Wikitech project membership messed up - https://phabricator.wikimedia.org/T110887#1594611 (10Andrew) [20:43:04] (03CR) 10Merlijn van Deen: "test" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235356 (owner: 10Merlijn van Deen) [20:43:15] ...or is it? :9 [20:43:18] yes. [20:43:19] good [20:43:30] James_F: so you did everything right, basically :-p [20:43:55] valhallasw`cloud: That's a really silly reason for it to fail. :-) [20:43:56] but the bot doesn't like not being able to join channels [20:43:58] I know [20:44:28] valhallasw`cloud: But thanks! [20:45:01] 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1594663 (10scfc) … so disabled ` tools-services-01` as submit host. 
[20:48:13] 6Labs, 10Labs-Infrastructure, 3Labs-sprint-112: Update Labs to OpenStack Kilo - https://phabricator.wikimedia.org/T110045#1594683 (10Andrew) [20:48:13] valhallasw`cloud: Want me to push https://gerrit.wikimedia.org/r/#/c/235357/ now to show that I can? ;-) [20:48:14] 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-111: Update Labs to OpenStack Juno - https://phabricator.wikimedia.org/T110047#1594682 (10Andrew) 5Open>3Resolved [20:48:27] James_F: protip: use fab deploy :-) [20:48:37] valhallasw`cloud: What is 'fab deploy'? [20:48:46] valhallasw`cloud: And why is it not documented? [20:49:01] James_F: because bug #1 ;-) [20:49:11] valhallasw`cloud: https://wikitech.wikimedia.org/wiki/Grrrit-wm :-P [20:49:16] fabric is a tool to automate deployment etc [20:49:29] OK. [20:49:45] And this is a tool that makes deploying grrrit-wm easier? [20:50:16] yes [20:50:19] We use fab to deploy Zuul changes as well (fab deploy_zuul; when inside integration/config.git), saves us a lot of time there at least [20:50:40] Though it's not much more than a glorified makefile (e.g. make deploy) [20:51:05] James_F: https://wikitech.wikimedia.org/wiki/Grrrit-wm#Deploying ;-) [20:51:29] valhallasw`cloud: It auto-rebases `auth` onto master and deploys? [20:51:35] remotely, yes [20:51:38] Neat. [20:51:43] and auto-!log-s [20:51:47] Very neat. [20:51:52] fabfile.py is the actual code [20:51:54] so you run it from local workstation, not tools-login (and presumably starts ssh/) [20:51:59] yep [20:52:31] it does assume 'ssh tools-login.wmflabs.org' works as-is [20:52:35] I think [20:53:12] * James_F checks. [20:53:42] !log tools.lolrrit-wm jforrester: Deployed 7cff198c3ac504e8b38520997bdd72d5cbc8481c remove ##wmt, cannot join [20:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [20:53:52] \o/ [20:53:57] Hmm. Well, it seemed to work. 
[20:54:16] Though ideally it'd not do anything if master was already good to go… [20:54:24] s/good to go/live/ [20:54:29] But neat. [20:54:45] yeah, that'd be even nicer, but it's mostly for a '+2, wait for merge, fab deploy' workflow [20:55:06] * James_F nods. [20:55:20] When we move to Phabricator code review we can do it in one line. [20:55:30] `arc merge && fab deploy` [20:55:33] Or whatever. [20:55:42] fab merge-deploy-ALL-the-things [20:55:44] ;-) [20:55:50] * James_F grins. [20:55:54] Also, "When". [20:55:56] * James_F coughs. [20:56:13] anyway. The wm1 wm2 wm386 thing is also a bit annoying, and I'm not completely sure what happens [20:56:46] I think it connects before the old connection is gone, or something like that [20:57:04] but again not very high priority... [20:59:07] ....or it's SGE doing something crazy again :( [21:01:11] !log tools killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately. [21:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [21:02:21] * valhallasw`cloud goes cry in a corner again [21:05:30] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594715 (10Andrew) Are you able to view things as you'd expect within Horizon? Or is your project membership broken there as well? [21:06:40] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594717 (10Andrew) Also, what does your project filter look like? Are you still able to select projects that you belong to? [21:09:49] valhallasw`cloud: Still got a clone. Reporting twice. [21:09:59] arggghh [21:10:10] SGE, what are you doing :( [21:10:39] !log tools.lolrrit-wm now qdel'ing the job, but there's still one left. 
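The cleanup done above (killing one of two grrrit-wm jobs that SGE was somehow running at once) can be sketched as a small dedup step: keep the oldest job id per name and collect the rest as `qdel` candidates. The input format here, a list of (job id, job name) pairs presumably parsed from qstat, is an assumption for illustration.

```python
# Hypothetical sketch of finding duplicate grid jobs (the grrrit-wm /
# grrrit-wm1 situation above): keep the lowest (oldest) job id for each
# name and return the rest, which would then be handed to qdel.

def duplicate_job_ids(jobs):
    """Given (job_id, job_name) pairs, return ids to delete, keeping the
    lowest id per name."""
    keep = {}
    for job_id, name in sorted(jobs):  # ascending id, so first seen is kept
        keep.setdefault(name, job_id)
    return sorted(jid for jid, name in jobs if keep[name] != jid)
```

Keeping the lowest id assumes SGE assigns ids monotonically, which is why "oldest" and "lowest id" coincide here.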
ARRRGGHGH [21:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [21:10:45] oh, there we go. [21:11:13] !log tools.lolrrit-wm and then we `fab start-job` and then hopefully everything is alright again....? [21:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [21:11:36] Krinkle: thanks for the ping. Should be OK now. [21:11:47] Yep [21:11:48] thx [21:20:51] 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 5Patch-For-Review: MWEchoNotificationEmailBundleJob causes exceptions due to delays not being supported by non-redis job queues - https://phabricator.wikimedia.org/T110985#1594779 (10Krenair) 5Open>3Resolved [21:34:35] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594842 (10Andrew) ok -- I'm still interested in the answers to the above questions but I've also put a small live-hack in place. Does it affect behavior? [21:49:27] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594918 (10scfc) I have tried https://horizon.wikimedia.org/ just now for the first time, and the list of instances there is perfectly fine. In wikitech, yes, I was able... [22:03:30] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1411 unreachable and can't get password entries - https://phabricator.wikimedia.org/T110783#1594985 (10scfc) 5Open>3Resolved a:3scfc I have rebooted the instance via https://horizon.wikimedia.org/project/instances/, `ssh` worked fine afterwards, so I re-enabled... [22:10:51] <[Crow]> Coren are you ATK by any chance? [22:11:06] <[Crow]> Or can someone restart CorenSearchBot on the Labs cluster? 
[22:34:48] 6Labs, 6operations, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1595100 (10chasemp) Saw this again today on mw1142. As before it was accompanied by: `[2015-09-01 20:52:21.847] nc_proxy.c:330 client connections 935 exceed limit 93` http://graphit... [22:53:05] 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1595175 (10Jdforrester-WMF) [22:53:44] 6Labs, 3Labs-sprint-112: Restore some files from /home/gwicke - https://phabricator.wikimedia.org/T110698#1595214 (10GWicke) @Andrew, okay. I'm sure you know this, but [it is possible to extract specific files without unpacking the entire tar](http://www.cyberciti.biz/faq/linux-unix-extracting-specific-files... [23:27:14] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 55.56% of data above the critical threshold [0.0]
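The last task comment above (T110698) points out that specific files can be pulled out of a tar archive without unpacking the whole thing. A minimal sketch using Python's standard `tarfile` module is below; the archive and member paths are illustrative only, not the actual backup layout.

```python
# Extract a single member from a tar archive without unpacking everything,
# as suggested in the T110698 comment above. Paths are hypothetical.
import tarfile

def extract_one(archive_path, member_name, dest_dir):
    """Extract one named member from the archive into dest_dir,
    preserving its path inside the archive."""
    with tarfile.open(archive_path) as tar:  # compression auto-detected
        tar.extract(member_name, path=dest_dir)
```

The shell equivalent is simply naming the member after the archive, e.g. `tar -xzf home-backup.tar.gz home/gwicke/.bashrc` (filenames hypothetical).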