[00:12:20] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591286 (10scfc) Plan for after the change gets merged:  1. On `tools-master`, `sudo service gridengine-master restart`.  This should be safe and not cause any loss of data. 2. On `to...
[00:18:46] <grrrit-wm>	 (03PS1) 10Krinkle: app: Fix edit summary bug - content after section should not be dropped [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/235158 
[00:19:24] <grrrit-wm>	 (03PS2) 10Krinkle: app: Fix edit summary bug - content after section should not be dropped [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/235158 
[00:19:36] <grrrit-wm>	 (03CR) 10Krinkle: [C: 032] app: Fix edit summary bug - content after section should not be dropped [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/235158 (owner: 10Krinkle)
[00:20:31] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591296 (10scfc) Preparation work: Disabled all queues on `tools-exec-1218`, `tools-exec-1401`, `tools-webgrid-generic-1401`, `tools-webgrid-lighttpd-1201` and `tools-webgrid-lighttpd...
[00:20:41] <grrrit-wm>	 (03CR) 10Krinkle: [V: 032] "TODO: Set up lint test." [labs/tools/guc] - 10https://gerrit.wikimedia.org/r/235158 (owner: 10Krinkle)
[00:24:47] <wikibugs>	 6Labs, 3Labs-sprint-112, 5Patch-For-Review: Logins fail on new instances - https://phabricator.wikimedia.org/T110891#1591302 (10Andrew) 5Open>3Resolved yep, fixed on Trusty as well.  New images are live now.
[00:24:51] <grrrit-wm>	 (03CR) 10Jforrester: [C: 032] Send performance/* repo activity to #wikimedia-perf [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235028 (owner: 10Krinkle)
[00:26:53] <grrrit-wm>	 (03Merged) 10jenkins-bot: Send performance/* repo activity to #wikimedia-perf [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235028 (owner: 10Krinkle)
[00:28:49] <James_F>	 Hmm.
[00:34:12] <Amir1>	 I can submit jobs
[00:34:16] <Amir1>	 Is ti okay?
[00:34:18] <Amir1>	 *can't
[00:35:32] <Amir1>	 Coren, valhallasw`cloud 
[00:45:33] <wikibugs>	 6Labs, 10wikitech.wikimedia.org, 7Database: SemanticMediaWiki tries to create temporary tables, but can't as wikiuser is restricted - https://phabricator.wikimedia.org/T110981#1591340 (10Krenair) 3NEW
[00:47:12] <legoktm>	 Krenair: SMW moved to github, the project was archived.
[00:47:25] * Krenair grumbles
[00:48:09] <grrrit-wm>	 (03PS1) 10Jforrester: Follow-up ad0675b8: Use performance.* as the regex instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235161 
[00:50:47] <grrrit-wm>	 (03CR) 10Krinkle: [C: 031] Follow-up ad0675b8: Use performance.* as the regex instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235161 (owner: 10Jforrester)
[00:50:57] <grrrit-wm>	 (03CR) 10Jforrester: [C: 032] Follow-up ad0675b8: Use performance.* as the regex instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235161 (owner: 10Jforrester)
[00:51:00] <grrrit-wm>	 (03Merged) 10jenkins-bot: Follow-up ad0675b8: Use performance.* as the regex instead [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235161 (owner: 10Jforrester)
[01:00:46] <sitic>	 tools-bastion-02 aka tools-dev is in a strange state, seems to be firewalled from the rest of the network. qstat says:
[01:00:49] <sitic>	 error: denied: host "tools-bastion-02.eqiad.wmflabs" is neither submit nor admin host
[01:01:08] <sitic>	 and "telnet tools-redis 6137" (which works on tools-login):
[01:01:15] <sitic>	 telnet: Unable to connect to remote host: Connection refused
[01:12:52] <James_F>	 !log tools.lolrrit-wm Re-restarting grrrit-wm rolled back to 2f5de55ff75c3c268decfda7442dcdd62df0a42d
[01:12:56] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[01:18:08] <wikibugs>	 6Labs, 10Tool-Labs: tools-bastion-02 (aka tools-dev) can't submit grid jobs - https://phabricator.wikimedia.org/T110982#1591388 (10Sitic) 3NEW
[01:18:38] <sitic>	 ^^ never mind that telnet tools-redis thing, that was just a bad port in my bash history
[01:31:39] <wikibugs>	 6Labs, 10Tool-Labs: tools-bastion-02 (aka tools-dev) can't submit grid jobs - https://phabricator.wikimedia.org/T110982#1591405 (10scfc) 5Open>3Resolved a:3scfc Sorry, as part of T109485 I had removed the host as a submit host without realizing that people actually do work there :-).  I have re-added it.
[01:34:05] <grrrit-wm>	 (03PS1) 10BBlack: empty uni.wm.o key for compiler testing [labs/private] - 10https://gerrit.wikimedia.org/r/235172 
[01:34:07] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591414 (10scfc) I have readded `tools-bastion-02` as a submit host because people are actually using it (cf. T110982).  So the actual switch of the host name would be done between 1....
[01:34:17] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] empty uni.wm.o key for compiler testing [labs/private] - 10https://gerrit.wikimedia.org/r/235172 (owner: 10BBlack)
[03:02:41] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1591475 (10scfc) After the number of pending jobs grew, I undid the preparation work by:  ``` scfc@tools-bastion-01:~$ for host in tools-exec-1218 tools-exec-1401 tools-webgrid-generi...
[03:03:18] <wikibugs>	 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org: MWEchoNotificationEmailBundleJob causes exceptions due to delays not being supported by non-redis job queues - https://phabricator.wikimedia.org/T110985#1591476 (10Krenair) 3NEW a:3Krenair
[03:09:41] <wikibugs>	 6Labs, 10CirrusSearch, 6Discovery, 10wikitech.wikimedia.org, and 2 others: Wikitech CirrusSearch jobs throwing exceptions on silver - https://phabricator.wikimedia.org/T110635#1591488 (10Krenair) a:3JanZerebecki
[03:09:55] <wikibugs>	 6Labs, 10CirrusSearch, 6Discovery, 10wikitech.wikimedia.org, and 2 others: Wikitech CirrusSearch jobs throwing exceptions on silver - https://phabricator.wikimedia.org/T110635#1591489 (10Krenair) 5Open>3Resolved Looks like that did the trick.
[03:23:03] <wikibugs>	 6Labs, 6operations, 10wikitech.wikimedia.org: Determine whether wikitech should really depend on production search cluster - https://phabricator.wikimedia.org/T110987#1591503 (10Krenair) 3NEW
[05:46:32] <spagewmf>	 hey laboraticians, on a labs instance without any evil NFS mounts, is any partition backed up?
[05:47:14] <spagewmf>	 developer-doc-devhub.developer-doc.eqiad.wmflabs has /dev/vda1 mounted on /, is that backed up?
[06:01:55] <Kelson42>	 YuviPanda: hi. I newly maintain "enwp10" on toolabs. Unfortunately it does not run anymore https://tools.wmflabs.org/enwp10/cgi-bin/list2.fcgi?run=yes&projecta=Skepticism&importance=Unknown-Class&quality=Unassessed-Class
[06:01:59] <valhallasw`cloud>	 spagewmf: no, i don't think SSSI
[06:02:19] <Kelson42>	 YuviPanda: wanted to restart the webservice... but this fails 
[06:02:23] <Kelson42>	 webservice start
[06:02:23] <Kelson42>	 Starting web service..............................Timeout: could not start job in 30stools.enwp10@tools-bastion-01:~$
[06:03:12] <Kelson42>	 YuviPanda: I'm a newby on toolabs and it looks like something fundamental does not work, but I don't know what. Any idea?
[06:03:32] <valhallasw`cloud>	 Kelson42: there were earlier issues with the grid -- looking now. Yuvi is probably still asleep (european time zone)
[06:04:22] <Kelson42>	 valhallasw`cloud: ok, thx for the feedback. hopefuly he can answer later, at the time he will be awake ;)
[06:06:19] <valhallasw`cloud>	 !log tools investigating SGE issues reported on irc/email
[06:06:23] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[06:06:47] <valhallasw`cloud>	 !log tools test job does not get submitted because all queues are overloaded?!
[06:06:49] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[06:07:59] <valhallasw`cloud>	 !log tools e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!
[06:08:02] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[06:09:25] <valhallasw`cloud>	 uhh what on earth?
[06:10:28] <gifti>	 my giftbot queue is also overloaded, it seems
[06:17:32] <valhallasw`cloud>	 !log tools going to restart sge_qmaster, hoping this solves the issue :/
[06:17:36] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[06:18:44] <wikibugs>	 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1591712 (10valhallasw) p:5Triage>3Unbreak!
[06:22:17] <gifti>	 fixed my issue
[06:23:52] <valhallasw`cloud>	 !log tools seems to have worked. SGE :(
[06:23:56] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[06:23:59] <wikibugs>	 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1591727 (10valhallasw) p:5Unbreak!>3High
[06:26:04] <wikibugs>	 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1591729 (10valhallasw) Restarting gridengine master seems to have helped -- but what happened to get the queues in this state?! More issues from the earlier NFS outage?...
[07:29:09] <valhallasw`cloud>	 Kelson42: webservice start should be OK now
[09:30:12] <wikibugs>	 6Labs, 10Labs-Infrastructure, 5Continuous-Integration-Isolation: Include Base::Standard-packages in labs images - https://phabricator.wikimedia.org/T94995#1592058 (10hashar) 5Open>3declined a:3hashar From T110735 , we now only apply a subset of `operations/puppet` since lot of parts are not easily appl...
[09:35:45] <wikibugs>	 6Labs, 10Labs-Infrastructure, 5Continuous-Integration-Isolation: Investigate non blocking fs resizing when instance is booted - https://phabricator.wikimedia.org/T104974#1592076 (10hashar) 5Open>3Resolved a:3hashar I have filled this tasks for instances booted from #labs images. The dib images using Je...
[10:10:29] <godog>	 I'm getting a puppet failure like this when applying a couple of classes to an existing and working instance (with self hosted puppet master)
[10:10:32] <godog>	 Error: Could not retrieve catalog from remote server: Error 400 on SERVER: No matching value for selector param '(undefined)' at /etc/puppet/modules/ldap/manifests/role/config.pp:8 on node test-cassandra5.monitoring.eqiad.wmflabs
[10:10:36] <godog>	 seen this before?
[10:10:50] <YuviPanda>	 godog: new instance?
[10:11:12] <YuviPanda>	 grrrit-wm: if so, restart nslcd and try again
[10:11:57] <godog>	 YuviPanda: yeah new instance
[10:12:37] <godog>	 no change after 'service nslcd restart'
[10:13:28] <godog>	 what I'm puzzled about is that I can go from an instance with a working puppet + self-hosted master to a non working one with the above by applying two classes
[10:14:07] <godog>	 thanks anyway
[10:14:45] <YuviPanda>	 :(
[10:14:55] <YuviPanda>	 andrewbogott: has been investigating these a lot more than I have, he might have more info
[10:16:30] <godog>	 ack, is it tracked in phab btw?
[10:16:54] <YuviPanda>	 godog: think so, let me find a link
[10:17:36] <YuviPanda>	 godog: hmm, https://phabricator.wikimedia.org/T110891 says it was fixed
[10:18:07] <YuviPanda>	 godog: I can take a look if you'd like
[10:18:50] <YuviPanda>	 godog: hmm, looks like $site is undefined for some reason
[10:19:03] <godog>	 YuviPanda: yeah, sure test-cassandra5.eqiad.wmflabs
[10:21:14] <YuviPanda>	 godog: hmm, so 
[10:21:16] <YuviPanda>	 eth0 is           inet addr:127.0.0.2  Bcast:0.0.0.0  Mask:255.255.255.255
[10:21:25] <YuviPanda>	 godog: which is the underlying problem, I think
[10:21:52] <YuviPanda>	 no idea how that happened
[10:21:55] <YuviPanda>	 it is supposed to get 10.xxx
[10:22:53] <godog>	 ah yeah that's from the classes I've applied, could be that
[10:23:48] <godog>	 YuviPanda: for some reason I'm being asked a sudo password and logging in as root doesn't seem to work, can you try removing those two?
[10:24:51] <godog>	 YuviPanda: nevermind I got it
[10:25:32] <YuviPanda>	 godog: ok!
[10:25:44] <YuviPanda>	 godog: so the classes you applied set eth0 IP?
[10:25:53] <YuviPanda>	 godog: so $::site is determined from the IP range
[10:26:03] <YuviPanda>	 godog: so not using a 10.68 range will mess up $::site causing problems
[10:27:46] <godog>	 ah I see
[10:27:57] <YuviPanda>	 godog: this is in manifests/realm.pp
[10:28:08] <YuviPanda>	 the LDAP thing was a red herring, sorry about that
[10:28:13] <YuviPanda>	 but yeah, realm.pp has the IP ranges
[10:29:11] <godog>	 sigh, I also misread the error message, grepping puppet for '(undefined)' would have found it
[10:29:26] <godog>	 sorry I'm really frustrated by puppet's failure modes
[10:29:46] <YuviPanda>	 '(undefined)' is maybe not the best there, it should perhaps be undef
[10:29:58] <YuviPanda>	 but I am worried about changing it, what if there was an actual reason it was '(undefined)' rather than undef
[10:30:52] <godog>	 possibly yeah, setting site based on the ip address doesn't seem ideal, I would have expected to be set when provisioning a machine and puppet reads it
[10:34:09] <YuviPanda>	 godog: yeah... it could maybe come from the fqdn
[10:36:52] <godog>	 heh, from experience making decisions based on the hostname is as fragile unfortunately
[10:37:22] <godog>	 I was thinking more of a regular file e.g. /etc/wikimedia/site
[10:37:40] <YuviPanda>	 where would that be populated?
[10:37:42] <YuviPanda>	 by the image?
[10:37:52] <godog>	 when first provisioning the image yeah
[10:38:49] <godog>	 IMO in general puppet shouldn't decide/detect over things that are not supposed to change over the lifespan of a machine
[10:39:24] <godog>	 (rambling, not an actual plan/idea)
[10:40:11] <YuviPanda>	 yeah, I agree
[10:40:16] <YuviPanda>	 site, realm...
[10:40:17] <YuviPanda>	 project
[10:41:12] <sDrewth>	 addshore: is the category aspect on in the beta cluster?
[10:41:24] <addshore>	 nope
[10:41:33] <sDrewth>	 k
[10:42:10] <addshore>	 spent over an hour now trying to reproduce this 1 thing
[10:42:22] <addshore>	 but everything is happening as I would expect
[10:42:52] * sDrewth shrugs, if it helps I was using touch.py
[10:43:03] <addshore>	 :D
[10:43:16] <addshore>	 yeh, well, thats the next step, until now I was just trying to reproduce it manually
[10:43:23] <addshore>	 guess now I will try with touch.py itself
[10:43:45] <addshore>	 if that doesn't work I can only guess there is something else interacting with it all!
[10:44:21] <sDrewth>	 there was a lot going on at the time with category stuff
[11:35:30] <addshore>	 sDrewth: were you touching all of the pages <<< valhallasw`cloud 
[11:35:44] <addshore>	 or just touching 1 and then saw all of the category changes?
[11:36:04] <sDrewth>	 I was touching all pages
[11:36:09] <addshore>	 hmmm, okay
[11:36:25] <sDrewth>	 we had an issue where they were not assigned to commons image
[11:36:36] <addshore>	 and you wern't touching the File itself, but all of the Side: pages?
[11:36:40] <sDrewth>	 we still have an issue
[11:36:43] <sDrewth>	 correct
[11:36:59] <addshore>	 hot assigned to commons image?
[11:37:01] <addshore>	 *not
[11:37:27] <sDrewth>	 file usge
[11:37:29] <sDrewth>	 usage
[11:37:41] <addshore>	 ahhh, so commons didn't show the page as using the file?
[11:37:44] <sDrewth>	 https://phabricator.wikimedia.org/T108799
[11:37:49] <sDrewth>	 correct
[11:38:31] <sDrewth>	 and I still haven't worked out why the my bot is not showing as a bot on edits either
[11:38:56] <sDrewth>	 separate issue however
[12:34:26] <wikibugs>	 6Labs, 6operations, 3Labs-sprint-112, 5Patch-For-Review: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1592521 (10yuvipanda) Ok, so the problem was that the cleanup script wasn't being triggered by any means automatically. Should be fixed now - need to...
[12:34:39] <wikibugs>	 6Labs, 6operations, 3Labs-sprint-112, 5Patch-For-Review: labstore1002 out of space in vg to create new snapshots - https://phabricator.wikimedia.org/T109954#1592523 (10yuvipanda) a:3yuvipanda
[13:08:24] <wikibugs>	 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1592678 (10scfc) I did disable and later enable queues on some hosts (cf. T109485#1591475).  After that, only the queue for `tools-webgrid-lighttpd-1411` was disabled,...
[13:20:12] <wikibugs>	 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1592706 (10valhallasw) I think it's unlikely that was the cause, because *all* queues were in this weird state (qhost didn't report high load). I think it's because of...
[14:12:35] <wikibugs>	 6Labs, 3Labs-sprint-112, 3ToolLabs-Goals-Q4: Fix documentation & puppetization for labs NFS - https://phabricator.wikimedia.org/T88723#1592926 (10mark)
[14:52:38] <wikibugs>	 6Labs, 10Wikimedia-Mailing-lists: expand labs listinfo pages and link them to eachother - https://phabricator.wikimedia.org/T97480#1593106 (10Dzahn) I don't think it's right that this has been moved to "Shell/site". HTML templates allow for a list admin to redesign the listinfo pages as they see fit, only usin...
[15:01:56] <wikibugs>	 6Labs, 10Wikimedia-Mailing-lists: expand labs listinfo pages and link them to eachother - https://phabricator.wikimedia.org/T97480#1593147 (10JohnLewis) I think I flipped the workboard quickly while going through.
[15:03:03] <JohnFLewis>	 YuviPanda: could you take a look at ^ / https://phabricator.wikimedia.org/T97480 when you get a few seconds. might be a nice thing to do with the aliases for labs-announce on labs-l if you haven't already :)
[15:03:53] <YuviPanda>	 JohnFLewis: I have no idea how to do any of that... :(
[15:06:10] <valhallasw`cloud>	 somewhere in mailman
[15:06:11] <JohnFLewis>	 YuviPanda: it's html code changes on the listinfos (https://lists.wikimedia.org/mailman/edithtml/labs-l/listinfo.html)
[15:06:15] <valhallasw`cloud>	 :D
[15:26:01] <grrrit-wm>	 (03PS1) 10Jean-Frédéric: Do not add interwiki's on unused images pages to unbreak bot [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235244 (https://phabricator.wikimedia.org/T110829) 
[15:35:49] <grrrit-wm>	 (03PS1) 10Jean-Frédéric: Do not add interwiki's on unused images pages to unbreak bot [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235247 (https://phabricator.wikimedia.org/T110829) 
[15:37:02] <grrrit-wm>	 (03CR) 10Jean-Frédéric: [C: 032 V: 032] Do not add interwiki's on unused images pages to unbreak bot [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235247 (https://phabricator.wikimedia.org/T110829) (owner: 10Jean-Frédéric)
[15:44:32] <andrewbogott>	 YuviPanda, valhallasw`cloud, looks like git_pull_cdnjs is failing again.  Not for lack of inodes this time
[15:44:42] <YuviPanda>	 andrewbogott: let me take a look
[15:44:51] <valhallasw`cloud>	 probably git pull doing weird merges again
[15:44:55] <andrewbogott>	 easy to reproduce on tools-web-static-01
[15:45:00] <YuviPanda>	 also why no shinken-wm here?
[15:45:11] <valhallasw`cloud>	 because you shut it down at some point
[15:45:15] <YuviPanda>	 andrewbogott: btw, can you do https://phabricator.wikimedia.org/T110698? 
[15:45:15] <valhallasw`cloud>	 with the NFS failure I think
[15:45:28] <YuviPanda>	 ah yeah
[15:45:39] <YuviPanda>	 but puppet was supposed to start it back up!
[15:45:50] <andrewbogott>	 YuviPanda: yeah, I can… backups are on labstore2001, right?  Is the filesystem there obvious?
[15:46:14] <YuviPanda>	 andrewbogott: should be...
[15:46:20] <YuviPanda>	 andrewbogott: is under /srv/eqiad I think
[15:46:24] <valhallasw`cloud>	 andrewbogott: yeah, there's a whole bunch of weird merge commits again. I'll git reset --hard for now.
[15:46:46] <andrewbogott>	 YuviPanda: and what are the actual mechanics of the restore?  Just scp back to labstore1002?
[15:47:07] <YuviPanda>	 andrewbogott: no, scp to localhost and then to the parsoid-spof instance
[15:47:11] <valhallasw`cloud>	 !log tools git reset --hard cdnjs on tools-web-static-01
[15:47:15] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[15:47:20] <andrewbogott>	 oh, sure
[15:47:29] <valhallasw`cloud>	 I don't know why we get merge commits, but I'm guessing upstream is force-pushing things
[15:47:37] <andrewbogott>	 ok.  YuviPanda, I’m sure telling me how to do this is harder than doing it yourself but I nonetheless appreciate knowing how :)
[15:47:54] <valhallasw`cloud>	 I'll investigate later...
[15:48:18] <YuviPanda>	 andrewbogott: :D so this is from an 'archived' project - I'm not sure if we ever actually ran that script?
[15:48:36] <andrewbogott>	 Oh, yeah, I did.  Crap, so maybe they aren’t in backups
[15:48:41] <andrewbogott>	 well, anyway, I’ll find ‘em
[15:48:51] <YuviPanda>	 andrewbogott: yeah, so it should be in wherever the archived stuff is
[15:49:08] <YuviPanda>	 andrewbogott: then just scp it to your local machine, then scp it to the parsoid-spof instance (in the VE project)
[15:49:20] <YuviPanda>	 andrewbogott: they're very well aware it's a SPOF (hence the name...) 
[15:50:04] <andrewbogott>	 YuviPanda: for future reference, all of that stuff is tarballed in /srv/others/orphan-volumes on labstore1002
[15:51:47] <YuviPanda>	 andrewbogott: ah, ok!
[15:53:08] <andrewbogott>	 godog: still having issues with a new instance?
[15:53:33] <JJ_>	 HELLO
[15:53:53] <YuviPanda>	 andrewbogott: we sorted it out
[15:54:07] <JJ_>	 ok
[15:55:19] <godog>	 andrewbogott: yup we did, PEBCAK as it turns out
[15:55:20] <andrewbogott>	 ah, ok :)
[15:55:20] <JJ_>	 good to hear
[15:55:20] <JJ_>	 glad everything is ok
[15:55:21] <JJ_>	 and sorted out
[15:55:21] <andrewbogott>	 JJ_ you’re a chatbot, yes?
[15:55:21] <JJ_>	 big up
[15:55:21] <JJ_>	 nah
[15:55:21] <JJ_>	 im human
[15:55:22] <andrewbogott>	 bah, I wish I could remember how to op myself
[15:55:22] <YuviPanda>	 godog: and $realm calculation being strange
[15:55:22] <JJ_>	 so confused with the internet
[15:56:19] <JJ_>	 are we coding things on PC which is directly affecting physical reality?
[15:56:21] <godog>	 YuviPanda: yeah! I found a ticket about that and updated
[15:58:39] <wikibugs>	 6Labs, 3Labs-sprint-112: Restore some files from /home/gwicke - https://phabricator.wikimedia.org/T110698#1593408 (10Andrew) On a good day this is an easy thing to do, but NFS is overextended right now so we need to do some infrastructure work before I can restore your archive.  Please ping me in a few days an...
[15:59:19] <wikibugs>	 6Labs, 3Labs-sprint-112: Restore some files from /home/gwicke - https://phabricator.wikimedia.org/T110698#1593415 (10yuvipanda) a:5yuvipanda>3None
[16:07:52] <grrrit-wm>	 (03PS1) 10Jean-Frédéric: Specify site to use when specifying NamespaceFilterPageGenerator [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235260 (https://phabricator.wikimedia.org/T110420) 
[16:08:22] <grrrit-wm>	 (03CR) 10Jean-Frédéric: [C: 032 V: 032] Specify site to use when specifying NamespaceFilterPageGenerator [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235260 (https://phabricator.wikimedia.org/T110420) (owner: 10Jean-Frédéric)
[16:22:51] <shinken-wm>	 RECOVERY - Puppet failure on tools-web-static-01 is OK Less than 1.00% above the threshold [0.0]
[16:23:33] <wikibugs>	 6Labs, 10Tool-Labs: sql script does not accept wildcards as parameter - https://phabricator.wikimedia.org/T75595#1593518 (10scfc) Thanks for the patch.  The quoting of `"${*:2}"` works for the case where an SQL statement is supplied as a command line argument, but fails for when `sql` is only used to access th...
[16:40:22] <wikibugs>	 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1593560 (10Merl) ``` np_load_avg=1.810000 (= 0.060000 + 0.50 * 14.000000 with nproc=4) >= 1.75 = np_load_avg +  job_load_adjustments * [weight number of jobs started in...
[16:47:26] <valhallasw`cloud>	 andrewbogott / YuviPanda, I get the feeling the cdnjs development model is 'we revert things by force pushing'
[16:47:53] <YuviPanda>	 I wonder if that's just git being too big for what they do
[16:48:07] <YuviPanda>	 valhallasw`cloud: I also wonder if we can ditch git::clone and just have a shallow clone that's force reset
[16:48:19] <andrewbogott>	 maybe the puppet git class needs to have an option of just doing ‘reset —hard origin’ after every checkout
[16:48:50] <valhallasw`cloud>	 we can svn clone the repo :P
[16:48:51] <andrewbogott>	 We don’t need to support local patches do we?
[16:48:51] <YuviPanda>	 andrewbogott: nope
[16:48:51] * YuviPanda hits valhallasw`cloud with gridengine :P
[16:48:52] <valhallasw`cloud>	 yeah, but it's probably better to raise alarms on force pushes
[16:48:52] <andrewbogott>	 was the problem with the checkout or with the rebase?
[16:48:56] <YuviPanda>	 but we don't actually care!
[16:48:59] <YuviPanda>	 if they force push
[16:49:00] <YuviPanda>	 right?
[16:49:03] <valhallasw`cloud>	 andrewbogott: https://github.com/cdnjs/cdnjs/issues/5587
[16:49:05] <valhallasw`cloud>	 well
[16:49:13] <valhallasw`cloud>	 force push is a great way to sneakily insert stuff
[16:49:27] <valhallasw`cloud>	 but we don't actually check what they do
[16:49:30] <valhallasw`cloud>	 so *shrug*
[16:50:03] <YuviPanda>	 yeah
[16:51:06] <valhallasw`cloud>	 andrewbogott: so what happens is: they force push to remove a package X, git merges so we end up with a merged version which includes package X. Then at some point they re-add package X and we get a merge conflict
[16:51:47] <andrewbogott>	 yeah, so reset origin would avoid that
[16:52:02] <valhallasw`cloud>	 but yeah, I think we should probably make the git::pull thing either pull-without-merge /or/ fetch-reset-hard
[16:52:38] <valhallasw`cloud>	 so  'git pull --ff-only' vs 'git fetch && git reset --hard origin/master'
[16:52:43] * andrewbogott thinks that ‘git pull’ should be be deprecated, anyway
[16:53:07] <andrewbogott>	 as long as there’s no danger of people being upset about local patches… I like the latter.
[16:53:11] <andrewbogott>	 We know for sure what we’re getting
[16:54:56] <valhallasw`cloud>	 I think --ff-only is probably better as default option
[16:55:05] <valhallasw`cloud>	 guaranteed to not do weird stuff or lose information without raising hell
[16:55:39] <andrewbogott>	 oh yeah, the git class definitely shouldn’t rebase by default!  But in this case it might be appropriate.
[16:55:58] <valhallasw`cloud>	 ah yeah, for this case the reset option is what we want
[17:00:57] <shinken-wm>	 PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[17:05:47] <shinken-wm>	 RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 878928 bytes in 2.319 second response time
[17:10:51] <YuviPanda>	 valhallasw`cloud: andrewbogott ^ just as a fyi, ^ is just mark rebuilding the RAID array. apparently we were one disk failure away from data loss since the outage
[17:11:02] <valhallasw`cloud>	 fun fun fun
[17:12:21] * valhallasw`cloud cheers mark on
[17:12:35] <andrewbogott>	 gluster just looks better and better
[17:13:14] <mark>	 some more people taking on labs raid/nfs storage would help too :)
[17:13:52] * mark wonders why he can't login on tools exec nodes
[17:14:07] <andrewbogott>	 mark, want me to add you?
[17:14:11] <mark>	 yes please
[17:14:35] <andrewbogott>	 I’m guessing your username on wikitech is ‘mark'
[17:14:41] <mark>	 I guess ;)
[17:15:00] <andrewbogott>	 oh, hm, you’re already project admin
[17:15:15] <andrewbogott>	 you can log in to other labs boxes but not tools?
[17:15:21] <andrewbogott>	 Or other tools boxes but not exec nodes?
[17:15:27] <mark>	 oh
[17:15:29] <mark>	 it's totally my fault
[17:15:32] <mark>	 sorry, can't trust a manager
[17:15:36] <andrewbogott>	 ok :)
[17:17:03] * andrewbogott -> lunch
[17:17:04] <shinken-wm>	 PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[17:17:44] <YuviPanda>	 (intermittent failures only - wfm ^)
[17:18:54] <mark>	 i'd still prefer it if that were not the case :P
[17:19:29] <andrewbogott>	 grrr, any idea why that’s flapping?  Every time I look it’s up but shinken disagrees
[17:19:51] <YuviPanda>	 andrewbogott: I think it just failes intermittently because of NFS failures and shinken catches it and we don't
[17:20:00] <mark>	 just slow
[17:20:40] <YuviPanda>	 yeah it can just timeout because it attempts to read NFS at exactly the wrong time?
[17:20:53] <mark>	 yes
[17:20:56] <YuviPanda>	 it's also a mess of PHP scripts that nobody wants to look at much :(
[17:20:59] <mark>	 i wonder what the test timeout is
[17:23:14] <YuviPanda>	 it's whatever the default check_http timeout is
[17:23:47] <valhallasw`cloud>	 YuviPanda: it's also the handler for all 500s of other tools :|
[17:26:04] <mark>	 non-tool labs rebuild is done btw
[17:28:42] <Luke081515>	 hm, the beta cluster is very slow at the moment, I think
[17:29:49] <shinken-wm>	 RECOVERY - Puppet failure on tools-web-static-02 is OK Less than 1.00% above the threshold [0.0]
[17:44:23] * mark is on a random exec node and sees it do quite an insane nr of NFS requests per second
[17:44:23] <mark>	 like 10kreq/s
[17:46:10] <gifti>	 is it mine by chance?
[17:46:14] <mark>	 no idea
[17:46:16] <mark>	 tools-exec-1403
[17:46:25] <gifti>	 ok, then not, phew :)
[17:46:42] <mark>	 but if there's anything anyone can do to reduce load on Tools NFS, i'd recommend doing ti :)
[17:47:03] <gifti>	 just now or in general?
[17:47:10] <mark>	 certainly now
[17:47:21] <mark>	 and it's never a great idea to put a lot of load on NFS I'd say, that's something we're discouraging :)
[17:47:31] <mark>	 but now because of a raid rebuild
[17:55:41] <Kelson42>	 YuviPanda: hi
[17:55:42] <Kelson42>	 valhallasw`cloud: thx, now it works
[17:56:25] <Kelson42>	 YuviPanda: on toolabs if "webservice start" timeouts... then "webservice start" tells you "Your webservice is already running"
[17:56:41] <Kelson42>	 YuviPanda: but this is not the case!?
[17:58:06] <mark>	 server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner  
[17:58:06] <mark>	 26        0% 0         0% 0         0% 0         0% 0         0% 2658196872 73% 
[17:58:47] <mark>	 something is doing a whole lot of locking on nfs
[17:59:22] <gifti>	 have you tools to find out more?
[18:01:16] <valhallasw`cloud>	 Kelson42: which tool? it can take a bit of time before the webservice is available
[18:02:36] <Kelson42>	 valhallasw`cloud: enwp10
[18:02:36] <Kelson42>	 valhallasw`cloud: ok, but if "webservice start" tells you "Timeout: could not start job", then the service should not start!
[18:02:59] <valhallasw`cloud>	 Kelson42: webservice seems up
[18:03:02] <Kelson42>	 tools.enwp10@tools-bastion-01:~$ webservice stop
[18:03:03] <Kelson42>	 Stopping web service
[18:03:03] <Kelson42>	 tools.enwp10@tools-bastion-01:~$ webservice start
[18:03:03] <Kelson42>	 Starting web service..............................Timeout: could not start job in 30stools.enwp10@tools-bastion-01:~$ 
[18:03:03] <Kelson42>	 tools.enwp10@tools-bastion-01:~$ webservice start
[18:03:05] <Kelson42>	 Your webservice is already running
[18:04:02] <Kelson42>	 valhallasw`cloud: yes, it works but the problem is that the webservice status is not trustable at all
[18:04:14] <valhallasw`cloud>	 I don't see how that's the case. The timeout message could be a bit clearer, but webservice status just asks qstat whether the webservice is running.
[18:04:35] <valhallasw`cloud>	 tools.enwp10@tools-bastion-01:~$ webservice status
[18:04:35] <valhallasw`cloud>	 Your webservice is running
[18:04:36] <valhallasw`cloud>	 and it is.
[18:05:25] <Kelson42>	 valhallasw`cloud: how to you explain in the log I have posted, that the webservice is running, although one line above it stops with a timeout?
[18:06:03] <valhallasw`cloud>	 because the webservice didn't stat in 30 seconds
[18:06:24] <valhallasw`cloud>	 it did, however, start after a longer period of time
[18:06:52] <Kelson42>	 valhallasw`cloud: ok, then indeed the message (the behaviour of "webservice start") is totally misleading
[18:08:46] <valhallasw`cloud>	 Kelson42: I disagree, but feel free to open a bug on Phabricator.
[18:10:53] <YuviPanda>	 valhallasw`cloud: Kelson42 I think maybe the web service monitor started it after your start failed 
[18:10:54] <YuviPanda>	 ?
[18:11:04] <YuviPanda>	 See service.log for an entry 
[18:11:07] <valhallasw`cloud>	 no, I think the start is just a bit slow because of NFS
[18:12:33] <RichSmith>	 Well, ClueBot gave up :P
[18:12:44] <Kelson42>	 YuviPanda: I have a lot of " No running webservice job found, attempting to start it"
[18:13:34] <YuviPanda>	 mark: do we have an eta for the rebuild?
[18:13:42] <mark>	 weeks
[18:13:44] <Kelson42>	 YuviPanda: I just want to write a small script restarting automatically the webservice if for some reason it dies... so need to rely on something to run "webservice start"....
[18:13:58] <mark>	 i just eased the rebuild speed
[18:14:08] <mark>	 so, i'm monitoring nfs load on a random tools exec node
[18:14:11] <mark>	 and all it's doing is file locking
[18:14:14] <YuviPanda>	 Kelson42: please don't do that. We have web service monitor that does that 
[18:14:38] <YuviPanda>	 Kelson42: right now there is NFS slowdowns causing issues probably. 
[18:14:39] <valhallasw`cloud>	 there's nothing in the service.log from today (just from yesterday and before that)
[18:15:36] <Kelson42>	 YuviPanda: any solution to autorestart services (avoid me to monitor this manually)?
[18:15:39] <valhallasw`cloud>	 so it's just slowness in starting the job
[18:15:49] <valhallasw`cloud>	 Kelson42: webservices /are/ autorestarted
[18:16:05] <Kelson42>	 valhallasw`cloud: ok, then its perfect. 
[18:16:22] <Kelson42>	 YuviPanda: valhallasw`cloud thx again for the support today. Wish you the best to fix last problems with NFS.
[18:16:31] <valhallasw`cloud>	 you're welcome
[18:21:08] <ostriches>	 wikitech wonky at the moment too?
[18:24:05] <ostriches>	 Can't get any OSM functions to work on wikitech. Already tried the login/logout dance.
[18:26:45] <valhallasw`cloud>	 ostriches: what do you mean with 'OSM functions'?
[18:26:45] <YuviPanda>	 ostriches: have you tried 'remove yourself from the project and get re-added'?
[18:27:14] <ostriches>	 valhallasw`cloud: OpenStackManager stuffs. Like adjusting public IPs, instances, etc.
[18:27:18] <ostriches>	 YuviPanda: No.
[18:27:41] <valhallasw`cloud>	 yeah, that's probably the issue that's solved by removing-and-readding
[18:27:46] <valhallasw`cloud>	 YuviPanda: btw, can you do that for me for tools?
[18:27:51] <valhallasw`cloud>	 and toolsbeta
[18:27:54] <YuviPanda>	 valhallasw`cloud: yes, let me do that
[18:28:23] <ostriches>	 YuviPanda: Removed myself from deployment-prep, can you readd?
[18:28:28] <ostriches>	 +admin
[18:28:30] <YuviPanda>	 ostriches: yes
[18:28:52] <YuviPanda>	 ostriches: what's your username?
[18:28:58] <ostriches>	 "Chad"
[18:29:49] <YuviPanda>	 the 'delete' button is dangerously close to 'add member'
[18:30:08] <ostriches>	 lol, yerp
[18:31:05] <YuviPanda>	 ostriches: done
[18:31:13] <YuviPanda>	 valhallasw`cloud: doing for you now, on tools
[18:32:05] <ostriches>	 YuviPanda: That did it, thx
[18:34:06] <YuviPanda>	 valhallasw`cloud: done for tools
[18:34:14] <YuviPanda>	 valhallasw`cloud: also, heh, I still can't spell or pronounce your name :P
[18:34:50] <valhallasw`cloud>	 hahaha. Yeah, I don't think anyone non-dutch can spell my name.
[18:34:55] <valhallasw`cloud>	 Merljin is very common
[18:35:16] <RichSmith>	 Is it pronouced like Merlin?
[18:35:50] <YuviPanda>	 valhallasw`cloud: yeah, I'm sticking to just saying 'valhalla' :P
[18:36:49] <valhallasw`cloud>	 RichSmith: sort of. It's indeed the same name, but with the dutch ij digraph in there. https://en.wikipedia.org/wiki/IJ_(digraph) has a voice sample
[18:47:04] <nuria>	 hello!!!! I have not logged in on labs in three months and now .. ahem... i can't seem to be able to ssh
[18:47:15] <nuria>	 i saw a note about bastion and nfs
[18:47:29] <nuria>	 but i am not sure what do i need to change on my ssh config
[18:50:34] <andrewbogott>	 nuria: can you tell me what’s happening when you try to log in?
[18:50:46] <nuria>	 andrewbogott: hello!, yes
[18:51:18] <nuria>	 https://www.irccloud.com/pastebin/ncxW1Xgq/
[18:52:42] <nuria>	 andrewbogott: i can connect to bastion
[18:53:02] <nuria>	 (bast1001.wikimedia.org)
[18:53:25] <andrewbogott>	 I think your labs username is different from your prod username
[18:53:34] <andrewbogott>	 do you know what the labs name is, or shall I look it up?
[18:54:31] <YuviPanda>	 valhallasw`cloud: re-added for toolsbeta too, btw
[18:54:38] <valhallasw`cloud>	 thanks
[18:55:06] <nuria>	 andrewbogott: both was 'nuria'
[18:55:15] <nuria>	 andrewbogott: orrr
[18:55:22] <andrewbogott>	 ok, then you’ll need to ssh nuria@<whatever>
[18:55:24] <nuria>	 andrewbogott: i had /home/nuria
[18:55:37] <andrewbogott>	 right now I see you trying to log in as ‘nuriaruiz'
[18:55:51] <andrewbogott>	 which labs does not recognize as a valid user.  Presumably that’s your name on your local system?
[18:56:53] <andrewbogott>	 looks like it’s working now
[18:56:56] <nuria>	 andrewbogott: ah, yes, i guess i forgot i did that before? 
[18:57:04] <nuria>	 andrewbogott: sorry about that!
[18:57:08] <nuria>	 andrewbogott: and many thanks
[18:57:12] <wikibugs>	 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1594077 (10valhallasw) Thanks you for that explanation, that helps a lot!  A bit more post-mortem work. From the accounting file, I got a list of finished jobs starting...
[18:57:15] <andrewbogott>	 no problem :)
[18:59:05] <wikibugs>	 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1594098 (10valhallasw) What's also odd is that the first report on IRC came in at 02:34 CEST = 00:34 UTC, while there doesn't seem to be anything in the accounting log...
[19:00:43] <nuria>	 andrewbogott: could i possibly have my homedir restored? 
[19:00:50] <nuria>	 andrewbogott: if it is a big deal np
[19:01:29] <valhallasw`cloud>	 YuviPanda: btw, multichill suggested we could choose to not care about multi-tenancy logstash, and just dump everything readable for all. Most log files already are, after all....
[19:01:53] <YuviPanda>	 most log files aren't world readable, are they?
[19:01:55] <YuviPanda>	 of course they are.
[19:01:58] <YuviPanda>	 hahahahaha
[19:02:05] * YuviPanda goes to find a wall somewhere to bash his head at
[19:02:19] <valhallasw`cloud>	 access.log is pretty world-readable by default, I think :-p
[19:02:26] <YuviPanda>	 yeah, but error.log
[19:02:35] <YuviPanda>	 I bet a lot of secrets are too
[19:03:18] <valhallasw`cloud>	 I'm not sure what in error.log should be considered very secret, to be honest
[19:03:41] <andrewbogott>	 YuviPanda, do you know what’s happening with /home in project analytics?
[19:03:56] <andrewbogott>	 it’s not marked for archiving but neither is it mounted on this instance...
[19:04:07] <YuviPanda>	 andrewbogott: yes, there's a bug for it.
[19:04:48] <YuviPanda>	 jesus, phab search sucks
[19:05:52] <YuviPanda>	 well I can't find it
[19:06:32] <wikibugs>	 6Labs, 10Tool-Labs: SGE queues all overloaded / jobs not submitting although load averages are low - https://phabricator.wikimedia.org/T110994#1594154 (10scfc) My logging in T109485 was done fairly soon after the corresponding action, i. e. I //enabled// the queues ~ 3:00Z after seeing the pending queue growin...
[19:07:01] <YuviPanda>	 ok, fuck you too phabricator
[19:07:02] <YuviPanda>	 sigh
[19:07:44] * valhallasw`cloud hugs YuviPanda
[19:08:51] <YuviPanda>	 andrewbogott: ok, so what is happening is that NFS homedir mounts were killed during the big NFS outage
[19:09:00] <YuviPanda>	 andrewbogott: but they needed a way to keep doing backups, so /data/project is mounted
[19:09:14] <YuviPanda>	 andrewbogott: so nuria your homedir might be backed up on /data/project - take a look and recover what you'd like?
[19:10:55] <nuria>	 YuviPanda: k, 
[19:14:22] <wikibugs>	 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594216 (10Andrew) @scfc, is this still happening for you?  (If so, that's good, because I don't have any other test cases)
[19:14:51] <nuria>	 YuviPanda: many thanks
[19:26:48] <wikibugs>	 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1594306 (10hashar) That is an issue in jenkins-debian-glue on Trusty: ``` ... 00:00:01.212 Checking out Revision f275d97d7010b3bb2709d4a5211e2...
[19:33:01] <wikibugs>	 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594357 (10scfc) Yes, it is.  After deleting all cookies for `wikitech.wikimedia.org` and logging in again, "Tim Landscheidt" still cannot see any instances and "Tim Lan...
[19:35:09] <wikibugs>	 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1594360 (10hashar) I manually triggered the debian-glue job. It uses the `jessie` distribution as provided by the Debian project (no apt.wikim...
[19:43:09] <YuviPanda>	 valhallasw`cloud: ugh, webservicemonitor failing with qstat not workign
[19:44:32] <valhallasw`cloud>	 YuviPanda: qstat not working?
[19:44:41] <valhallasw`cloud>	 what's not working....?
[19:44:50] <YuviPanda>	 valhallasw`cloud: yeah, tools-services-02 stopped being a submit host
[19:44:56] <valhallasw`cloud>	 >_<
[19:44:59] <YuviPanda>	 valhallasw`cloud: and since webservicemonitor shells out to qstat that failed
[19:45:02] <YuviPanda>	 valhallasw`cloud: just re-added it
[19:45:03] <YuviPanda>	 but
[19:45:05] <YuviPanda>	 WHAT THE FUCK
[19:45:43] <YuviPanda>	 :P
[19:45:45] <valhallasw`cloud>	 scfc did something with submit hosts last night
[19:45:50] <YuviPanda>	 valhallasw`cloud: so it was missing tools-checker-01 as well
[19:45:52] <valhallasw`cloud>	 working on the hostname issue
[19:47:22] <valhallasw`cloud>	 https://phabricator.wikimedia.org/T110982#1591388
[19:48:00] <wikibugs>	 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1594380 (10hashar) We could get the target distribution from the `debian/changelog` file using:  export distribution=$(dpkg-parsechangelog --s...
[19:48:07] <valhallasw`cloud>	 YuviPanda: ^
[19:48:18] <valhallasw`cloud>	 it's listed there
[19:48:49] <YuviPanda>	 valhallasw`cloud: ah, ok
[19:48:50] <valhallasw`cloud>	 YuviPanda: ^
[19:50:01] <valhallasw`cloud>	 gah, irccloud
[19:50:46] <wikibugs>	 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Change sid pbuilder image name to 'unstable' - https://phabricator.wikimedia.org/T111097#1594382 (10hashar) 3NEW a:3akosiaris
[19:50:49] <YuviPanda>	 valhallasw`cloud: ah, I see
[19:51:07] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1594391 (10yuvipanda) Re-enabled it for tools-services-02, it was running webservicemonitor.
[19:51:11] <YuviPanda>	 valhallasw`cloud: haha, inorite
[19:53:28] <wikibugs>	 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Job labs-toollabs-debian-glue is failing for labs/toollabs repository - https://phabricator.wikimedia.org/T110939#1594400 (10hashar) p:5Triage>3Low
[20:07:14] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1206 is CRITICAL 22.22% of data above the critical threshold [0.0]
[20:14:26] <valhallasw`cloud>	 "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Reading data from Tools failed: NoMethodError: undefined method `[]' for nil:NilClass at /etc/puppet/manifests/realm.pp:65 on node tools-webgrid-lighttpd-1206.tools.eqiad.wmflabs"
[20:15:45] <valhallasw`cloud>	 andrewbogott: ^ anything to do with the juno upgrade?
[20:16:10] <andrewbogott>	 valhallasw`cloud: I don’t know! I will look.
[20:17:05] <andrewbogott>	 ‘realm’ comes from ldap I think
[20:17:14] <valhallasw`cloud>	 andrewbogott: might be DNS related (I think it's the ipresolve(hiera('labs_recursor'),4) line, but I have an old checkout)
[20:17:42] <valhallasw`cloud>	 that would explain the ruby error as well
[20:18:08] <valhallasw`cloud>	 yep, line 65 is "    $nameservers = [ ipresolve(hiera('labs_recursor'),4) ]"
[20:19:05] <andrewbogott>	 is this happening intermittently?  Puppet runs fine on tools-webgrid-lighttpd-1206 right now
[20:19:43] <valhallasw`cloud>	 might have been a hickup then -- I was responding to the shinken error
[20:20:59] <andrewbogott>	 there’s a similar thing that happens when instances get low on memory and factor errors out
[20:21:11] <andrewbogott>	 let me know if you see a pattern of that failure
[20:21:21] <andrewbogott>	 I was upgrading virt nodes just now, but I don’t know why that would affect dns...
[20:21:39] <Cyberpower678>	 YuviPanda, wasn't Coren supposed to be back by now?
[20:21:52] <andrewbogott>	 He’s out sick, he’ll be back when he’s back :(
[20:21:54] <valhallasw`cloud>	 I suppose it could be a memory issues (theres ~380M free which is not a lot)
[20:22:08] <valhallasw`cloud>	 Cyberpower678: also, why do you need coren specifically?
[20:22:17] <Cyberpower678>	 Oh dear.  Hopefully not these past two weeks he's been gone.
[20:22:19] <valhallasw`cloud>	 Cyberpower678: for your exec node question, please just open a task on phab
[20:27:46] <James_F>	 valhallasw`cloud: BTW, I tried to push a new grrrit-wm config patch last night and it failed miserably so I rolled back. Whenever you have some spare time (;-)) might be worth a poke?
[20:28:48] <Krinkle>	 Did the labs API break?
[20:28:53] <Krinkle>	 https://tools.wmflabs.org/nagf/?project=integration - 404 Project not found
[20:32:08] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1206 is OK Less than 1.00% above the threshold [0.0]
[20:32:44] <Krinkle>	 {"batchcomplete":"","query":{"novainstances":[]}}
[20:32:58] <Krinkle>	 https://wikitech.wikimedia.org/w/api.php?format=json&action=query&list=novainstances&niregion=eqiad&niproject=integration
[20:33:00] <Krinkle>	 wtf
[20:33:07] <Krinkle>	 niproject=cvn is working
[20:33:08] <Krinkle>	 weird
[20:33:25] <valhallasw`cloud>	 James_F: ehhhh, yes, sure. What went wrong?
[20:33:27] <Krinkle>	 hashar: 
[20:33:50] <valhallasw`cloud>	 James_F: did you just fab deploy?
[20:35:12] <ircnotifier>	 !log tools.lolrrit-wm valhallasw: Deployed c54dcb7da4aa9e56cd2f077171d0fd151d8e463a Follow-up ad0675b8: Use performance.* as the regex instead
[20:35:14] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[20:35:27] <James_F>	 valhallasw`cloud: I merged it, logged in, pulled/reapplied/job restarted and it didnt'.
[20:35:41] <James_F>	 valhallasw`cloud: joined as grrrit-wm1 not grrrit-wm, didn't join some channels, didn't emit events.
[20:35:48] <valhallasw`cloud>	 ah
[20:35:52] <James_F>	 valhallasw`cloud: And nothing in the logs.
[20:36:05] <James_F>	 In fact, exactly what just happened when you re-deployed. :-)
[20:36:35] <valhallasw`cloud>	 the logs are in logs/, that seems updated
[20:37:03] <valhallasw`cloud>	 but it doesn't get to the ssh state
[20:37:08] <James_F>	 Well, they didn't explain why it hadn't worked
[20:37:17] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1594599 (10scfc) Ah, https://wikitech.wikimedia.org/wiki/Hiera:Tools customizes `role::labs::tools::services::active_host`.  I had only looked at `hieradata/`, sorry.
[20:37:38] <valhallasw`cloud>	 ugh, ##wmt again
[20:37:42] <henna>	 d /win 2
[20:38:26] <valhallasw`cloud>	 although I can join
[20:38:27] <valhallasw`cloud>	 :/
[20:39:37] <JohnFLewis>	 valhallasw`cloud: kill it
[20:40:06] <wikibugs>	 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-111, 3Labs-sprint-112, 5Patch-For-Review: Update remaining virt nodes to OpenStack Juno - https://phabricator.wikimedia.org/T110886#1594605 (10Andrew) 5Open>3Resolved All labvirt100x hosts running Juno now, and the scheduler is back to normal.
[20:40:06] <wikibugs>	 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-111: Update Labs to OpenStack Juno - https://phabricator.wikimedia.org/T110047#1594607 (10Andrew)
[20:40:47] <ircnotifier>	 !log tools.lolrrit-wm valhallasw: Deployed 7cff198c3ac504e8b38520997bdd72d5cbc8481c remove ##wmt, cannot join
[20:40:50] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[20:41:21] <wikibugs>	 6Labs, 10Labs-Infrastructure, 3Labs-sprint-112: Update Labs to OpenStack Kilo - https://phabricator.wikimedia.org/T110045#1594610 (10Andrew) Scheduled for Wednesday 2015-09-09 16:00 UTC  Of course I need to write the related puppet config patches in the meantime.
[20:41:24] <valhallasw`cloud>	 aaah stupid grrit
[20:41:29] <valhallasw`cloud>	 yes, working now!
[20:42:04] <wikibugs>	 6Labs, 10Labs-Infrastructure: Keystone/Wikitech project membership messed up - https://phabricator.wikimedia.org/T110887#1594611 (10Andrew)
[20:43:04] <grrrit-wm>	 (03CR) 10Merlijn van Deen: "test" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/235356 (owner: 10Merlijn van Deen)
[20:43:15] <valhallasw`cloud>	 ...or is it? :9
[20:43:18] <valhallasw`cloud>	 yes.
[20:43:19] <valhallasw`cloud>	 good
[20:43:30] <valhallasw`cloud>	 James_F: so you did everything right, basically :-p
[20:43:55] <James_F>	 valhallasw`cloud: That's a really silly reason for it to fail. :-)
[20:43:56] <valhallasw`cloud>	 but the bot doesn't like not being able to join channels
[20:43:58] <valhallasw`cloud>	 I know
[20:44:28] <James_F>	 valhallasw`cloud: But thanks!
[20:45:01] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review: Remove modules/toollabs/files/host_aliases - https://phabricator.wikimedia.org/T109485#1594663 (10scfc) … so disabled ` tools-services-01` as submit host.
[20:48:13] <wikibugs>	 6Labs, 10Labs-Infrastructure, 3Labs-sprint-112: Update Labs to OpenStack Kilo - https://phabricator.wikimedia.org/T110045#1594683 (10Andrew)
[20:48:13] <James_F>	 valhallasw`cloud: Want me to push https://gerrit.wikimedia.org/r/#/c/235357/ now to show that I can? ;-)
[20:48:14] <wikibugs>	 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-111: Update Labs to OpenStack Juno - https://phabricator.wikimedia.org/T110047#1594682 (10Andrew) 5Open>3Resolved
[20:48:27] <valhallasw`cloud>	 James_F: protip: use fab deploy :-)
[20:48:37] <James_F>	 valhallasw`cloud: What is 'fab deploy'?
[20:48:46] <James_F>	 valhallasw`cloud: And why is it not documented?
[20:49:01] <valhallasw`cloud>	 James_F: because bug #1 ;-)
[20:49:11] <James_F>	 valhallasw`cloud: https://wikitech.wikimedia.org/wiki/Grrrit-wm :-P
[20:49:16] <valhallasw`cloud>	 fabric is a tool to automate deployment etc
[20:49:29] <James_F>	 OK.
[20:49:45] <James_F>	 And this is a tool that makes deploying grrrit-wm easier?
[20:50:16] <valhallasw`cloud>	 yes
[20:50:19] <Krinkle>	 We use fab to deploy Zuul changes as well (fab deploy_zuul; when inside integration/config.git), saves us a lot of time there at least
[20:50:40] <Krinkle>	 Though it's not much more than a glorified make file (e.g. make deploy)
[20:51:05] <valhallasw`cloud>	 James_F: https://wikitech.wikimedia.org/wiki/Grrrit-wm#Deploying ;-)
[20:51:29] <James_F>	 valhallasw`cloud: It auto-rebases `auth` onto master and deploys?
[20:51:35] <valhallasw`cloud>	 remotely, yes
[20:51:38] <James_F>	 Neat.
[20:51:43] <valhallasw`cloud>	 and auto-!log-s
[20:51:47] <James_F>	 Very neat.
[20:51:52] <valhallasw`cloud>	 fabfile.py is the actual code
[20:51:54] <Krinkle>	 so you run it from local workstation, not tools-login (and presumably starts ssh/)
[20:51:59] <valhallasw`cloud>	 yep
[20:52:31] <valhallasw`cloud>	 it does assume  'ssh tools-login.wmflabs.org' works as-is
[20:52:35] <valhallasw`cloud>	 I thik
[20:53:12] * James_F checks.
[20:53:42] <ircnotifier>	 !log tools.lolrrit-wm jforrester: Deployed 7cff198c3ac504e8b38520997bdd72d5cbc8481c remove ##wmt, cannot join
[20:53:44] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[20:53:52] <valhallasw`cloud>	 \o/
[20:53:57] <James_F>	 Hmm. Well, it seemed to work.
[20:54:16] <James_F>	 Though ideally it'd not do anything if master was already good to go…
[20:54:24] <James_F>	 s/good to go/live/
[20:54:29] <James_F>	 But neat.
[20:54:45] <valhallasw`cloud>	 yeah, that'd be even nicer, but it's mostly for a '+2, wait for merge, fab deploy' workflow
[20:55:06] * James_F nods.
[20:55:20] <James_F>	 When we move to Phabricator code review we can do it in oneline.
[20:55:30] <James_F>	 `arc merge && fab deploy`
[20:55:33] <James_F>	 Or whatever.
[20:55:42] <valhallasw`cloud>	 fab merge-deploy-ALL-the-things
[20:55:44] <valhallasw`cloud>	 ;-)
[20:55:50] * James_F grins.
[20:55:54] <James_F>	 Also, "When".
[20:55:56] * James_F coughs.
[20:56:13] <valhallasw`cloud>	 anyway. The wm1 wm2 wm386 thing is also a bit annoying, and I'm not completely sure what happens
[20:56:46] <valhallasw`cloud>	 I think it connects before the old connection is gone, or something like that
[20:57:04] <valhallasw`cloud>	 but again not very high priority...
[20:59:07] <valhallasw`cloud>	 ....or it's SGE doing something crazy again :(
[21:01:11] <valhallasw`cloud>	 !log tools killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately.
[21:01:15] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:02:21] * valhallasw`cloud goes cry in a corner again
[21:05:30] <wikibugs>	 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594715 (10Andrew) Are you able to view things as you'd expect within Horizon?  Or is your project membership broken there as well?
[21:06:40] <wikibugs>	 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594717 (10Andrew) Also, what does your project filter look like?  Are you still able to select projects that you belong to?
[21:09:49] <Krinkle>	 valhallasw`cloud: Still got a clone. Reporting twice.
[21:09:59] <valhallasw`cloud>	 arggghh
[21:10:10] <valhallasw`cloud>	 SGE, what are you doing :(
[21:10:39] <valhallasw`cloud>	 !log tools.lolrrit-wm now qdel'ing the job, but there's still one left. ARRRGGHGH
[21:10:42] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[21:10:45] <valhallasw`cloud>	 oh, there we go.
[21:11:13] <valhallasw`cloud>	 !log tools.lolrrit-wm and then we `fab start-job` and then hopefully everything is allright again....?
[21:11:15] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master
[21:11:36] <valhallasw`cloud>	 Krinkle: thanks for the ping. Should be OK now.
[21:11:47] <Krinkle>	 Yep
[21:11:48] <Krinkle>	 thx
[21:20:51] <wikibugs>	 6Labs, 10Wikimedia-Site-Requests, 10wikitech.wikimedia.org, 5Patch-For-Review: MWEchoNotificationEmailBundleJob causes exceptions due to delays not being supported by non-redis job queues - https://phabricator.wikimedia.org/T110985#1594779 (10Krenair) 5Open>3Resolved
[21:34:35] <wikibugs>	 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594842 (10Andrew) ok -- I'm still interested in the answers to the above questions but I've also put a small live-hack in place.  Does it affect behavior?
[21:49:27] <wikibugs>	 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1594918 (10scfc) I have tried https://horizon.wikimedia.org/ just now for the first time, and the list of instance there is perfectly fine.  In wikitech, yes, I was able...
[22:03:30] <wikibugs>	 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1411 unreachable and can't get password entries - https://phabricator.wikimedia.org/T110783#1594985 (10scfc) 5Open>3Resolved a:3scfc I have rebooted the instance via https://horizon.wikimedia.org/project/instances/, `ssh` worked fine afterwards, so I re-enabled...
[22:10:51] <[Crow]>	 Coren are you ATK by any chance? 
[22:11:06] <[Crow]>	 Or can someone restart CorenSearchBot on the Labs cluster?
[22:34:48] <wikibugs>	 6Labs, 6operations, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1595100 (10chasemp) Saw this again today on mw1142.    As before it was accompanied by:  `[2015-09-01 20:52:21.847] nc_proxy.c:330 client connections 935 exceed limit 93`  http://graphit...
[22:53:05] <wikibugs>	 6Labs, 10Parsoid, 10VisualEditor, 10wikitech.wikimedia.org: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776#1595175 (10Jdforrester-WMF)
[22:53:44] <wikibugs>	 6Labs, 3Labs-sprint-112: Restore some files from /home/gwicke - https://phabricator.wikimedia.org/T110698#1595214 (10GWicke) @Andrew, okay.   I'm sure you know this, but [it is possible to extract specific files without unpacking the entire tar](http://www.cyberciti.biz/faq/linux-unix-extracting-specific-files...
[23:27:14] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 55.56% of data above the critical threshold [0.0]