[00:46:38] Deployment-Systems, Release-Engineering-Team (Kanban), Scap (Tech Debt Sprint 2017-Q2), WorkType-NewFunctionality: Scap3 submodule space issues - https://phabricator.wikimedia.org/T137124#3593365 (mmodell)
[00:52:16] Gerrit, Release-Engineering-Team (Next), Scap (Tech Debt Sprint 2017-Q2), ORES, and 2 others: Simplify git-fat support for pulling from both production and labs - https://phabricator.wikimedia.org/T171758#3593368 (mmodell)
[00:52:59] Scap (Tech Debt Sprint 2017-Q2), scap2: scap3 should repack / pack-refs git repos under /srv/deployment - https://phabricator.wikimedia.org/T112509#3593370 (mmodell)
[00:54:05] Scap (Tech Debt Sprint 2017-Q2), scap2: Scap should touch symlinks when originals are touched - https://phabricator.wikimedia.org/T126306#3593372 (mmodell)
[00:54:29] Gerrit, Release-Engineering-Team (Next), Scap (Tech Debt Sprint 2017-Q2), ORES, and 2 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3593374 (awight)
[00:55:00] Release-Engineering-Team (Watching / External), Scap, ORES, Operations, Scoring-platform-team: ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619#3593377 (awight)
[00:56:35] Scap (Tech Debt Sprint 2017-Q2), WorkType-NewFunctionality: Play elevator music while scap is running - https://phabricator.wikimedia.org/T170484#3593378 (mmodell)
[01:02:22] Scap: Make symlink-swapping optional in deploy promote - https://phabricator.wikimedia.org/T145889#3593388 (mmodell)
[01:02:24] Scap, Phabricator: scap should provide a way to skip symlink-swapping in promote - https://phabricator.wikimedia.org/T172486#3593385 (mmodell)
[03:42:36] (CR) MaxSem: "This is not for branch creation, it's for commits after that. For example, yesterday I've messed master up while updating an extension: ht" [tools/release] - https://gerrit.wikimedia.org/r/376658 (https://phabricator.wikimedia.org/T175324) (owner: MaxSem)
[05:14:48] PROBLEM - Mediawiki Error Rate on graphite-labs is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [10.0]
[06:57:19] PROBLEM - Puppet errors on deployment-kafka01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[08:49:49] RECOVERY - Mediawiki Error Rate on graphite-labs is OK: OK: Less than 1.00% above the threshold [1.0]
[10:02:04] PROBLEM - Free space - all mounts on deployment-kafka01 is CRITICAL: CRITICAL: deployment-prep.deployment-kafka01.diskspace.root.byte_percentfree (<100.00%)
[11:08:55] thcipriani: just wondering (guess you won't reply for a while) but what were you doing for tags of the images? Anything I should follow?
[16:25:10] PROBLEM - Long lived cherry-picks on puppetmaster on deployment-puppetmaster02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:26:30] (CR) Umherirrender: "In the following example after remove of the use statement two newlines are there:" [tools/codesniffer] - https://gerrit.wikimedia.org/r/375547 (https://phabricator.wikimedia.org/T167694) (owner: Legoktm)
[20:35:07] (CR) Chad: "Don't use git review then? I keep telling people it's a pile of horse shit." [tools/release] - https://gerrit.wikimedia.org/r/376658 (https://phabricator.wikimedia.org/T175324) (owner: MaxSem)
[20:48:22] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.30.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T174362#3594007 (demon) a: demon
[20:48:24] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.30.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T174361#3594009 (demon) a: demon
[20:48:26] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.30.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T174360#3594011 (demon) a: demon
[20:48:28] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.30.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T174359#3594013 (demon) a: demon
[20:48:31] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.30.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T174358#3594015 (demon) a: demon
[20:48:32] Release-Engineering-Team (Kanban), Release, Train Deployments: 1.30.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T172806#3594017 (demon) a: demon
[20:55:39] hi, is there somebody online who can restart the irc-rc gateway for beta?
[21:01:23] the service itself is running, but connected to the ircd there aren't any channels active, and I can't find the irc bot
[21:10:06] Where's it even run?
[21:10:36] Sagan: what's the bot's name, do you know?
[21:10:49] Reedy: deployment-ircd
[21:11:37] Zppix: rc-pmtpa
[21:11:47] luke081515@deployment-ircd:~$ systemctl status ircecho
[21:11:47] ● ircecho.service - IRC bot for the MW RC IRCD
[21:11:58] but I don't have the rights for doing more than view the status
[21:12:05] deployment-ircd is a MW Changes IRC Broadcast Server (mw_rc_irc)
[21:12:08] Why ircd?
[21:12:11] badly named
[21:12:18] Can't you sudo?
[21:12:28] nope, I'm only a project member, not admin
[21:14:08] Sep 09 21:13:54 deployment-ircd systemd[1]: Started IRC bot for the MW RC IRCD.
[21:14:43] why on earth are we running that silly irc gateway in beta?
[21:14:54] Reedy: it works again, thx
[21:15:06] no_justification: because we *can* :P
[21:15:19] That's an absurd reason
[21:15:33] I can think of a lot of things I *can* do that are better ideas
[21:15:40] there was a reason when it was created I guess
[21:15:44] it's only a small instance
[21:16:04] The reason was "let's make beta EXACTLY like production"
[21:16:12] Laudable, but kinda silly for some things.
[21:16:43] ircd is a bad name
[21:16:49] it's not an irc server
[21:16:57] it hosts an ircd as well
[21:18:29] like kraz, the ircd and the bot is on the same host
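
A minimal sketch of the check-and-restart that got the beta RC feed going again, assuming sudo on deployment-ircd; the unit name comes from the systemctl output pasted above, the rest is illustrative:

    # Inspect the bot that relays MW recent changes to IRC, then bounce it
    systemctl status ircecho
    sudo systemctl restart ircecho
    sudo journalctl -u ircecho -n 20 --no-pager   # expect "Started IRC bot for the MW RC IRCD"
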
[21:31:43] Scap (Tech Debt Sprint 2017-Q2), WorkType-NewFunctionality: Play elevator music while scap is running - https://phabricator.wikimedia.org/T170484#3432960 (Reedy) Will this work with my preferred internet of shit music playing device?
[21:43:09] (PS4) MarcoAurelio: Whitelist Dvorapa on Zuul CI [integration/config] - https://gerrit.wikimedia.org/r/375765
[21:44:45] (CR) Zppix: [C: +1] Whitelist Dvorapa on Zuul CI [integration/config] - https://gerrit.wikimedia.org/r/375765 (owner: MarcoAurelio)
[21:46:40] Reedy: do you have +2 on integration/config ?
[21:48:12] I think so
[21:48:29] I can never get the deploying to work though
[21:52:17] Reedy: can you maybe review the change I just reviewed in config please
[22:11:16] thcipriani: around? I think nodepool may be jammed up but wanted a quick check from releng folks before I page A.ndrew
[22:15:55] !log `sudo journalctl -u nodepool --since today --no-pager` shows many LaunchStatusException failures.
[22:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:17:03] bd808: yeah, I can take a look; glancing at the zuul page, sure seems like it might be stuck :(
[22:17:08] andrewbogott: are you around? Most of the nodepool slots are stuck in "building" state
[22:20:50] poop. I can't login to the labnet boxes to look at those logs
[22:21:40] hrm, well I got an instance to delete...haven't seen a timeout in the logs. Seems like the instances are just not getting IP addresses?
[22:22:37] I pinged andrewbogott in a non-irc channel. I can poke c.hase next
[22:26:21] checking to see if I can launch a normal instance...
[22:27:24] well, starting to see "LaunchStatusException: Server 1ace8396-774a-4383-b2f4-37e9865d0da1 for node id: 814111 status: ERROR"
[22:28:43] saw 8 of those all at once for whatever reason; probably hit a timeout on the openstack side and errored out, I guess
[22:29:14] yeah. My bet is that something happened and now rabbitmq is fucked again
[22:29:28] * bd808 hated nodepool and its stampedes
[22:29:49] it's a little different this time, normally it times out deleting an instance and all instances are stuck in delete
[22:30:16] this time it seems able to delete instances (or at least I can delete them manually without hitting a timeout)
[22:30:58] and it's getting hostnames and node ids from openstack, it's just none of the images are getting IPs, so not sure.
[22:31:42] madhuvishy is coming to help look at it
[22:32:00] are instances created through horizon coming up?
[22:32:06] hey all
[22:32:17] howdy
[22:32:26] (PS1) Umherirrender: Skip function comments with @deprecated [tools/codesniffer] - https://gerrit.wikimedia.org/r/377025
[22:32:34] hi madhuvishy. symptom is instances not launching
[22:33:07] I can't get to logs on labnet1002 to see why :/
[22:33:40] (CR) Umherirrender: "Just an idea to skip some function comments in mediawiki/core like Title::getSquidURLs" [tools/codesniffer] - https://gerrit.wikimedia.org/r/377025 (owner: Umherirrender)
[22:33:51] bd808: right, looking.
[22:33:53] I'm trying to launch an instance from horizon, but it's stuck in build too
[22:34:13] > Deleting node id: 814113 which has been in building state for 0.32088120136 hours
[22:35:37] ^ line from the log that doesn't really contain any new information. It's what I see in nodepool logs after I see the "LaunchStatusException: [...] status: ERROR" line
[22:38:23] madhuvishy: I finally figured out how to get to the logs. Lots of oslo.messaging timeouts.
[22:38:35] restarting rabbitmq indicated, I think
[22:39:34] bd808: Yeah I am seeing the same
[22:40:29] bd808: do you know where to restart rabbit from
[22:40:37] (also labnet1001)
[22:41:11] A fine question. I was just looking for that on wikitech.
[22:41:24] https://www.irccloud.com/pastebin/Oz5BlwyU/
[22:42:21] madhuvishy: I think labcontrol1001.
[22:42:55] ^ from the sal: "restarting rabbitmq-server on labcontrol1001"
[22:43:23] yeah
[22:43:32] I found https://wikitech.wikimedia.org/wiki/Incident_documentation/20150812-LabsOutage
[22:43:40] yeah done
[22:44:07] {uptime,7} :)
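
A hedged sketch of the rabbitmq restart and sanity check just done, run on labcontrol1001 per the SAL entry; the exact invocation is an assumption, and the uptime line is what the "{uptime,7}" paste above refers to:

    # Bounce the message broker that nova and nodepool talk through
    sudo service rabbitmq-server restart
    # Confirm it came back and has only been up for a few seconds
    sudo rabbitmqctl status | grep uptime
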
[22:46:36] bd808: I'm still seeing timeout errors in the nova-network logs
[22:47:42] things like "No calling threads waiting for msg_id : fe9c3a1e83c44c48b4f6c27e0e5da04e" are logged after nodepool gave up, I think...
[22:48:28] hrm, I'm also still seeing launch errors, but I'll be interested to see if the newest instances being built work...deleting old ones manually now
[22:48:56] I'm trying an instance from horizon and it doesn't have an instance log yet :/
[22:49:11] bd808: yeah no instances are stuck spawning
[22:49:24] chasemp is suggesting stopping nodepool
[22:49:42] * thcipriani does
[22:49:50] I just did
[22:50:17] ok, explains why my command was hanging :)
[22:51:07] stopping nodepool is on the theory that it keeps the stampede going
[22:51:39] basically
[22:52:38] chasemp: I restarted rabbit again
[22:52:39] also, if it's not the source of the problem it's definitely making it worse / hiding the other problems by making so much noise
[22:52:45] things looking better I think?
[22:52:53] not sure, I'm clearing things out
[22:53:08] we'll see if nova can actually delete these instances
[22:53:18] and the clearing is working now
[22:53:19] afaik
[22:53:27] i just deleted a stuck instance
[22:53:37] yeah, there were so many in queue I guess it's coming up with new things to build even tho nodepool is down
[22:53:55] yeah
[22:53:59] I wonder if the new servers are somehow slower to build/delete and that has contributed to more stampede backlog
[22:54:09] it sure seems like this became a regular event after the new batch
[22:54:16] aah
[22:54:20] seems feasible
[22:54:30] dunno really
[22:54:44] madhuvishy: can you delete all the instances in admin-monitoring project?
[22:55:00] chasemp: yup doing
[22:55:08] there may be causality, but the whole thing has been on the ragged edge of failure for a year.
[22:55:08] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[22:55:29] so far I still am not convinced things are healthy
[22:55:36] which is why we want it used less rather than more
[22:55:40] me neither
[22:56:13] nova-network logs still not looking good
[22:56:39] bd808: want to call andrew?
[22:56:57] chasemp: yeah, I'll do that
[22:57:03] madhuvishy: nova-network or nova-api...?
[22:57:15] ah
[22:57:18] i'm seeing nova-network
[22:58:16] Error: Unable to delete instances: fullstackd-1504997231, fullstackd-1504996321, fullstackd-1504995416, fullstackd-1504994504, fullstackd-1504993600, fullstackd-1504992695
[22:58:21] Error: Unable to retrieve instances.
[22:58:30] I'm going to restart nova-network and nova-api in series, I guess
[22:58:46] that explains ^ I hope
[22:58:50] I left him a voicemail
[22:58:58] andrew said he's a few minutes out
[22:59:22] heh. we called at the same time I guess?
[22:59:27] he txtd
[22:59:35] okay, 1 instance deletion succeeded
[23:01:16] everything that was in build went straight to error for me, which could be good as state is catching up
[23:01:40] I cleared out fullstack instances
[23:02:01] nodepool thinks it has a bunch of stuff
[23:02:11] thcipriani: there is a way to hard clear out the nodepool pool, isn't there?
[23:02:15] I kind of recall this from last go
[23:02:24] madhuvishy: my delete is hanging or so
[23:02:26] idk
[23:02:38] eh...you can do: nodepool delete --now [nodepool-id]
[23:04:30] there's also the nodepool alien-list command that'll show all the ones nodepool "lost track of", according to https://wikitech.wikimedia.org/wiki/Nodepool
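
A rough sketch of the nodepool cleanup being discussed, run on the nodepool host (labnodepool1001); the node ID is just the example that appears in the log above, not a value to reuse:

    # See what nodepool thinks it has, and how many nodes are wedged in "building"
    nodepool list
    nodepool list | grep -c build
    # Force-delete a node that is stuck
    nodepool delete --now 814113
    # List instances that exist in OpenStack but that nodepool has lost track of
    nodepool alien-list
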
[23:06:56] The last time there were issues like this I just stopped the nodepool service for 10 minutes while all the queues cleared out...
[23:07:02] (I still haven't actually read the scrollback though)
[23:07:37] andrewbogott: I think it's been stopped for about 8 minutes now
[23:07:44] ok
[23:09:41] sorry I missed the original message, I was in a part of MN that apparently doesn't have cell coverage :/
[23:10:23] andrewbogott: it seemed like even post stopping nodepool things were still happening in the contintcloud project, like new builds popping up for a minute
[23:11:20] It would definitely take a while for all the queued messages to trickle out.
[23:11:39] things are mildly responsive for a minute
[23:11:44] I did see nodepool go through "graceful-stop" and stuff was still happening in the debug log briefly after you'd said you stopped it
[23:11:54] I'm trying to clean up contintcloud, andrewbogott, on labcontrol
[23:11:58] right now the scheduler says it can't reach any of the compute nodes
[23:12:09] andrewbogott: hm
[23:14:27] I'm cleaning out nodepool's pool locally
[23:14:37] deletes finally started working, andrewbogott, for me at least
[23:15:03] I'm restarting compute on labvirt1005, if that helps then I'll probably restart nova-compute everywhere
[23:15:08] andrewbogott: k
[23:15:27] andrewbogott: it seemed like nova-network had an issue where even after rabbit was in theory ok it needed a kick too, fyi
[23:15:36] I think the reconnect logic must be fragile
[23:15:53] yeah, I don't know why that is, it seems to depend on the particular order that new services come up
[23:15:59] probably it's conductor's fault
[23:16:12] ok, I cleaned out contintcloud on both sides
[23:16:21] andrewbogott: I'm going to restart nova-fullstack
[23:16:26] ok
[23:16:27] or when you think it may work :)
[23:16:34] ok trying
[23:17:15] The scheduler is able to reach more compute nodes now...
[23:17:28] andrewbogott: one small note, the salt-master proc on labcontrol was eating a whole cpu doing nothing, it seemed like
[23:17:29] so I stopped it
[23:17:38] should get restarted on puppet's run but it was weird
[23:17:51] (PS1) Umherirrender: Unifiy @returns and @throw in function docs [tools/codesniffer] - https://gerrit.wikimedia.org/r/377028
[23:17:55] andrewbogott: post nova-compute restart?
[23:17:58] most things are re-appearing on the scheduler
[23:18:01] without restarting
[23:18:12] so it was just taking forever to get back on rabbit and get organized, I guess
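
A loose consolidation of the service-level pokes described above; the hosts follow the conversation (nova-network/nova-api on the labnet host, nova-compute on labvirt1005 and friends, salt-master on labcontrol1001), and the exact unit names and invocations are assumptions:

    # After rabbitmq is back, kick the nova daemons whose reconnects got stuck
    sudo service nova-network restart   # labnet host
    sudo service nova-api restart       # labnet host
    sudo service nova-compute restart   # labvirt1005 first, then the rest if needed
    # salt-master was spinning a CPU doing nothing, so it was stopped; puppet brings it back
    sudo service salt-master stop       # labcontrol1001
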
[23:18:13] nova-fullstack still can't build
[23:18:21] or delete
[23:19:58] compute nodes are still grieving for timed-out messages
[23:22:37] why did this melt so badly?
[23:23:43] I still don't really know what's happening. It might not be any different from historic rabbit freakouts
[23:40:49] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[23:43:02] thcipriani: from nodepool start up
[23:43:03] 2017-09-09 23:42:27,930 ERROR nodepool.NodeCompleteThread: Exception handling event for integration-slave-trusty-1001:
[23:43:04] Traceback (most recent call last):
[23:43:04] File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 90, in run
[23:43:06] with self.nodepool.getDB().getSession() as session:
[23:43:08] File "/usr/lib/python2.7/dist-packages/nodepool/nodepool.py", line 1603, in getDB
[23:43:10] return self.config.db
[23:43:12] I saw that
[23:43:12] AttributeError: 'NoneType' object has no attribute 'db'
[23:43:14] * chasemp shrugs
[23:46:15] thcipriani: what does this mean in nodepool logs
[23:46:16] Deficit: ci-jessie-wikimedia: 113 (start: 135 min-ready: 12 ready: 22 capacity: 0)
[23:46:17] vs
[23:46:22] Deficit: ci-trusty-wikimedia: 10 (start: 13 min-ready: 2 ready: 3 capacity: 0)
[23:46:35] ci-jessie-wikimedia 135?
[23:47:13] that is the number of instances needed to clear out the backlog, I guess?
[23:47:31] but is it...issuing to start 135?
[23:48:01] 2017-09-09 23:46:33,327 DEBUG nodepool.NodePool: 2017-09-09 23:46:33,327 DEBUG nodepool.NodePool:
[23:48:23] it doesn't look like it is...the nodepool list queue is still at 25
[23:48:53] allocation request from the gearman queue, I would guess
[23:49:04] from gearman -> nodepool
[23:49:13] * chasemp nods
[23:49:17] nodepool list | grep -c build => 25
[23:49:43] thcipriani: so ... docker seems cool ;)
[23:49:49] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 38.46% of data above the critical threshold [140.0]
[23:50:31] dockerize everything
[23:50:52] productive comment: how do we help with the nodepool->docker effort?
[23:51:19] tx thcipriani, it seems like things are going from build to active now
[23:51:56] chasemp: I see stuff moving again! Thanks again (and again and again :() for all the help.
[23:52:05] legoktm: I think that addshore had started some work on that...
[23:52:41] last I heard there were some concerns about making the mega-image that would be needed to make one image that can test all the things
[23:53:05] Maybe that means that some other approach should be taken, but I don't know all the details
[23:53:21] legoktm: leave it to me
[23:53:27] I'm back in the UK now with better wifi
[23:53:31] :D
[23:53:44] addshore was dockerizing phan, I put some work into moving ops/puppet. hashar started work with some debian image builder to build a super image and we talked about it Friday (this image was 4.2 GB and I said it was a bad idea)
[23:54:11] I can start working on it a bit more rapidly as of Monday!
[23:54:39] Sleep now though
[23:54:53] legoktm: if you have any jobs particularly you want to do ping me or email me
[23:55:00] addshore: the python tox jobs
[23:55:28] those are well contained and usually our testing ground...those were some of the first jobs that went to nodepool
[23:55:28] Link me to some jobs ;)
[23:55:30] those would definitely be a nice discrete chunk to peel off
[23:55:42] I'm going to bed now though!
[23:55:47] addshore: "tox-jessie" it's literally just install python, and run "tox"
[23:55:54] addshore: night :)
[23:56:09] Sweet
[23:56:58] all of the initial work for moving stuff is contained in https://github.com/wikimedia/integration-config/tree/master/dockerfiles; moving tox jobs should fit into the pattern that's established there nicely.
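
A back-of-the-envelope sketch of what "tox-jessie" boils down to per the description above (install Python, run tox), expressed as a throwaway container run; the base image and package choices are illustrative, not the actual integration-config Dockerfiles:

    # Run a repo's tox suite inside a disposable jessie container
    docker run --rm -v "$PWD":/src -w /src debian:jessie \
        bash -c 'apt-get update -qq && apt-get install -y -qq python-pip && pip install tox && tox'
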
[23:58:14] thcipriani: I pinged you earlier but didn't see a response if you had one: have you decided how you are tagging images yet?
[23:58:25] are there docs on how to build and test images in jenkins?
[23:58:46] I wrote up initial docs
[23:58:47] legoktm: that's manual for now
[23:58:48] * thcipriani digs
[23:58:54] but yeah, it's pretty manual
[23:58:56] Although there is a script
[23:59:06] https://www.mediawiki.org/wiki/Continuous_integration/Docker
[23:59:53] That still talks about .build files, which we don't use any more