[00:01:14] phe: Things are stable now; I can reasonably start it now.
[00:03:47] Coren, many thanks
[00:04:03] YuviPanda: thanks
[00:05:51] Coren, will the permissions be fixed? actually I see in /data/project/phetools/cache/hocr/
[00:05:57] drwx--S--- 2 root tools.phetools 4096 Apr 2 00:03 02/
[00:06:16] phe: Yes, on a directory by directory basis.
[00:17:15] Coren: I’m switching webservice2 to webservice now...
[00:23:53] 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4, 3ToolLabs-Q4-Sprint-1: Make webservice2 default webservice implementation - https://phabricator.wikimedia.org/T90855#1172637 (10yuvipanda) Boom, and this is done :D
[00:38:00] got an email saying: 2015-04-02 00:35:55 error: /usr/local/bin/webservice2 --release precise lighttpd start
[00:38:11] anybody know what it means?
[00:38:35] SMalyshev: i got the same. something is amiss.
[00:39:21] afeder: SMalyshev yeah, that’s my perl patch gone bad, I think...
[00:39:22] looking
[00:40:36] afeder: SMalyshev nothing’s going wrong, though, so it's ok.
[00:40:56] alright, thx
[00:44:05] YuviPanda: sorry for disappearing yesterday. My ISP decided to drop the connection for maintenance. :)
[00:44:36] I'll keep monitoring and see if the issue still happens once the dust settles.
[00:45:07] (This is the periodic tool downtime we discussed.)
[00:48:16] Pathoschild: yeah, I remember :)
[00:52:41] Coren: my earlier bigbrother change I pointed you to apparently broke it...
[00:52:48] * YuviPanda just did a partial revert
[00:53:20] * YuviPanda puts service manifests on the top of his list after this is done
[00:54:56] all good now
[01:05:44] stupid error on my part...
[01:15:00] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make webservice default to trusty on toollabs - https://phabricator.wikimedia.org/T94788#1172701 (10yuvipanda) 3NEW
[01:20:22] 10Tool-Labs: Phase out precise instances from toollabs - https://phabricator.wikimedia.org/T94790#1172717 (10yuvipanda) 3NEW
[01:20:34] 10Tool-Labs: Phase out precise instances from toollabs - https://phabricator.wikimedia.org/T94790#1172724 (10yuvipanda) p:5Triage>3Normal
[01:21:31] 10Tool-Labs: Move tools-master and tools-shadow to trusty - https://phabricator.wikimedia.org/T94791#1172731 (10yuvipanda) 3NEW
[01:22:48] 10Tool-Labs: Phase out precise instances from toollabs - https://phabricator.wikimedia.org/T94790#1172737 (10yuvipanda)
[01:22:50] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make tools-login / bastion hosts redundant and move them to trusty - https://phabricator.wikimedia.org/T91863#1172738 (10yuvipanda)
[01:22:56] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make webservice default to trusty on toollabs - https://phabricator.wikimedia.org/T94788#1172740 (10yuvipanda)
[01:22:58] 10Tool-Labs: Phase out precise instances from toollabs - https://phabricator.wikimedia.org/T94790#1172717 (10yuvipanda)
[01:23:44] 10Tool-Labs: Make jsub / qsub default to trusty instances - https://phabricator.wikimedia.org/T94792#1172741 (10yuvipanda) 3NEW
[01:25:02] !log tools created tools-bastion-02
[01:25:08] Logged the message, Master
[02:00:33] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0]
[02:05:45] RECOVERY - Puppet failure on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:21:01] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:43:33] 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1172810 (10Springle) Yes, the recovery plan can be as simple as put-the-old-config-back. Since there is no replication slave (right?) for tools-db, we should dump the data first.
[02:43:34] 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1172811 (10coren) >>! In T94643#1172810, @Springle wrote: > Since there is no replication slave (right? That's Phase II. :-)
[02:43:34] 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1172812 (10coren) Should we schedule this tentatively for Thursday next?
[02:44:28] 10Tool-Labs: Move tools-master and tools-shadow to trusty - https://phabricator.wikimedia.org/T94791#1172813 (10coren) Thankfully, that shouldn't be /too/ hard thanks to the master/shadow setup. Reinstalling shadow as Trusty, then forcibly switching over to test seems like the simplest approach.
[02:45:55] I'm looking at it now.
[02:53:02] Coren: webservice2 is the default now \o/
[02:53:02] And all seems well
[02:53:16] Something is going oddly with labstore1001 for no visible reason. All IO has stopped, it seems.
[03:03:29] But nothing is complaining.
[03:03:30] (On the box)
[03:03:30] I can't do "webservice stop"? https://gist.github.com/afeder/44aaddc2e45813f5e004
[03:09:09] tools-login seems to be down or not responding
[03:18:27] it's very very slow to log in rather, same thing for any nfs io
[03:18:27] There's an issue I'm working on right now.
[03:18:29] Something is consuming pretty much all the disk bandwidth and I'm trying to track it down.
[03:18:29] Oh
[03:18:31] that's why this isn't loading...
[03:18:36] hope you're not spamming f5 Negative24 :)
[03:18:38] Nah it's not even loading
[03:18:38] well at least my instance isn't impacted
[03:18:39] Coren: you’re on top of this?
[03:18:39] * Magog_the_Ogre is glad it isn't him
[03:18:39] I can’t log in… anywhere
[03:18:39] andrewbogott: I'm looking at it, can't say I'm on top of it yet.
[03:18:39] ok
[03:18:40] andrewbogott: You getting a ssh_exchange_identification error?
[03:18:40] Coren: are you confident it’s an NFS issue, or should I look for other explanations?
[03:18:40] Nah, things just hang for me.
[03:18:40] It's not NFS proper. The box is crumbling under IO from no immediately discernable source. NFS works fine, but it's starved out of disk bandwidth.
[03:18:41] Have we even had a postmortem for the last outage?
[03:18:41] I managed to log into tools-login, but it took nearly 5 minutes.
[03:18:41] Feels like floppies.
[03:18:41] ohai I just logged on
[03:20:54] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 696263 bytes in 3.117 second response time
[03:21:28] Coren: I restarted nova-network, I thought unrelatedly. But things are snappy for me now
[03:22:29] Hm. What?
[03:22:36] * Coren attempts to find a possible link
[03:23:23] Looks good now
[03:23:42] Yeah, I see normal activity again. A bit high - but to be expected as things unclog.
[03:24:44] But how in Baal's name could a network issue cause silly disk load like I was seeing?
[03:48:56] The server has been flaky since the switch.
[05:13:02] Now I’m back to not being able to reach anything
[05:13:03] Coren: Can't you change the update interval to something shorter?
[05:13:03] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100%
[05:13:04] guessing that's why?
[05:13:05] Yeah, I forget how long it takes these damn things to boot
[05:13:10] andrewbogott: Ah. Looks like it was a hardware issue. Drive resync.
[05:13:11] * Coren panics vaguely.
[05:13:11] dmesg
[05:13:12] wait, you mean labstore1001 isn’t coming up?
[05:13:13] It is, but has flaky access to one of the shelves.
[05:13:14] Firing up 1002
[05:13:15] yikes
[05:13:15] * andrewbogott stands back
[05:13:16] Not yet, I'm trying a full powercycle first.
[05:13:17] paranoid
[05:13:17] * Coren wishes he could powercycle the shelves.
[05:13:17] Good sign: bios sees 72 drives.
[05:13:18] If a drive fell off the map, would it cause that symptom? Crazy high iowaits?
[05:13:18] If the box had trouble communicating with a shelf it would.
[05:13:20] Alright. It's seeing the raids again, and rebuilding.
[05:13:20] Yeay
[05:13:21] I'll be able to recover everything. A few more minutes should do it - I'm doing it slowly and step by step.
[05:13:21] $ ssh tools-login.wmflabs.org
[05:13:21] Connection to 208.80.155.130 timed out while waiting to read
[05:13:22] Krinkle: hardware fail on labstore. Recovering now.
[05:13:22] Ok I understand there are issues ... https://tools.wmflabs.org/magnustools/multistatus.html shows three of them
[05:13:22] andrewbogott: If you're still up: current status is hardware is visible again, ext4 is currently doing the checking-journal-and-inconsistencies-pretending-it's-not-an-fsck-thing.
[05:13:23] Coren: still up, for a bit. I think I’ll wait until morning to completely understand what went wrong though.
[05:13:23] I don't know how long this will take. I better go grab me some coffee
[05:13:23] The short of it: one of the shelves stopped talking to the server, though it started again for a while then hung again.
[05:13:23] So everything was stalled waiting for it.
[05:13:24] Rebooting had it still not talking; power cycling had it reappear.
[05:13:24] I'm still not able to reach https://tools.wmflabs.org/, or ssh://tools-login.wmflabs.org, ssh://integration-slave1410.eqiad.wmflabs
[05:13:24] I am aware.
[05:13:25] [ 3619.985537] EXT4-fs (dm-9): recovery complete
[05:13:25] *finally*
[05:13:25] woo!
[05:13:26] restarting NFS
[05:13:26] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 403905 bytes in 9.561 second response time
[05:13:26] Things seem to be sort of working…
[05:13:27] It'll take some time for everything to recover.
[05:13:38] I think I will go to bed now, while hope still springs.
[05:14:03] I hope you’re able to follow suit shortly!
[05:14:14] I don't think there's much more to do at this point.
[05:14:33] assuming jobs start back up and such...
[05:14:38] But now I have severe trust issues with the hardware and/or wiring.
[05:14:43] yeah :(
[05:15:35] ok, good night — I will sleep with fingers crossed.
[05:15:39] We probably found our cause for last night's mysterious outage too.
[05:15:42] o/
[05:24:04] GerardM-: Good news, multistatus shows everything recovered nicely.
[05:25:42] Coren ... thanks ... you guys have been unlucky lately
[05:26:03] In all probability, that same issue is at the root of at least the last two outages.
[05:26:15] I'll have chris go and inspect the hardware in the DC tomorrow.
[05:27:49] I must admit that the filesystem mount that took 30 minutes got me worried for a while - especially since it was completely quiet while working.
[05:59:53] Wee, cron emailed me just about every error there could be :p
[06:32:32] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0]
[06:38:18] FYI the system is more stable ... errors I had in the last two days are now gone
[06:40:39] hmmm spoken too hastily
[06:49:53] PROBLEM - Puppet failure on tools-exec-05 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[06:54:38] The puppet issues are not relevant.
[06:55:06] storage appears to be reliable; and now I desperately need sleep.
[06:55:14] o/
[07:02:36] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0]
[07:14:51] RECOVERY - Puppet failure on tools-exec-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:40:18] 10Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1173062 (10Thgoiter) [[ http://tools.freeside.sk/monitor/http-kmlexport.html]] looks quite bad for this morning: 4 breakdowns between 1 and 7 am (UTC+2), two of them for more than one hour. Thanks in advance for your effort.
[09:32:01] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0]
[09:45:28] 6Labs, 6operations, 10ops-codfw: rack and connect labstore-array4-codfw in codfw - https://phabricator.wikimedia.org/T93215#1173279 (10fgiunchedi) p:5Triage>3Normal
[10:42:09] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:33:21] mmhh anyone else getting 'Failed to modify instance' when trying to add classes to a vm on wikitech?
[12:23:22] godog: I’ll try.
[12:24:06] on "webservice start", I get "(network.c.358) can't bind to port: 14030 Address already in use" - what could be the problem?
[12:28:07] andrewbogott: any ideas? ^
[12:28:33] afeder: no, sorry. We’ll have to wait for yuvi.
[12:28:44] ok, thx
[12:36:00] andrewbogott: Back after some sleep.
[12:36:18] Coren: maybe you can help afeder ^ ?
[12:36:23] Things didn't go asplodey. Yeay.
[12:36:51] afeder: I'll be with you in a few minutes once I've gotten some caffeine. :-)
[12:36:59] Coren: great
[12:38:11] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[12:40:54] can't dns resolve anything under wmflabs.org
[12:41:20] OH hey, that's new.
[12:41:32] * Coren looks into it.
[12:41:40] andrewbogott: Are you working on DNS right this minute?
[12:41:54] Coren: no
[12:42:04] I’m looking at opendj which seems… extra busy
[12:42:28] which, y’know, opendj backs pdns which serves .wmflabs.org
[12:42:30] so probably the same issue
[12:43:43] I can confirm no name resolution from the outside; but in the worst possible way (no answer, but also no errors)
[12:43:54] labs-ns1 works though.
[12:45:11] Resolution from the inside works.
[12:45:28] Maybe labs-ns0 is ill and that's what's hammering on ldap?
[12:45:46] it looks like it's sudoer checks that are hammering.
[12:45:51] But I’ll restart pdns anyway
[12:46:44] better?
[12:46:58] No. ns0 still gives no responses to queries.
[12:46:59] Oh, this is my mistake I think, messing with ldap always messes with pdns as well :(
[12:47:29] ns1 works, but because ns0 gives NOERROR nothing falls back to it.
[12:48:32] ok, one second...
[12:51:24] There we go.
[12:51:33] If you did something, that was the right thing.
[12:51:43] Yeah, restarted everything :)
[12:51:49] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[12:52:03] It’s impossible to remember that changing anything with opendj causes pdns (running elsewhere) to fall to pieces
[12:52:09] well, impossible for /me/ to remember
[12:52:23] Now… who is dos’ing ldap?
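
A minimal sketch of the diagnosis above (ns0 answering NOERROR with an empty answer section, so resolvers never fall back to ns1), querying each labs nameserver directly. The fully qualified hostnames are assumptions; only "labs-ns0" and "labs-ns1" appear in the conversation:

    # Ask both labs nameservers for the same record and compare the replies.
    for ns in labs-ns0.wikimedia.org labs-ns1.wikimedia.org; do
        echo "== $ns =="
        dig @"$ns" tools.wmflabs.org A +time=3 +tries=1 +noall +comments +answer
    done
    # The broken case is status NOERROR with no answer section from ns0;
    # resolvers treat that as a valid reply, so they never try ns1.
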
[12:52:34] jzerebecki: You should get DNS again
[12:56:00] Coren: did anything change with diamond lately? Like maybe it’s running 10x as often as it did yesterday?
[12:56:13] andrewbogott: Nothing I did.
[12:57:05] afeder: Try again?
[12:57:13] hm, it’s hard to tell how long this has been happening. The logs are so busy that they’re rotated out every few minutes and deleted after 4 hours :(
[12:57:13] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0]
[12:57:16] Coren: works, thx
[12:57:45] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0]
[12:57:48] andrewbogott: They are perhaps overly verbose?
[12:58:58] opendj shows diamond sudoing about 10 times/second
[12:59:16] That shouldn’t be enough to break ldap though, maybe I’m mistaken about what’s going wrong
[12:59:17] andrewbogott: From what host?
[13:01:40] several… maybe many
[13:01:46] andrewbogott: is diamond getting restarted in a loop perhaps?
[13:01:51] It’s hard to parse, the sudoer check and the hostname aren’t on the same line
[13:03:00] if y’all want to look at neptunium:/var/opendj/instance/logs/access you can see what I’m talking about
[13:03:07] note that I don’t know for a fact that this is unusual
[13:03:21] I’m just trying to figure out why ldap is unresponsive to wikitech, sometimes
[13:04:10] looks like once per minute per involved instance
[13:05:03] That would be consistent with normal behaviour. It would also be atrocious that diamond shells out to sudo every run and thus ends up having dependencies on pretty much every subsystem. :-(
[13:05:20] I think it does, for some of those checks
[13:05:26] Ew.
[13:05:26] yeah some collectors sudo
[13:05:39] hm, so maybe this is a red herring, but it sure makes it hard to watch that log for the things I care about :(
[13:05:52] yes, ew :)
[13:05:57] nrpe too
[13:06:15] people use sudo for things that should really use privilege-separated daemons or carefully designed suid executables :(
[13:06:19] I take it diamond can't be configured to use capabilities instead?
[13:06:50] well it's a python program, that doesn't work very well
[13:07:01] Ah, true.
[13:07:05] what the collectors usually need is cap_sys_admin too
[13:07:19] That's... not helpful. :_)
[13:07:25] I know :/
[13:07:39] andrewbogott: At any rate, those sudos would be normal so I wouldn't expect any change to have come from that.
[13:07:50] yep
[13:08:02] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0]
[13:08:18] how is diamond's sudo configured in labs?
[13:08:20] via ldap?
[13:08:47] sudo-ldap still reads /etc/sudoers, no?
[13:08:52] paravoid: Well, a plausible fix would be a tiny sudo wrapper with an explicit whitelist that is explicitly limited to diamond and used instead of sudo. It wouldn't be a security improvement, but it'd reduce the brittleness and number of dependencies.
[13:09:16] paravoid: It does, but I think it builds a conjoined list from both then checks /that/.
[13:09:25] eww
[13:09:50] s/tiny sudo wrapper/tyny suid wrapper/
[13:10:16] Blah typing. Not enough sleep.
[13:11:38] paravoid: That said, it was a guess. I guess it's easy enough to see if a local sudoers match still hits ldap.
[13:12:32] local sudoers on labs has 'root ALL=(ALL) ALL' so I can just strace a sudo from root and see if it does.
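
A rough sketch of the check Coren describes, run as root on a labs instance; it is an illustration rather than the exact command used, and the trace file path is arbitrary:

    # Trace only network syscalls made by a sudo call matched by the local
    # /etc/sudoers root entry, then look for an LDAP (port 389) connection.
    strace -f -e trace=network -o /tmp/sudo-ldap-trace.log sudo -n true
    grep 'htons(389)' /tmp/sudo-ldap-trace.log
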
[13:13:28] yes thx wmflabs.org dns works again for me
[13:14:09] connect(6, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("208.80.154.6")}, 16) = 0
[13:14:09] Followed by the predictable write(), poll() etc. :-(
[13:14:25] So yeah, even a local sudoers match hits ldap.
[13:16:07] paravoid: "When multiple entries match for a user, they are applied in order. Where there are multiple matches, the last match is used (which is not necessarily the most specific match)." So yeah, that pretty much makes checking LDAP mandatory.
[13:16:26] b;ergh
[13:19:35] (Also, I don't want to know what exactly sudo-ldap does for 'order' in LDAP)
[13:20:58] andrewbogott: [02/Apr/2015:13:13:19 +0000] MODIFY REQ conn=31223 op=6 msgID=7 dn="dc=i-0000081c.eqiad.wmflabs,ou=hosts,dc=wikimedia,dc=org"
[13:21:01] [02/Apr/2015:13:13:19 +0000] MODIFY RES conn=31223 op=6 msgID=7 result=20 message="The provided LDAP attribute puppetvar contains duplicate values" etime=0
[13:21:24] andrewbogott: 81c being filippo-test-trusty
[13:21:47] godog: that could be it, although I don’t know why it’s intermittent
[13:22:08] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:22:36] andrewbogott: curious, is it intermittent for you? on 'monitoring' project it fails for me every time
[13:22:48] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:22:56] Yeah, I just now succeeded in checking and unchecking a class.
[13:23:04] But ok, maybe it always fails w/variable changes?
[13:23:07] * andrewbogott looks at the code
[13:32:26] hi, I'd like to set up a labs instance to experiment with puppet, shall I request one through https://phabricator.wikimedia.org/T76375 or should I create it myself (if I have the necessary privs)? (in the latter case I could also use a pointer on how to do that :-)
[13:33:08] moritzm: You can create instances in any project you administer, provided you have the quota left.
[13:33:18] godog: found it.
[13:33:51] moritzm: I can add you to the ‘puppet’ project — that's generally the place for screwing around with puppet stuff.
[13:34:30] moritzm: what is your wikitech username?
[13:34:55] andrewbogott: sweet! what was it?
[13:35:07] andrewbogott: "Moritz Mühlenhoff" (sorry for the umlaut, when registering I wasn't aware it was used as a general purpose login :-)
[13:35:44] I clearly should have used Φαίδων :P
[13:35:46] The setting I added, ‘use_dnsmasq’ is both a default for new instances and a configurable setting. Apparently that never happened before. There’s a bit of code here which (apparently just to be sure) re-sets all defaults any time an instance is modified.
[13:35:54] That seems wrong anyway, I’ll probably just yank that section.
[13:37:27] omg, wikitech just gave me a useful failure message
[13:37:40] andrewbogott: pics or it didn't happen.
[13:37:54] Coren any plan to restart the copy/rsync or whatever it was of /data/project/phetools/cache/ ?
[13:38:07] moritzm: you’re now an admin of the puppet project, you should be able to create an instance.
[13:38:21] phe: After our DC guys have looked at the hardware.
[13:38:33] godog: try now?
[13:40:01] phe: Sorry about the timing and bad luck.
[13:41:08] Coren: http://bogott.net/misc/ss.png
[13:41:22] lulz
[13:41:34] Coren: can you have a look at https://phabricator.wikimedia.org/T94811
[13:41:36] andrewbogott: thanks, I'll give it a try
[13:42:11] petan: I can take a quick look to see if I can find anything obvious.
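
For the "puppetvar contains duplicate values" failure quoted above, a diagnostic along these lines could confirm the duplicated attribute. This is a hedged sketch: it assumes ldapsearch is run from a host whose /etc/ldap/ldap.conf already points at the labs LDAP server and that the entry is readable with simple binds.

    # List the puppetvar values on the host entry named in the MODIFY error
    # and print any value that occurs more than once.
    ldapsearch -x -LLL \
        -b 'dc=i-0000081c.eqiad.wmflabs,ou=hosts,dc=wikimedia,dc=org' puppetvar \
      | grep '^puppetvar:' | sort | uniq -d
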
[13:42:37] it worked fine until that NFS outage
[13:43:04] andrewbogott: aye, that worked!
[13:43:04] I'm not sure how likely it is to be directly related, except for the fact that the webservice was probably restarted at that time.
[13:44:27] petan: If nothing else, it's not that php doesn't work anymore per se since index.php is definitely still being run as php
[13:44:53] which index.php?
[13:44:59] that one in wm-bot's folder is not
[13:45:03] I see source code in browser
[13:45:05] petan: The one at the root with just the meta refresh
[13:46:29] Ah no, it's sent out as source too - it's just that the source happens to be only html.
[13:46:31] Nevermind. :-)
[13:46:32] Coren, godog: https://gerrit.wikimedia.org/r/#/c/201461/1
[13:47:29] andrewbogott: on https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet I appear under members, but not under admins, does the change need a bit to take effect?
[13:48:49] moritzm: those Nova_Resource pages are inscrutable. Best to look at the actively-generated info in the ‘Manage XXX’ links in the sidebar.
[13:48:58] petan: Just restarting the webservice did the trick, afaict. It looks like it got in a broken state on its first restart.
[13:49:08] Nova_Resource pages are useful for gathering stats and such, but seem to only sync up rarely.
[13:50:10] andrewbogott: thanks, will keep that in mind
[13:51:07] petan: BTW, I suggest that you put files containing credentials outside of public_html if you can as best practice
[13:51:22] mhm ok
[13:52:15] andrewbogott: +1 even though http://i.imgur.com/xVyoSl.jpg
[13:52:26] godog: thanks :)
[13:53:05] I got the business during SWAT yesterday for self-merging a mediawiki patch so I’m trying to be polite :)
[13:53:18] hehe
[13:57:56] moritzm: typically folks find accessing instances harder than creating them :) In case you’re baffled about that, here are some docs: https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_ProxyCommand_ssh_option_.28recommended.29
[14:02:23] andrewbogott: thanks
[14:28:22] 10Wikimedia-Labs-General, 6operations: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1173719 (10fgiunchedi) 3NEW
[14:29:53] Coren: mind +2’ing that patch so I can get it in for SWAT?
[14:30:31] andrewbogott: Hmm, sure. I'm more used to our "you +2 it you merge it" workflow. :-)
[14:30:39] Coren: me too
[14:30:44] um… +2 but don’t submit, I think.
[14:31:20] Yeah - I actually had to catch myself from doing it. :-)
[14:31:29] thanks
[14:34:17] bbiab, breakfast+shower
[14:34:48] oh man, breakfast
[14:50:56] Yeay! Snow.
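
The fix Coren applied for petan's tool at 13:48 ("Just restarting the webservice did the trick") amounts to something like the following, run from tools-login; the tool name is a placeholder and the log file check is optional:

    become mytool        # switch to the tool account (hypothetical tool name)
    webservice stop
    webservice start     # webservice2 is now the default implementation behind this
    tail error.log       # lighttpd's error log in the tool home, if one is written there
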
[14:56:51] 6Labs, 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration: Jessie puppet self instance has puppet erroring out ( Could not find dependency File[/etc/ldap/ldap.conf] ) - https://phabricator.wikimedia.org/T94840#1174061 (10hashar) 3NEW
[14:58:38] 6Labs, 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration: Jessie puppet self instance has puppet erroring out ( Could not find dependency File[/etc/ldap/ldap.conf] ) - https://phabricator.wikimedia.org/T94840#1174061 (10hashar)
[14:59:13] 6Labs, 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration: Jessie puppet self instance has puppet erroring out ( Could not find dependency File[/etc/ldap/ldap.conf] ) - https://phabricator.wikimedia.org/T94840#1174077 (10hashar)
[14:59:16] 10Wikimedia-Labs-General, 6operations: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174079 (10hashar)
[15:01:43] 10Wikimedia-Labs-General, 6operations: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174093 (10hashar)
[15:06:13] 10Wikimedia-Labs-General, 6operations: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174109 (10hashar) modules/puppet/manifests/self/client.pp has: ``` # We'd best be sure that our ldap config is set up properly # before puppet goes to work, though. clas...
[15:07:34] moritzm: and here is my ssh config file for labs / prod https://phabricator.wikimedia.org/P281
[15:07:56] moritzm: it should have all the tricks to get you connected to bastions for prod and labs
[15:08:04] though ops use a different bastion for prod :D
[15:10:35] 10Wikimedia-Labs-General, 6operations: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174124 (10Joe) what filippo is reporting is that at least one of those files is not defined in puppet by default, so that class fails, even if the file is somehow managed (like, in...
[15:15:38] hashar: thanks
[15:18:43] off for the rest of the *waves*
[15:34:47] godog: your puppet-self instance should be fixed, although the ultimate cause is not. Gotta write some code
[15:36:53] andrewbogott: ow, what was it?
[15:37:27] A side-effect of https://gerrit.wikimedia.org/r/#/c/201461/, not sure why yet
[15:39:45] waaaaat wow I didn't expect that
[15:40:17] but again puppet loves to take me by surprise
[15:43:07] andrewbogott: So, the news to date: everything is peachy-dory and Chris can see nothing wrong. Things would have been perfect were it not for the whole shelf mysteriously not talking to the servers for a while. *grumble, grumble*
[15:43:25] dang
[15:43:47] godog: I think that existing settings are only preserved /if/ they are visible in the gui
[15:44:42] andrewbogott: ah! so they got cleared
[15:44:52] yeah :(
[15:45:07] But my test case didn’t, because it’s in the gui
[16:40:40] 6Labs, 10Wikimedia-Labs-wikitech-interface: OpenStackManager 'configure instance' clears any settings not visible in the configure GUI. - https://phabricator.wikimedia.org/T94851#1174452 (10Andrew) 3NEW a:3Andrew
[16:41:10] 10Wikimedia-Labs-General, 6operations: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174469 (10Andrew) 5Open>3Resolved a:3Andrew This was a result of https://gerrit.wikimedia.org/r/#/c/201461/, an attempt to not override GUI settings with default new-instance...
[16:43:09] Coren: when you return could I get a +2 on https://gerrit.wikimedia.org/r/#/c/201296/?
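
The ProxyCommand recipe that hashar's P281 and the linked wikitech Help:Access section describe boils down to something like the following. This is a sketch under assumptions, not a copy of either source; compare hostnames and usernames against them before using it:

    # Append a labs section to ~/.ssh/config: hop through the labs bastion
    # to reach instances that only have internal .eqiad.wmflabs addresses.
    cat >> ~/.ssh/config <<'EOF'
    Host bastion.wmflabs.org
        User          your-wikitech-shell-name
        IdentityFile  ~/.ssh/id_rsa

    Host *.eqiad.wmflabs
        User          your-wikitech-shell-name
        ProxyCommand  ssh -W %h:%p bastion.wmflabs.org
    EOF
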
[16:43:25] andrewbogott: I'm here, doing the incident report.
[16:43:36] * Coren checks.
[16:43:38] ok. No rush on that review
[16:43:50] Indeed, I may never run it again. Just want it checked in for posterity
[16:57:28] 10MediaWiki-extensions-OpenStackManager: Edits as 127.0.0.1 - https://phabricator.wikimedia.org/T45603#1174537 (10Krenair) 5Open>3Resolved Hasn't occurred since July.
[16:58:53] "The instance region, e.g. pmtpa"
[16:58:55] :/
[17:01:52] 10Tool-Labs: Convert updatetools.pl into a puppetized Python service with monitoring - https://phabricator.wikimedia.org/T94858#1174558 (10scfc) 3NEW
[17:02:04] andrew: Anything to add to https://wikitech.wikimedia.org/wiki/Incident_documentation/20150401-LabsNFS-Overload ?
[17:02:14] andrewbogott: ^^
[17:11:17] Coren, oh...you already approved it
[17:12:00] Well, pmtpa *was* a good example once. :-) I don't think that's an issue.
[17:12:11] (But sorry I didn't notice your comment earlier)
[17:13:01] yeah the pmtpa thing I wasn't going to bring up
[17:14:17] why might you not have an OpenStackNovaHost class anyway?
[17:16:07] I'm sorry - I'm not used to the mw review cycle workflow; I hadn't realized style issues were something to look for - I was mostly +2'ing code that was known to be working because it already worked.
[17:16:26] (And which contained no glaring holes)
[17:17:08] Also that merge-without-submit thing took me by surprise. :-(
[17:18:01] Coren, I comment on all style issues and don't usually place -1s for them
[17:19:38] Coren, Jenkins merges when you CR+2 on most repositories
[17:19:49] it doesn't on... very few
[17:19:59] 99% of my gerrit work is on puppet which is clearly one of those. :-)
[17:33:40] Coren: the outage report looks good to me
[17:41:18] 6Labs: Add a second pdns/mysql server - https://phabricator.wikimedia.org/T94865#1174728 (10Andrew) 3NEW a:3Andrew
[17:45:19] 6Labs: /etc/ssh/userkeys/ubuntu puppet notices on labs instances - https://phabricator.wikimedia.org/T94866#1174750 (10Andrew) 3NEW a:3yuvipanda
[17:45:53] https://tools.wmflabs.org/gerrit-patch-uploader/ needs restart
[17:46:45] Nemo_bis: Restarted
[17:46:56] Thanks
[18:37:31] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:43:53] Coren: andrewbogott wow lots of excitement yesterday night...
[18:44:04] It was lots of "fun"
[18:44:06] I was moving into a house and didn't check IRC. Sorry.
[18:44:08] too much!
[18:44:44] But it looks like we found a possible cause for the stalls?
[18:45:08] There's no certainty, but the identical symptoms sure do point an accusing finger.
[18:45:22] Hmm right
[18:45:28] we are still on precise right?
[18:45:35] Yes.
[18:45:56] I'm also curious if thin volumes could be a problem too. We have had these stalls only since we started using those
[18:46:05] I didn't want to switch if I could avoid it at all. Adding extra variables when you have a problem is teh suck.
[18:46:11] But we will find out if that is the case if this recurs
[18:46:13] Yeah totally
[18:46:44] YuviPanda: I'm 100% sure it isn't the thin volumes - there is no possible way they would cause a specific set of drives to stop their IO.
[18:47:02] Oh true.
[18:47:24] Of course, we have no idea _why_ that shelf stopped behaving.
[18:47:45] But since a soft reboot didn't fix it but a hard reboot did, it points at something at or below firmware
[18:48:21] Right
[18:48:39] Did we get the replacement shelf? I remember that was talked about a long time ago
[18:48:53] We do; it's nearby in the DC
[18:49:10] did*
[18:50:04] Ah cool
[18:50:48] I've posted the incident report. There was this really bizarre experience when Andrew restarted nova-network and things started working again. :-)
[19:06:16] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1175128 (10Multichill) 3NEW
[19:07:24] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1175140 (10Multichill)
[19:07:26] 6Labs, 10Tool-Labs, 7Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1175139 (10Multichill)
[19:14:03] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1175175 (10Anomie) Is [[https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Grid#Bigbrother|bigbrother]] failing, or is it not being used?
[19:15:50] * Coren goes to fetch moar caffeine.
[19:17:09] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1175198 (10jeremyb-phone)
[19:17:37] Hey, did anyone lose a file named 'core', about 400 megabytes? cuz I found it in my directory.
[19:19:01] pardon me.
[19:22:34] Heh. It's probably one of your jobs' remains. What's it a corefile of?
[19:25:23] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1175219 (10Multichill) >>! In T94883#1175175, @Anomie wrote: > Is [[https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Grid#Bigbrother|bigbrother]] failing, or is it not being u...
[19:28:17] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1175221 (10coren) >>! In T94883#1175219, @Multichill wrote: > A user shouldn't be bothered with this. Setting up a second piece of software that might be able to restart the web...
[19:34:31] What do I put in my ~/.bigbrotherrc to make it start lighttpd on a trusty grid node? Using `webservice2 --release trusty lighttpd` reports an error.
[20:13:26] [13nagf] 15Krinkle pushed 2 new commits to 06master: 02https://github.com/wikimedia/nagf/compare/950dfd6689d0...b91fb1622132
[20:13:26] 13nagf/06master 14214e89b 15Timo Tijhof: NagfView: Implement $statusCode parameter for error() page
[20:13:26] 13nagf/06master 14b91fb16 15Timo Tijhof: NagfView: Implement getStatusPage() and use 404 for non-existent projects...
[20:15:18] wikimedia/nagf#24 (master - b91fb16: Timo Tijhof) The build passed. - http://travis-ci.org/wikimedia/nagf/builds/56941700
[20:16:27] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1175358 (10Multichill) >>! In T94883#1175221, @coren wrote: >>>! In T94883#1175219, @Multichill wrote: >> A user shouldn't be bothered with this. Setting up a second piece of so...
[20:31:59] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1175415 (10yuvipanda) I'm going to agree that bigbrother should not exist. If a tool goes OOM while serving a particular request, then *that* request alone should fail, and the...
[20:32:19] bd808: just ‘webservice --release trusty’
[20:32:22] * YuviPanda grumbles about regexes
[20:32:45] the fact that it is almost but not exactly the same thing you could use on the commandline makes it even worse
[20:32:59] there is that
[20:33:11] which is why I tried the command line in there
[20:35:55] bd808: yup. and webservice2 and webservice used to be different until yesterday...
[20:36:03] heh
[20:36:06] and now they’re the same file, although webservice start starts it in lighty.
[20:39:11] YuviPanda: so bigbrother is saying that it is restarting now and not logging an error, but the service is not actually starting
[20:39:25] bd808: which tool is this?
[20:39:36] oh wait... it started
[20:39:51] it logged "Restarting" and sent me an email
[20:40:06] and then ~60s later it actually did start it
[20:40:16] with another line in the log file
[20:40:29] tool is "bash"
[20:40:41] so it said restarting on its first ‘round’ (it does a round every minute, I think) and then didn’t actually do that and then did it on the second round...
[20:40:41] which does nothing of value at all yet
[20:40:59] could be
[20:41:16] sigh. I hope to have a replacement for it by the end of this month
[20:41:25] it’s conceptually different than what we should have had, IMO
[20:41:36] bigbrother is teh suk. It's a thing I slapdashed together in an hour to "fix" (aka kludge around) tools running out of memory.
[20:42:17] we can just move to jessie and put everything under the control of systemd right ;)
[20:42:54] That... is an even worse idea. :-)
[20:43:17] Yuvi's concept of a service manifest that joins all of this is the way to go.
[20:44:54] heroku style!
[20:45:04] yup
[20:45:14] I actually wonder if I should make it fully compatible and call it a procfile
[20:45:36] I was looking at some foos heroku knockoffs the other day
[20:45:45] *foss
[20:45:54] YuviPanda: Well, if you'd want to go that way I'd rather we considered making heroku understand gridengine and contribute that upstream instead.
[20:46:09] I think GridEngine should be left way back in the past :)
[20:46:13] Well, open source implementations.
[20:46:28] YuviPanda: That is a pipe dream. Let's stay on earth for now.
[20:46:43] well, I’d be sorely disappointed if toollabs was still running GridEngine in 5 years’ time.
[20:47:03] let’s see.
[20:47:08] I sure as hell hope we'll have something to replace bigbrother before the year is out, let alone 5 years from now! :-)
[20:47:16] https://github.com/deis/deis
[20:47:45] Coren: oh totally, bigbrother and co should be dead by end of month
[20:47:57] Anyways, I'm exhausted and burned out. I need actual rest.
[20:48:11] Coren: <3 yes! thanks for taking care of the outage yesterday night
[20:48:22] I'll be keeping an eye on labstore for a while still, but will do that out of band. :-)
[20:48:26] o/
[20:48:39] bd808: yeah, haven’t managed to play with those a lot primarily because of the coreos / docker requirement. Plus I haven’t had the time...
[20:52:29] you guys are experimental
[20:54:28] Coren: I never had a file named 'core'
[20:58:38] YuviPanda: what is the grid engine replaced with then?
[20:58:47] gifti: ask me in 4 years?
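
Concretely, the answer to bd808's ~/.bigbrotherrc question above amounts to putting that single webservice command (not the webservice2 form that errored) in the file, for example:

    # Run as the tool user; bigbrother re-runs this command when it notices
    # the tool's webservice has gone away.
    echo 'webservice --release trusty' > ~/.bigbrotherrc
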
[20:58:51] haha
[21:03:35] 6Labs, 3ToolLabs-Q4-Sprint-1: Comprehensive monitoring / alerting for labstore* instances - https://phabricator.wikimedia.org/T94606#1175607 (10yuvipanda)
[21:04:16] 6Labs, 5Patch-For-Review, 3ToolLabs-Q4-Sprint-1: Comprehensive monitoring / alerting for labstore* instances - https://phabricator.wikimedia.org/T94606#1175617 (10yuvipanda) We already have checks for network saturation as well.
[21:10:37] 10Wikimedia-Labs-General: ganglia.wmflabs.org unreacheable - https://phabricator.wikimedia.org/T64729#1175655 (10Dzahn) 5Invalid>3Open recycling ticket to report ganglia.wmflabs.org is down again. http://ganglia.wmflabs.org/ 502 Bad Gateway nginx/1.5.0
[21:14:07] 10Wikimedia-Labs-General: ganglia.wmflabs.org unreacheable - https://phabricator.wikimedia.org/T64729#1175682 (10yuvipanda) 5Open>3Invalid a:3yuvipanda Closing because we don't use ganglia at all on labs. Use graphite.wmflabs.org instead. We should kill the Domain too.
[21:22:21] 6Labs: /etc/ssh/userkeys/ubuntu notices for every puppet run on labs instances - https://phabricator.wikimedia.org/T94866#1175724 (10Krinkle)
[21:24:11] Coren_away, YuviPanda: I made the mistake of doing a git synchronize last night while NFS was being flaky. That resulted in my git installation being borked. I tried to fix it but I can't figure it out because git is hard. By chance does anyone have my old .git directory from last week sitting around?
[21:24:35] my tool name is magog fwiw
[21:24:51] Magog_the_Ogre: hmm, it’s *possible* that there’s a copy, but Coren_away’s tapped out for the day after a very intense last few days
[21:25:21] :-/
[21:27:43] are there any git experts around here?
[21:28:19] Magog_the_Ogre: so a .git should be recoverable without a backup, mostly. want me to take a look?
[21:28:28] sure
[21:28:37] 10Wikimedia-Labs-General: ganglia.wmflabs.org unreacheable - https://phabricator.wikimedia.org/T64729#1175755 (10scfc) >>! In T64729#1175682, @yuvipanda wrote: > Closing because we don't use ganglia at all on labs. Use graphite.wmflabs.org instead. > > We should kill the Domain too. Cf. T85318#1054268.
[21:28:56] I tried reinstalling git so I'm sure I made things worse, fwiw
[21:28:59] right now I'm getting fatal: You are on a branch yet to be born
[21:29:03] haha
[21:29:08] tried committing something, didn't work.
[21:29:08] Magog_the_Ogre: what’s the path?
[21:29:18] and what did you mean by ‘reinstalling git’
[21:29:35] path = /data/project/magog
[21:29:50] I mean I wiped out my old git folder and tried the same thing when I initially installed it
[21:29:51] Magog_the_Ogre: oh, your home directory itself is a git repo. ok
[21:30:27] I see what you mean
[21:31:18] Magog_the_Ogre: check under ‘ogree’ directory? I just did a fresh clone there...
[21:32:51] what did you do?
[21:32:59] is there any way to get that back into my head directory?
[21:39:57] !ping
[21:39:57] !pong
[21:39:59] bah
[21:40:13] fine, fine.
[21:42:33] Magog_the_Ogre: lost connectivity, did I miss anything?
[21:45:36] Guest2989, I assume you're Yuvi?
[21:45:44] I said what did you do?
[21:45:44] is there any way to get that back into my head directory?
[21:45:44] * werdna has quit (Ping timeout: 256 seconds)
[21:45:45] ah yes
[21:45:54] ah
[21:45:54] Magog_the_Ogre: I just did: 1. git remote -v, 2. copy the URL, 3. ‘git clone ogree’
[21:45:58] Magog_the_Ogre: so look at the folder and see what’s all missing? we can maybe do some manual magic to copy that .git back out if it doesn’t look too out of date
[21:46:23] I copied the file I updated back into the original directory
[21:46:27] seemed to work well
[21:47:23] Magog_the_Ogre: alright, so the thing to do would be 1. copy the 'old' directory into /tmp, clone again onto the original directory, and copy things around more. do you want me to do that, or will you do it yourself?
[21:48:49] heh
[21:48:52] this will do
[21:58:53] 6Labs, 6operations, 7Puppet, 7Regression: Puppet: "Package[gdb] is already declared in file modules/java/manifests/tools.pp" - https://phabricator.wikimedia.org/T94917#1175889 (10Krinkle) 3NEW a:3Krinkle
[22:03:21] YuviPandaa, I would REALLY appreciate it if you could do this. I am pretty lost right now. I made a backup directory tmpgit
[22:03:35] tmp is actually already in use by my bot
[22:03:42] either that, or tell me which files to move around
[22:03:53] Magog_the_Ogre: cool, I'm in a meeting, I'll do this when it's done? in about 30 mins?
[22:04:01] np at all
[22:04:13] I am cp -r'ing into my tmpgit directory
[22:04:30] 6Labs, 6operations, 7Puppet, 7Regression: Puppet: "Package[gdb] is already declared in file modules/java/manifests/tools.pp" - https://phabricator.wikimedia.org/T94917#1175918 (10Dzahn) you would think the "if ! defined" already protects against this, but apparently not. ``` if ! defined ( Package['gd...
[22:07:55] 6Labs, 6operations, 7Puppet, 7Regression: Puppet: "Package[gdb] is already declared in file modules/java/manifests/tools.pp" - https://phabricator.wikimedia.org/T94917#1175939 (10Dzahn) https://groups.google.com/forum/#!topic/puppet-users/OF_WYN41dMI
[22:45:23] YuviPandaa: Have you had a change to look at the proxy nginx?
[22:45:28] *chance
[22:47:05] YuviPandaa: and why is your nick different?
[23:14:44] !ops
[23:16:09] Barras2: can you do it over in #wikimedia-operations as well, plz?
[23:16:51] Barras: ^
[23:17:07] greg-g: No ops there, but I will get it :)
[23:17:30] Barras: thanks muchly :)
[23:19:56] greg-g: No staff our when they are needed 9.9
[23:34:10] 10Tool-Labs: Configure web services in such a way that users don't have to (re)start it ever - https://phabricator.wikimedia.org/T94883#1176302 (10yuvipanda) I'm tempted to close this as a duplicate of T90561
[23:36:12] 6Labs, 5Patch-For-Review, 3ToolLabs-Q4-Sprint-1: Comprehensive monitoring / alerting for labstore* instances - https://phabricator.wikimedia.org/T94606#1176311 (10yuvipanda) We should probably make these paging as well, though.
[23:36:35] 6Labs, 5Patch-For-Review, 3ToolLabs-Q4-Sprint-1: Comprehensive monitoring / alerting for labstore* instances - https://phabricator.wikimedia.org/T94606#1176313 (10yuvipanda) Also, I'm wondering if there should be *all* graphite alerts, or we should have active alerts as well. hmm.
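
A sketch of the .git recovery YuviPanda outlines at 21:47 (copy the broken tree aside, re-clone, adopt the fresh repository metadata). The path comes from the conversation, the remote URL is a placeholder, and this is an illustration under those assumptions rather than the exact commands that were run:

    cd /data/project/magog
    git remote -v                       # note the origin URL before touching anything
    cp -a . /tmp/magog-backup           # keep the damaged tree around, just in case
    git clone <origin-url> /tmp/magog-fresh
    rm -rf .git
    cp -a /tmp/magog-fresh/.git .git    # adopt healthy repository metadata
    git status                          # local edits now show up as ordinary modifications
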