[00:22:49] YuviPanda, we have duplicate wikibugs messages, from wikibugs and wikibugs_ both sending updates. I guess either you're testing something, or one of those accounts needs to be kicked/stopped. :)
[00:23:21] quiddity: no idea, you should ask the wikibugs maintainers :)
[00:23:28] (legoktm and valhalla)
[00:24:26] legoktm is on vacation, and val[tab] is clearly offline, so.... ;p
[00:24:31] also, you're listed at https://www.mediawiki.org/wiki/Wikibugs#Help
[00:24:38] * YuviPanda takes self out of there :D
[00:25:03] i'll email them both.
[00:25:10] quiddity: I’ll take a look this time tho
[00:25:35] okie. ping me if I should email them, otherwise I'll leave it to you
[00:26:02] wikibugs: are you here, even?
[00:26:47] quiddity: uh, there’s a secret version running somewhere I can’t find it...
[00:27:02] heh. fun!
[00:27:21] i know how devs love mysteries... >.>
[00:27:26] not today :)
[00:27:33] whois wikibugs
[00:27:37] err
[00:27:44] * quiddity gives YuviPanda a handful of /
[00:28:41] hmm, well wikibugs_ didn't rejoin #wikimedia-collaboration so that's good, at least.
[00:28:49] nor -dev
[00:29:07] (those were the only 2 i'd checked and seen duplicates flowing in)
[00:29:41] i spoke too soon. it has now rejoined.
[00:30:20] YuviPanda, i'll phile a bug.
[00:35:55] YuviPanda, tweak as needed https://phabricator.wikimedia.org/T97101
[00:36:04] quiddity: so I shut them all down
[00:38:02] quiddity: try now?
[00:38:24] quiddity: did that output anything?
[00:38:36] 10Wikibugs: wikibugs and wikibugs_ are both joining/rejoining channels and sending duplicate updates - https://phabricator.wikimedia.org/T97101#1233023 (10yuvipanda)
[00:38:41] wikibugs: back!
[00:39:04] success!
[00:39:24] ty sir :)
[00:39:28] 10Wikibugs: wikibugs and wikibugs_ are both joining/rejoining channels and sending duplicate updates - https://phabricator.wikimedia.org/T97101#1233015 (10yuvipanda) (should have a cleaner 'restart' script)
[00:39:39] 10Wikibugs: wikibugs and wikibugs_ are both joining/rejoining channels and sending duplicate updates - https://phabricator.wikimedia.org/T97101#1233025 (10yuvipanda) 5Open>3Resolved a:3yuvipanda
[00:56:14] (03PS1) 10Yuvipanda: Added LICENSE file [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206341
[00:56:34] (03CR) 10Yuvipanda: [C: 032 V: 032] Added LICENSE file [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206341 (owner: 10Yuvipanda)
[00:56:51] (03PS1) 10Yuvipanda: Add .gitreview [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206342
[00:59:13] (03CR) 10Yuvipanda: [C: 032 V: 032] Add .gitreview [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206342 (owner: 10Yuvipanda)
[02:24:40] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[02:54:41] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:29:58] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[06:59:59] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0]
[07:38:05] PROBLEM - Host tools-exec-06 is DOWN: PING CRITICAL - Packet loss = 40%, RTA = 2017.69 ms
[07:42:57] RECOVERY - Host tools-exec-06 is UP: PING OK - Packet loss = 0%, RTA = 95.93 ms
[08:07:57] FLAPPINGSTART - Host tools-exec-06 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[08:18:41] PROBLEM - SSH on tools-exec-05 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
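The Phabricator note above about wanting "a cleaner 'restart' script" hints at the underlying problem: two copies of the bot running at once. A minimal sketch of what such a restart wrapper could look like for a Tool Labs grid bot; the job name, script path and flags are assumptions for illustration, not the real wikibugs deployment:

```bash
#!/bin/bash
# Hypothetical restart wrapper for a Tool Labs grid bot.
# Job name and script path are placeholders, not the real wikibugs setup.
set -e
JOB=wikibugs

# Stop any running copy first, so two instances never send duplicate updates.
jstop "$JOB" 2>/dev/null || true

# Wait until the grid scheduler no longer lists the job.
while qstat | grep -q "$JOB"; do
    sleep 2
done

# Start exactly one fresh copy as a self-restarting continuous job.
jstart -N "$JOB" /data/project/wikibugs/bin/run-bot.sh
```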
[08:23:29] RECOVERY - SSH on tools-exec-05 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[08:37:23] PROBLEM - Host tools-exec-07 is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 4587.64 ms
[08:56:36] PROBLEM - Host tools-exec-05 is DOWN: CRITICAL - Host Unreachable (10.68.16.34)
[08:57:26] RECOVERY - Host tools-exec-05 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[09:02:19] FLAPPINGSTART - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[09:13:00] FLAPPINGSTOP - Host tools-exec-06 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[09:43:06] PROBLEM - Host tools-exec-06 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2322.80 ms
[09:47:22] PROBLEM - Host tools-exec-05 is DOWN: CRITICAL - Host Unreachable (10.68.16.34)
[09:52:57] RECOVERY - Host tools-exec-06 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[10:41:30] * Nemo_bis likes https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&c=Virtualization+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=load_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[10:42:59] FLAPPINGSTART - Host tools-exec-06 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[10:44:47] cpu_wio has huge variance across hosts https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Virtualization+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_wio&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[10:53:35] PROBLEM - Host tools-exec-04 is DOWN: CRITICAL - Host Unreachable (10.68.16.33)
[10:58:34] RECOVERY - Host tools-exec-04 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[11:01:25] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1233422 (10Shilad) As Aaron said, this software will push the limits of your largest VM (16GB). I'd feel much safer if I knew our system was sandboxed and had no possibility of affecting other software running on tools.wmflabs.org.
[11:02:21] FLAPPINGSTOP - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[11:26:19] FLAPPINGSTOP - Host tools-exec-03 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[12:01:05] FLAPPINGSTOP - Host tools-exec-01 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[12:16:09] PROBLEM - Host tools-exec-01 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2140.74 ms
[12:21:08] RECOVERY - Host tools-exec-01 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[13:07:22] PROBLEM - Host tools-exec-07 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2064.69 ms
[13:12:23] RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[13:15:54] andrewbogott_afk: Ping me the minute you're up and about, plz. :-)
[13:32:40] Coren: can we drain three exec nodes at once, or is that too big of a bite?
[13:34:17] proposed: tools-exec-07, tools-exec-10, tools-exec-11
[13:35:37] andrewbogott: If you migrate them and they only "stall" during, that's not an issue.
[13:35:54] andrewbogott: It's not much worse than the ksm stalls anyways.
[13:36:45] Coren: draining them will allow me to reboot a bit faster. Migration uses so much network IO, I can only do a few at a time. So I was hoping that while I migrate the bastions away from 1003 you could drain those hosts so that I don’t have to migrate them.
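The "drain" andrewbogott asks for here amounts to disabling the node's gridengine queues so nothing new lands on it and letting the running jobs finish. A sketch using standard SGE commands, not necessarily the exact invocations used on Tools that day:

```bash
# Disable all queue instances on the node so the scheduler stops placing jobs there.
qmod -d '*@tools-exec-07'

# Ask the scheduler to reschedule the restartable jobs onto other nodes.
qmod -r '*@tools-exec-07'

# Watch the remaining (non-restartable) jobs drain off the node.
qhost -h tools-exec-07 -j

# After the reboot, re-enable the queues ("repool" the node).
qmod -e '*@tools-exec-07'
```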
[13:37:25] FLAPPINGSTOP - Host tools-exec-05 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[13:37:30] 6Labs, 5Patch-For-Review, 7Puppet: Labs: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/gmond_memcached.py] - https://phabricator.wikimedia.org/T95107#1233591 (10scfc) The whole `has_ganglia` business feels … complicated to me. In `manifests/site.pp`...
[13:37:59] FLAPPINGSTOP - Host tools-exec-06 is UP: PING WARNING - Packet loss = 0%, RTA = 887.95 ms
[13:41:55] Oh! Sorry - you meant drain them vm-side!
[13:41:58] Yeah, can do.
[13:42:21] Worst case is a couple jobs get queued.
[13:43:19] !log tools draining tools-exec-07,10,11 to allow virt host reboot
[13:43:25] Logged the message, Master
[13:47:12] andrewbogott: I'm going to give a couple minutes for the nonrestartable jobs to drain before I kill stuff.
[13:47:29] (All the restartable jobs have been migrated away already)
[13:47:30] ok — it’ll be a few before the bastions finish migrating anyway
[13:47:59] Whenever you are ready, just say so - it takes 30 secs to force flush the hosts.
[13:50:59] * Coren runs to grab some 'breakfast' and back.
[13:53:07] back.
[13:54:18] Grumble, Grumble, all my bots died
[13:56:14] Betacommand: We're doing the evil shuffle right now to attempt to rotate all the instances away from the faulty virt hosts to fix them.
[13:56:39] Betacommand: Expect a few hours of shakiness, then quiet. :-)
[13:57:22] * Coren is still hella annoyed that a major LTS release has a kernel that effs up virtualization by default.
[13:57:45] Coren: Like I've said, if there is a bug I'll catch it first
[13:58:10] Betacommand: You're not the first or only one this time - this affected a LOT of instances because a lot of virt hosts. :-(
[13:58:25] andrewbogott: 1001 and 1002 are smooth.
[13:58:36] andrewbogott: I've been keeping an eye on things running on them.
[13:58:40] Coren: I think I was the first
[13:59:10] Betacommand: Perhaps. But to be fair, that's because some of your bots are particularly sensitive. :-)
[13:59:28] Inherited code. :-)
[13:59:51] Coren: My bot is actually damn near impossible to break
[14:00:07] BCBot4 almost never goes down
[14:00:34] It's lived thru most nfs failures and other crap
[14:01:26] I admit that having the virtual CPU just go away for bits of time is hard on anything that tries to talk to the outside world in a timely fashion.
[14:06:19] PROBLEM - Host tools-exec-03 is DOWN: CRITICAL - Host Unreachable (10.68.16.32)
[14:08:54] shinken-wm: No it's not, shinken. It's just stunned.
[14:09:49] It's probably pining for the fjords.
[14:11:20] RECOVERY - Host tools-exec-03 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[14:14:10] 10Tool-Labs: toolsbeta: set up puppet-compiler / temporary-apply - https://phabricator.wikimedia.org/T97081#1233657 (10scfc) Perhaps I should explain how I use Toolsbeta after I essentially usurped it :-). I have "copied" Tools roles to Toolsbeta. As setting up new instances takes hours for the package install...
[14:18:41] PROBLEM - Host tools-exec-05 is DOWN: CRITICAL - Host Unreachable (10.68.16.34)
[14:22:25] RECOVERY - Host tools-exec-05 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[14:23:03] PROBLEM - Host tools-exec-06 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2419.55 ms
[14:25:40] <_joe_> !log deployment-prep installing hhvm 3.6.1 on mediawiki-deployment01
[14:25:44] Logged the message, Master
[14:27:05] andrewbogott: The exec nodes are drained. Fire at will!
[14:27:26] Coren: thanks. There’s one more instance that needs to migrate away and is taking ages for some reason.
[14:27:30] Doing 1006 in the meantime
[14:29:13] * Coren looks at what from tools runs there
[14:29:21] Ah, nothing.
[14:32:36] PROBLEM - Host tools-exec-11 is DOWN: CRITICAL - Host Unreachable (10.68.17.144)
[14:36:09] PROBLEM - Host tools-exec-01 is DOWN: PING CRITICAL - Packet loss = 12%, RTA = 2635.48 ms
[14:41:05] RECOVERY - Host tools-exec-01 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[14:43:07] PROBLEM - Host tools-exec-06 is DOWN: CRITICAL - Host Unreachable (10.68.16.35)
[14:47:57] RECOVERY - Host tools-exec-06 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[14:49:32] andrewbogott: Anything I can do to help you?
[14:49:44] Coren: not at the moment; just waiting for this migration to finish
[14:50:15] after 1003 is done I’ll want to repool and depool more exec nodes
[14:50:18] whenever that is
[14:54:13] PROBLEM - Host tools-exec-01 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2111.06 ms
[14:57:56] PROBLEM - Host tools-exec-06 is DOWN: CRITICAL - Host Unreachable (10.68.16.35)
[14:59:45] hello, my Ruby script runs perfectly when executed directly, but for some reason when I try to submit it to the grid with jsub it complains about libc.so.6: version `GLIBC_2.17' not found
[15:00:12] and that it's required by ruby
[15:01:03] unfortunately that's all the information I have, but I believe this can be fixed by updating the libc6 package
[15:04:07] MusikAnimal: That's actually a FAQ. The error message is obscure, but the cause is simple: you need more memory allocated to your job. :-)
[15:04:33] http://www.urbandictionary.com/define.php?term=norweigan+blue
[15:04:39] really, damn Ruby
[15:04:50] Hm. Or wait, that's not /quite/ the right message.
[15:04:57] You might just need trusty.
[15:05:12] Where are you testing your script by hand?
[15:05:32] well I first tried with the cron but it evidently failed
[15:05:39] oh "where"
[15:05:39] PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[15:05:45] my home directory
[15:05:50] for musikbot
[15:06:07] RECOVERY - Host tools-exec-01 is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms
[15:06:23] I meant on which server. :-) But since we default to trusty now, you should probably request -l release=trusty on your jsub too. Try that.
[15:08:07] yes! that ran it, however I believe it's using the wrong version of Ruby, not the Ruby I have set for my home directory with rbenv
[15:09:17] MusikAnimal: You may need to set up the environment - jobs are started with a totally "clean" environment by default.
[15:09:36] oh dear
[15:09:38] But also, you shouldn't be running bots from your personal account beyond simple testing.
[15:10:20] MusikAnimal: The very easiest way to do a proper environment setup is to make a shell script that exports all you need and then invokes your code - and jsub the script.
[15:11:45] well this is under musikbot, not musikanimal
[15:12:32] as for the shell script, where does this job "environment" live? e.g. I will need Ruby 2 installed and the gem dependencies
[15:13:59] I mean environment variables. Did you need to set up any in order for your ruby and gems to be used? Or is this just dependent on which directory you start your script from?
[15:16:15] I had to export rbenv and the ruby bin directory to the path. What determines which ruby to use is set by ~/.ruby-version on a per-directory basis
[15:17:01] Alright, so your invoking script should also set those exports (including the path)
[15:17:30] ok trying that now
[15:27:58] FLAPPINGSTOP - Host tools-exec-06 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[15:29:21] andrewbogott: Any eta?
[15:29:42] I don’t know what the story is — tools-exec-03 has been ‘migrating' all this time.
[15:29:53] If you want to drain that one too then I can just skip it and reboot
[15:30:39] andrewbogott: I think it's best.
[15:30:53] ok — let me know when it’s ready
[15:31:16] !log tools draining -exec-03 to ease migration
[15:31:19] Logged the message, Master
[15:32:03] I'd rather keep the job queues full for a little while than three nodes disabled for a long time. :-)
[15:32:30] * Coren gives the nonrestartables a few minutes to drain.
[15:33:02] Coren: woohoo! exporting vars and setting ruby within the shell script worked! thank you!!
[15:34:16] excellente! however it first complained about not being able to run the shell script so I did `chmod +x` because I don't know what user jsub is running it as
[15:34:38] It's running as you.
[15:34:47] Well, the tool's uid. :-)
[15:37:15] Coren: better yet, since that instance is wonky, can you just delete it and rebuild?
[15:37:24] that will give me slightly more peace of mind
[15:38:37] Coren: I guess I had it set so that even I couldn't run it =P
[15:38:43] I'm all set now, thank you for the help!!
[15:38:47] andrewbogott: I can just delete it and not rebuild it; we'll add a new exec node instead (otherwise we get into a pissing match with the host keys, etc)
[15:38:53] now the long wait for bot approval
[15:39:07] MusikAnimal: That's what I'm here for.
[15:39:22] Coren: works for me!
[15:39:49] (03CR) 10Ricordisamoa: "ping" [labs/tools/wikicaptcha] - 10https://gerrit.wikimedia.org/r/197514 (owner: 10Ricordisamoa)
[15:40:59] !log tools -exec-03 goes away for good.
[15:41:02] Logged the message, Master
[15:41:43] Ok rebooting...
[15:42:01] andrewbogott: I see -04 in migrating state too. Expected?
[15:42:26] yes, expected.
[15:42:27] That’s for the 1004 reboot.
[15:47:36] PROBLEM - Host tools-exec-11 is DOWN: CRITICAL - Host Unreachable (10.68.17.144)
[15:48:21] PROBLEM - Host tools-exec-07 is DOWN: CRITICAL - Host Unreachable (10.68.16.36)
[15:48:21] PROBLEM - Host tools-exec-10 is DOWN: CRITICAL - Host Unreachable (10.68.17.65)
[15:48:26] Coren: err, is that diamond spam transient while exec-03 is shutting down?
[15:48:45] PROBLEM - Host tools-exec-04 is DOWN: CRITICAL - Host Unreachable (10.68.16.33)
[15:49:07] valhallasw`cloud: No, it's a virt host being rebooted. The exec nodes were already drained and disabled.
[15:50:38] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:50:44] Coren: err, but it's mails about exec-03, and exec-03 doesn't exist anymore....?
[15:51:18] oh well, I'll just ignore it :p
[15:51:26] RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[15:51:27] Ah, no -exec-03 itself is gone for good, but it'll take some time for shinken to notice.
[15:51:36] RECOVERY - Host tools-exec-10 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[15:51:56] I thought you meant -07, -11 and -10 which had just popped up when you asked the question - those are reboots. :-)
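Putting together Coren's advice to MusikAnimal in the exchange above, request trusty explicitly and wrap the bot in a shell script that recreates the rbenv environment, since jsub starts jobs with a clean environment. A minimal sketch; the tool home, script name and bot filename are assumptions:

```bash
#!/bin/bash
# musikbot-wrapper.sh (hypothetical): re-create the rbenv setup that exists
# in an interactive shell, then start the bot.
export HOME=/data/project/musikbot                       # assumed tool home
export PATH="$HOME/.rbenv/bin:$HOME/.rbenv/shims:$PATH"  # rbenv on the PATH
eval "$(rbenv init -)"                                   # standard rbenv init

cd "$HOME/musikbot"     # ~/.ruby-version in this directory selects the Ruby
exec ruby bot.rb        # bot.rb is a placeholder name
```

```bash
# Make the wrapper executable and submit it to a trusty node, as discussed above.
chmod +x musikbot-wrapper.sh
jsub -l release=trusty -once -N musikbot ./musikbot-wrapper.sh
```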
[15:52:16] RECOVERY - Host tools-exec-11 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[15:52:48] Coren: no, but these are mails that are actually sent out by exec-03 (!) which suggests that it's still running, but just inaccessible
[15:52:59] Received: from diamond by tools-exec-03 with local (Exim 4.76)
[15:53:15] andrewbogott: ^^ does that make any sense to you?
[15:53:25] valhallasw`cloud: AFAIK, the instance is deleted.
[15:53:37] RECOVERY - Host tools-exec-04 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[15:53:39] Coren: ssh 10.68.16.32
[15:53:53] it's definitely still running
[15:54:06] !log tools reenabled tools-exec-07, -10 and -11 after reboot of host
[15:54:09] Logged the message, Master
[15:54:24] valhallasw`cloud: It seems to be stuck in a weird state (that is, in fact, the reason why it was deleted in the first place)
[15:54:30] Coren: it might be in an in-between state due to the weird migration...
[15:54:47] Do you know its nova id? I’ll try to murder it more completely.
[15:55:31] https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000000d6.eqiad.wmflabs , I think
[15:56:33] (exec-02 is d5, exec-04 is d8, so it might also be d7)
[15:56:44] Spooooky email beyond the virtual graaaave.
[16:02:31] Coren: ok, I finished off -03
[16:03:16] Could it be because there were two hosts called -03 at some point? Both d6 /and/ d7 were tools-exec-03 at some point, it seems
[16:03:22] Coren: can you now drain tools-exec-02, 08, 12?
[16:03:37] valhallasw`cloud: no, it was some fluke of live migration.
[16:03:45] *nod* makes sense
[16:03:46] andrewbogott: Yep.
[16:03:53] thanks for killing it :-)
[16:05:40] !log tools -exec-02, -08 and -12 draining
[16:05:43] Logged the message, Master
[16:09:10] Coren: ready for reboot?
[16:09:40] Hmmm. Three recent jobs sticking around - give 'em a minute or two more?
[16:09:48] ok
[16:11:28] andrewbogott: Drained. Fire at will.
[16:11:31] !log jouncebot kicking jouncebot with the instructions here https://wikitech.wikimedia.org/wiki/Jouncebot
[16:12:03] oh, no morebots
[16:12:11] the kicking didn't work anyways
[16:13:17] greg-g: Some virt hosts are being rebooted; I think the morebots one was in this batch.
[16:14:26] 6Labs, 7Puppet: Fix Puppet timestamp updater for wikitech - https://phabricator.wikimedia.org/T97082#1233956 (10hashar) My bad I thought notify would still emit a message to the client but it is only on the puppet master :(
[16:16:27] PROBLEM - Host tools-exec-12 is DOWN: CRITICAL - Host Unreachable (10.68.17.166)
[16:17:34] PROBLEM - Host tools-exec-08 is DOWN: CRITICAL - Host Unreachable (10.68.16.37)
[16:18:30] PROBLEM - Host tools-exec-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.31)
[16:20:32] why doesn't eg 16:15 < icinga-wm> PROBLEM - Host labvirt1004 is DOWN: PING CRITICAL - Packet loss = 100% announce here as well?
[16:21:16] greg-g: Hm, because that's a "production" host.
[16:21:18] RECOVERY - Host tools-exec-12 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[16:21:26] RECOVERY - Host tools-exec-08 is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms
[16:21:28] RECOVERY - Host tools-exec-02 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[16:21:53] Coren: which is kind of important to wmf labs :)
[16:22:28] Stop making sense!
[16:23:44] Stop making sense! Stop making sense! Stop making sense, making sense!
[16:24:34] I... think I'm scared now. Is it Friday afternoon yet?
[16:24:40] Coren: you can repool when ready.
[16:24:51] morebots is a normal tool, it should’ve survived just like anything else in tools...
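On the "finished off -03" step above: a sketch of how a stuck instance like the ghost tools-exec-03 can be hunted down and removed with the nova CLI. These are generic OpenStack admin commands under assumed credentials, not a record of what andrewbogott actually ran:

```bash
# Find any lingering records for the old exec node across all tenants.
nova list --all-tenants | grep -i tools-exec-03

# Inspect the suspect instance (the d6/d7 ids mentioned above).
nova show i-000000d6

# If it is wedged in a bogus state, reset the state so it can be deleted,
# then delete it for good.
nova reset-state i-000000d6
nova delete i-000000d6
```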
[16:25:33] !log tools repooled -exec-02, -08, -12
[16:26:41] 6Labs, 7Puppet: Fix Puppet timestamp updater for wikitech - https://phabricator.wikimedia.org/T97082#1233968 (10scfc) No, no, the blame lies with me :-). You proposed to keep the `notify` resources, //I// opted for `notice` functions. I'll probably fix this by server side functions. But testing is a bitch,...
[16:27:48] labs-morebots: you ok?
[16:27:48] I am a logbot running on tools-exec-08.
[16:27:49] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[16:27:49] To log a message, type !log .
[16:28:21] Ah, it was probably queued.
[16:28:58] And found room to run once I repooled them.
[16:29:05] !log tools repooled -exec-02, -08, -12
[16:29:11] Logged the message, Master
[16:30:15] Coren, I'm trying (testing) to move someone's bot to labs and need pip, do I just install it?
[16:30:42] Coren: I restarted all the morebots
[16:30:44] KTC you can install packages local to the tool.
[16:30:55] KTC: use virtualenv
[16:39:31] andrewbogott: Next step?
[16:39:55] Coren: next step is, wait until Monday and ask the deployment people if they’re happy.
[16:40:09] And if they are, I’ll restart scripted migration.
[16:40:23] Ooo. All the hosts have been rebooted?
[16:40:26] yep
[16:41:02] Yeay!
[16:41:15] I'm not seeing stalls at all right now.
[16:42:04] Thank $deity I managed to isolate and find what the bug was - this could have been a nightmare.
[16:42:28] (And that we lucked out that it was really /that/ bug and not some other random one)
[16:43:17] Yeah, so far I think your diagnosis was correct (and the whole problem)
[16:44:49] Well, for sure, we know it was really an issue - let's just pray that it was the /only/ issue. :-)
[17:35:35] 6Labs, 7Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1234152 (10yuvipanda)
[17:36:13] 6Labs, 5Patch-For-Review, 7Puppet: Labs: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/gmond_memcached.py] - https://phabricator.wikimedia.org/T95107#1234154 (10Dzahn) should be disabled now. is the error gone?
[17:37:13] Hey guys, can someone help me to connect to moses01-small.eqiad.wmflabs? I can connect to bast1001.wikimedia.org, but not to moses from there.
[17:37:50] bmansurov: if you get a message about /home, I think you have to reboot
[17:38:40] Nemo_bis: I don't get a message about /home. This is what I get: ssh: Could not resolve hostname moses01-small.eqiad.wmflabs: Temporary failure in name resolution
[17:45:40] (03PS4) 10Yuvipanda: Return a unique list of channels (remove dupes). [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881 (owner: 10Subramanya Sastry)
[17:45:57] (03CR) 10Yuvipanda: [C: 032] "Agree with Subbu about clarity > efficiency in this case." [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881 (owner: 10Subramanya Sastry)
[17:46:00] (03Merged) 10jenkins-bot: Return a unique list of channels (remove dupes). [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/205881 (owner: 10Subramanya Sastry)
[17:47:07] !log tools.lolrrit-wm yuvipanda: Deployed eb0c42177bf160a9990bde3b3a6d8500da786124 Return a unique list of channels (remove dupes).
[17:47:09] Logged the message, Master
[17:54:22] 6Labs, 7Tracking: WikiBrain - https://phabricator.wikimedia.org/T96950#1234238 (10Halfak) Thank you!
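bmansurov's name-resolution failure above is typical when going through the production bastion: *.eqiad.wmflabs hostnames only resolve inside labs, so the usual fix is to hop through the labs bastion instead. A sketch of the one-off command, assuming bastion.wmflabs.org as the labs bastion and with a placeholder username; the same ProxyCommand can live in ~/.ssh/config for all *.eqiad.wmflabs hosts:

```bash
# Reach a labs instance by proxying through the labs bastion.
# The instance name is resolved on the bastion, where labs DNS works,
# not on your workstation.
ssh -o ProxyCommand='ssh -W %h:%p youruser@bastion.wmflabs.org' \
    youruser@moses01-small.eqiad.wmflabs
```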
[18:08:52] 6Labs: Puppetize a kernel version check as requirement to installing nova-compute - https://phabricator.wikimedia.org/T97152#1234288 (10coren) 3NEW
[18:11:17] !log cvn Promoted Rxy from member to projectadmin
[18:11:19] Logged the message, Master
[18:14:53] 10Tool-Labs-xTools: Everything is displayed as 0 - https://phabricator.wikimedia.org/T97153#1234304 (10Josve05a) 3NEW
[18:17:17] hi folks, I just received two e-mails from your phabricator, these are about some of my patches which I have in (upstream) Gerrit
[18:17:34] so you might want to tweak your settings so that it doesn't mail on these events
[18:18:08] jkt: If I remember correctly, that's a setting you can tweak (what you receive email for, and whatnot)
[18:19:11] well I probably have an account on your Phabricator because I wanted to play with it many months ago
[18:19:29] so I don't really feel "abused" or anything, of course
[18:19:47] it's just that these e-mails probably shouldn't be going out
[18:20:10] for me it's just a notification "hey, random user of gerrit decided to upgrade and included some of my patches"
[18:23:13] jkt: Go to https://phabricator.wikimedia.org/settings/panel/emailpreferences/ and you can change your settings about what you receive email for, including turning off email notifications entirely.
[18:32:46] Coren: a kernel check seems like a good idea… is that done anywhere else in puppet currently?
[18:33:34] andrewbogott: I honestly don't know but it's a fact so it's easy to check for ($kernelversion)
[18:34:05] Hm. Might need $kernelrelease instead to get the - bit
[18:35:20] https://phabricator.wikimedia.org/T97152
[18:36:44] 10Tool-Labs-xTools: Everything is displayed as 0 - https://phabricator.wikimedia.org/T97153#1234451 (10MusikAnimal) This issue comes and goes. Currently the tools appear to be working fine. We once believed it was related to the replication database, but evidence has since suggested otherwise. Sometimes the edit...
[18:38:56] (03PS1) 10Krinkle: channels: Continuous-Integration is now Continuous-Integration-Infrastructure [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/206425 (https://phabricator.wikimedia.org/T96908)
[18:39:13] (03CR) 10jenkins-bot: [V: 04-1] channels: Continuous-Integration is now Continuous-Integration-Infrastructure [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/206425 (https://phabricator.wikimedia.org/T96908) (owner: 10Krinkle)
[18:40:30] (03PS2) 10Krinkle: channels: Continuous-Integration is now Continuous-Integration-Infrastructure [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/206425 (https://phabricator.wikimedia.org/T96908)
[19:03:07] !log cvn Restarted cvn-app4. Somehow it became unreachable for over an hour.
[19:03:12] Logged the message, Master
[19:07:05] 6Labs, 10incident-20150422-LabsOutage: Puppetize a kernel version check as requirement to installing nova-compute - https://phabricator.wikimedia.org/T97152#1234599 (10greg)
[19:14:52] 6Labs, 5Patch-For-Review, 7Puppet: Labs: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/gmond_memcached.py] - https://phabricator.wikimedia.org/T95107#1234619 (10Tgr) Can't test right now, but I did test the patch in a self-hosted puppetmaster when I...
[19:18:45] 10Tool-Labs-xTools: Everything is displayed as 0 - https://phabricator.wikimedia.org/T97153#1234633 (10Technical13) I'll happily troubleshoot this specific situation once we are moved into our own xtools.wmflabs.org instance. I'll have it email me everything when 0 results happen until I can find the cause.
[19:28:53] Coren: what's the command to find info about a completed job from the queue
[19:29:04] qacct -j
[19:48:24] 10Wikimedia-Labs-wikitech-interface, 6operations: Can not log into wikitech.wikimedia.org - https://phabricator.wikimedia.org/T96240#1234739 (10Andrew)
[19:48:27] 6Labs: Zillion expired tokens in keystone database - https://phabricator.wikimedia.org/T96256#1234737 (10Andrew) 5Open>3Resolved Today, 2441057. I declare this resolved.
[19:49:19] 10Wikimedia-Labs-Infrastructure, 5Patch-For-Review: Public IPs not being updated from OpenStack Nova plugin - https://phabricator.wikimedia.org/T52620#1234743 (10Andrew) Turning on audits should have worked, but the audit facility is throwing an exception. I'm going to wait until the Juno upgrade before I try...
[19:50:34] 6Labs, 6operations: OOM on virt1000 - https://phabricator.wikimedia.org/T88256#1234749 (10Andrew) 5Open>3Resolved This was probably https://phabricator.wikimedia.org/T96256 -- it doesn't seem to be happening anymore.
[19:52:32] 6Labs, 5Patch-For-Review, 7Puppet: Labs: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/gmond_memcached.py] - https://phabricator.wikimedia.org/T95107#1234762 (10Dzahn) p:5Triage>3Normal
[19:53:47] 6Labs: Support instance creation/deletion via nova commandline - https://phabricator.wikimedia.org/T97163#1234764 (10Andrew) 3NEW a:3Andrew
[19:54:45] andrewbogott: for nodepool.deb, I have fixed a trivial bug which is pending review upstream, so I have applied it as a quilt patch https://gerrit.wikimedia.org/r/#/c/205571/ :)
[19:55:14] andrewbogott: I have rebuilt the .deb http://people.wikimedia.org/~hashar/debs/nodepool/ . I don't think we need it on apt.wikimedia.org yet
[19:55:22] 6Labs: Abolish use of ec2id - https://phabricator.wikimedia.org/T95480#1234786 (10Andrew) Note also that nova notifications don't include ec2id. So in order to support instance creation/deletion via hook we should really move away from ec2ids in ldap.
[19:56:00] hashar: ok
[19:56:13] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs, 10Tool-Labs-tools-Article-request, and 9 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1234788 (10faidon) Ping? It's been quite a while.
[19:58:48] 6Labs: Support instance creation/deletion via nova commandline - https://phabricator.wikimedia.org/T97163#1234815 (10Andrew) p:5Triage>3High
[19:59:48] 6Labs: Abolish use of ec2id - https://phabricator.wikimedia.org/T95480#1234821 (10Andrew)
[19:59:50] 6Labs: Nova Instance creation hook for ldap - https://phabricator.wikimedia.org/T91987#1234823 (10Andrew)
[19:59:52] 6Labs: Support instance creation/deletion via nova commandline - https://phabricator.wikimedia.org/T97163#1234764 (10Andrew)
[20:00:20] 6Labs: Abolish use of ec2id for new instances - https://phabricator.wikimedia.org/T95480#1234830 (10Andrew)
[20:03:41] 6Labs: Support instance creation/deletion via nova commandline - https://phabricator.wikimedia.org/T97163#1234872 (10Andrew) Right now, new instances get two ldap records. The first is from OSM: # i-00000b91.eqiad.wmflabs, hosts, wikimedia.org dn: dc=i-00000b91.eqiad.wmflabs,ou=hosts,dc=wikimedia,dc=org object...
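The answer to Betacommand's question above, `qacct -j`, queries gridengine's accounting log for finished jobs. A slightly fuller sketch; the job id and name are placeholders:

```bash
# Accounting info for a finished job, by numeric job id...
qacct -j 1234567

# ...or by job name, which prints one record per past run of that job.
qacct -j musikbot

# Useful fields in the output include exit_status, failed, maxvmem and ru_wallclock.
```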
[20:09:54] 6Labs: Support instance creation/deletion via nova commandline - https://phabricator.wikimedia.org/T97163#1234903 (10Andrew) In no particular order: - Change OSM editing feature so that it edits fqdn-style records rather than ec2id-style records - Rename all ec2-style records to fqdn-style records - Change s...
[20:11:41] 6Labs: Support instance creation/deletion via nova commandline - https://phabricator.wikimedia.org/T97163#1234909 (10Andrew) In order to switch to the new cert names, it's nice to have a consistent fqdn. So... can we make that depend on 'move all instances to new dns server'?
[20:18:12] 6Labs, 10incident-20150422-LabsOutage: Puppetize a kernel version check as requirement to installing nova-compute - https://phabricator.wikimedia.org/T97152#1234922 (10Andrew) 5Open>3Resolved a:3Andrew Existing hosts are happy with that patch, so this is resolved.
[20:19:36] 6Labs: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1234925 (10hashar) It seems Designate supports PowerDNS and BIND, both support DNS split horizon. From http://designate.readthedocs.org/en/latest/architec...
[20:20:01] 6Labs: Switch all of labs to the pdns server, remove the use_dnsmasq flag. - https://phabricator.wikimedia.org/T97170#1234926 (10Andrew) 3NEW a:3Andrew
[20:21:02] 6Labs: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1234935 (10Andrew)
[20:21:25] 6Labs, 10Continuous-Integration-Infrastructure: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1234936 (10hashar)
[20:22:44] 6Labs: Support instance creation/deletion via nova commandline - https://phabricator.wikimedia.org/T97163#1234940 (10hashar)
[20:26:39] 6Labs, 10Continuous-Integration-Infrastructure: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1234943 (10Andrew) Before I dive headlong into the pdns config docs... can we just accomplish this by puppetizi...
[20:28:14] 6Labs, 10Continuous-Integration-Infrastructure: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1234945 (10yuvipanda) We can but that's totally terrible :) Also how will we generate them? We would need an en...
[20:30:10] 6Labs, 10Continuous-Integration-Infrastructure: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1234975 (10hashar) DNS masq aliases work on the IP addresses regardless of the DNS entry being queried. For to...
[20:35:19] (03PS1) 10Yuvipanda: Add blurb about cdnjs community to template [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206447
[20:38:07] 6Labs, 10Continuous-Integration-Infrastructure: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1235020 (10coren) I'll investigate a proper way to do split horizon dns with the pdns/designate combo
[20:38:10] (03PS1) 10Yuvipanda: Actually generate html output [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206452
[20:38:31] 6Labs, 10Continuous-Integration-Infrastructure: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1235022 (10coren) a:3coren
[20:39:39] 6Labs, 10Continuous-Integration-Infrastructure: Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry - https://phabricator.wikimedia.org/T95288#1235023 (10Andrew) Coren has, um, volunteered to make this happen in the pdns. As to whether this happens in p...
[20:44:10] (03PS2) 10Yuvipanda: Actually generate html output [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206452
[20:55:08] (03PS1) 10Yuvipanda: Design tweaks [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206456
[20:55:33] (03CR) 10Krinkle: [C: 031] Add blurb about cdnjs community to template [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206447 (owner: 10Yuvipanda)
[20:56:16] (03CR) 10Krinkle: Design tweaks (031 comment) [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206456 (owner: 10Yuvipanda)
[20:57:27] (03PS2) 10Yuvipanda: Design tweaks [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206456
[20:59:16] Krinkle: :D ty
[20:59:22] Krinkle: fixed them all. https://tools.wmflabs.org/cdnjs/ now.
[20:59:26] I’m going to merge and announce
[20:59:54] (03CR) 10Yuvipanda: [C: 032 V: 032] Add blurb about cdnjs community to template [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206447 (owner: 10Yuvipanda)
[21:00:07] (03CR) 10Yuvipanda: [C: 032 V: 032] Actually generate html output [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206452 (owner: 10Yuvipanda)
[21:00:20] (03CR) 10Yuvipanda: [C: 032 V: 032] Design tweaks [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206456 (owner: 10Yuvipanda)
[21:01:39] YuviPanda: body padding-top should be 70px instead of 52px per http://getbootstrap.com/examples/theme/
[21:01:46] so that there's a bit of space between the heading and the jumbotron
[21:02:07] there's a 2px glitch in between currently
[21:02:27] sure
[21:02:53] YuviPanda: You wrote the portal from scratch?
[21:03:00] Krinkle: yup
[21:03:02] (03PS1) 10Yuvipanda: More space between navbar and jumbotron [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206457
[21:03:04] Nice
[21:03:20] Krinkle: the current one depended on 3 different external services and a local mongodb install...
[21:03:23] (cdnjs.com that is)
[21:03:29] so I wrote one that is a simple static file
[21:03:43] created by a makefile or something?
[21:03:44] Krinkle: ^^ see patch
[21:03:50] Krinkle: python script
[21:03:51] jinja2
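As described in the exchange above (and the cron job mentioned just below), the cdnjs index is a static file rendered by a python/jinja2 script. A sketch of how such a generator might be wired up on Tools; the tool name, paths and script names are assumptions, not taken from labs/tools/cdnjs-index itself:

```bash
# Hypothetical crontab entry for the cdnjs tool: regenerate the static index
# hourly as a one-shot grid job.
# 0 * * * * /usr/bin/jsub -once /data/project/cdnjs/bin/build-index.sh

# build-index.sh (hypothetical): render the jinja2 template to a static HTML
# page served from the tool's public_html.
#!/bin/bash
set -e
cd /data/project/cdnjs/cdnjs-index
python generate.py > "$HOME/public_html/index.html.tmp"   # generate.py assumed
mv "$HOME/public_html/index.html.tmp" "$HOME/public_html/index.html"  # swap in place
```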
[21:03:52] (03CR) 10Krinkle: [C: 031] More space between navbar and jumbotron [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206457 (owner: 10Yuvipanda)
[21:04:00] (03CR) 10Yuvipanda: [C: 032 V: 032] More space between navbar and jumbotron [labs/tools/cdnjs-index] - 10https://gerrit.wikimedia.org/r/206457 (owner: 10Yuvipanda)
[21:04:17] I’m going to let it run on a cronjob
[21:09:30] Krinkle: https://etherpad.wikimedia.org/p/toollabs-cdnjs is announcement email. Anything else you think I should add?
[21:10:28] YuviPanda: Well check back in 30 min
[21:10:56] Krinkle: cool.
[21:12:09] (03CR) 10Merlijn van Deen: [C: 032] channels: Continuous-Integration is now Continuous-Integration-Infrastructure [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/206425 (https://phabricator.wikimedia.org/T96908) (owner: 10Krinkle)
[21:12:25] (03Merged) 10jenkins-bot: channels: Continuous-Integration is now Continuous-Integration-Infrastructure [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/206425 (https://phabricator.wikimedia.org/T96908) (owner: 10Krinkle)
[21:14:27] !log tools.wikibugs Updated channels.yaml to: 1d785dc3ad22a434749f8ec0d466180f3de9ea52 channels: Continuous-Integration is now Continuous-Integration-Infrastructure
[21:14:29] Logged the message, Master
[21:31:44] 6Labs: Create a system to automatically flag resource consumer tasks for admins of Labs - https://phabricator.wikimedia.org/T97184#1235193 (10Ladsgroup) 3NEW a:3Ladsgroup
[21:36:56] 6Labs: Create a system to automatically flag resource consumer tasks for admins of Labs - https://phabricator.wikimedia.org/T97184#1235215 (10yuvipanda) Thanks for your interest in doing this and sorry for my delay in responding. However, I am not sure if this is actually useful for tools. We hardly ever have t...
[21:54:54] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make tools-static redundant - https://phabricator.wikimedia.org/T96966#1235313 (10yuvipanda)
[21:54:56] 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1235311 (10yuvipanda)
[21:54:58] 10Tool-Labs, 3ToolLabs-Goals-Q4: Make tools-mail redundant - https://phabricator.wikimedia.org/T96967#1235312 (10yuvipanda)
[22:18:03] PROBLEM - Puppet failure on tools-static-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:24:51] PROBLEM - Puppet failure on tools-static-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:43:03] RECOVERY - Puppet failure on tools-static-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:44:44] !log tools.jouncebot Upgraded virtenv packages
[22:44:49] Logged the message, Master
[22:47:57] !log tools.jouncebot Failing to start with "ImportError: /usr/lib/x86_64-linux-gnu/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /data/project/jouncebot/virtenv/local/lib/python2.7/site-packages/lxml/etree.so)"
[22:47:59] Logged the message, Master
[22:48:46] YuviPanda: Got a minute or two to school me on getting a python virtualenv setup properly for running on the tools grid?
[22:49:51] RECOVERY - Puppet failure on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:57:02] bd808: sure.
[22:57:10] bd808: I think you’re being hit by trusty vs precise things.
[22:57:21] I just built a new trusty venv
[22:57:35] how do I make sure the job runs on a trusty node?
[22:57:37] https://www.irccloud.com/pastebin/Nz4WvVzk
[22:57:39] boo
[22:57:42] stupid irccloud
[22:57:48] bd808: did you add -l release=trusty to your jsub / jstart command?
[22:57:52] otherwise it defaults to precise
[22:58:01] I will :)
[22:58:46] I did get it running locally on tools-bastion-02 so progress!
[22:58:58] now to get the darn thing to stop :/
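For the virtualenv question bd808 raises above, the usual pattern is to build the venv on a host running the same release the job will run on, then pin the job to that release so compiled modules like lxml link against matching system libraries. A minimal sketch with placeholder tool and file names:

```bash
# On a trusty bastion, switch to the tool account ('mytool' is a placeholder).
become mytool

# Build the virtualenv on the tool's shared storage and install dependencies.
virtualenv /data/project/mytool/venv
/data/project/mytool/venv/bin/pip install -r /data/project/mytool/requirements.txt

# Submit the job pinned to trusty, as discussed above, using the venv's python
# so the job does not depend on the login shell's environment.
jstart -l release=trusty -N mybot \
    /data/project/mytool/venv/bin/python /data/project/mytool/bot.py
```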