[01:51:17] * Coren gives up for tonight.
[01:51:25] andrewbogott: I wouldn't reboot that box yet.
[01:51:36] (Not that we'd especially want to)
[01:52:09] Thanks for looking, I'm sorry that it's such a headache
[01:56:02] No, no, it's important. I don't like ticking timebombs and a box that may or may not reboot is... not fun.
[04:01:07] is there a list of tools, by order of popularity/usage ?
[04:02:15] I couldn't see anything obvious at wikitech or wmflabs itself...
[04:46:19] quiddity: We can break Labs for a few hours and see which tools people complain about most.
[04:46:30] * quiddity twitches
[04:46:35] I guess you'd want visitor stats?
[04:46:38] Not sure we have those.
[04:46:42] Even though we should.
[04:49:31] Mhm. I'm looking for any other angle than Alphabetical, which helps me get a handle on the hundreds of tools listed at http://tools.wmflabs.org/
[04:49:55] any angle other than*
[07:36:55] !Ping
[07:36:56] !ping
[07:36:56] !pong
[08:08:49] Trying to get a job running, consistently getting "libgcc_s.so.1 must be installed for pthread_cancel to work"
[08:10:30] multiwiki: heya! that means it is running out of memory
[08:10:39] multiwiki: try adding -mem 1G (or some other amount of memory) to your jsub command
[08:13:12] thanks. also "38408192 Nov 20 08:06 core": can I get rid of this binary file? it keeps growing
[08:14:23] yup yup you can
[08:14:29] that just means your code is core dumping
[08:18:51] thanks
[08:18:59] yw!
[11:59:50] !ping
[11:59:50] !pong
[12:01:25] quiddity: https://tools.wmflabs.org/hay/directory/#/
[12:02:05] guillom: i would like to post a blog post next week, handle through you ?
[13:54:54] Wikimedia Labs / deployment-prep (beta): Puppet failures on deployment-bastion - https://bugzilla.wikimedia.org/73520#c1 (Antoine "hashar" Musso (WMF)) NEW>ASSI a:Ori Livneh Seems to be caused by https://gerrit.wikimedia.org/r/#/c/173353/ keyholder: add /etc/keyholder.d and `keyholder arm` sub...
[15:21:10] Wikimedia Labs / deployment-prep (beta): Beta not picking up merged change - https://bugzilla.wikimedia.org/73659 (Gilles Dubuc) NEW p:Unprio s:normal a:None This changeset was merged more than 2 hours ago: https://gerrit.wikimedia.org/r/#/c/172953/3 Beta is still serving the old version o...
[15:24:00] matanya: yes, you can draft it on Meta: https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts and send me a link by email when it's ready :) 2-3 days of notice is ideal. Let me know if you have any questions!
[15:28:05] multiwiki: Also, that your code dumps core is a very bad symptom you may want to find the underlying cause of - it means that it crashes and that will definitely impact reliability.
[15:37:28] andrewbogott: Is it possible some of the disks on virt1009 came from virt1008 originally?
[15:38:07] Coren: I don't know. I wouldn't think so, but there may have been some shuffling when everything was renamed and shipped from tampa. Chris might know more
[15:39:34] andrewbogott: I found the problem, if not the root cause. Try this on virt1009 and despair:
[15:39:40] mdadm --examine --scan
[15:40:33] No wonder the poor thing is confused.
[15:40:33] there are two different /dev/md/0s?
[15:40:38] and /dev/md/1?
[15:40:46] Right, a native one and a foreign one.
[15:40:57] man
[15:41:16] so, I guess if those HP servers ever arrive, we should plan to evacuate virt1009 and rebuild it.
[15:41:21] :(
[15:41:38] It looks like some of the disks meant to be part of the array actually come from elsewhere. Note how mdadm --examine /dev/sda2 reports two drives of the array missing?
[15:42:55] Aha! /dev/sdh. It was built on virt1008
[15:43:12] The disks got mixed up!
[15:43:18] We're lucky this works at all
[15:43:32] And this means that 1008 has/will have the same problem.
[15:44:09] I can reconstruct the physical volumes and force an array rebuild, though.
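Editor's note: the duplicate /dev/md/0 and /dev/md/1 entries discussed above can be spotted mechanically. What follows is only a minimal sketch, assuming `mdadm --examine --scan` prints one `ARRAY <device> ... UUID=<uuid> ...` record per line. The function name, the sample text, the host names and the UUIDs are all made up for illustration; none of it is the real virt1009 output.

```python
from collections import defaultdict


def find_conflicting_arrays(scan_output):
    """Return {device: [uuid, ...]} for array devices claimed by more than one UUID."""
    uuids_by_device = defaultdict(list)
    for line in scan_output.splitlines():
        fields = line.split()
        # Only "ARRAY <device> key=value ..." records are of interest.
        if len(fields) < 2 or fields[0] != "ARRAY":
            continue
        device = fields[1]
        for field in fields[2:]:
            if field.startswith("UUID="):
                uuid = field[len("UUID="):]
                if uuid not in uuids_by_device[device]:
                    uuids_by_device[device].append(uuid)
    # A device path seen with two different UUIDs means two different arrays claim it.
    return {dev: ids for dev, ids in uuids_by_device.items() if len(ids) > 1}


# Hypothetical sample: a "native" and a "foreign" array sharing the same device names.
SAMPLE_SCAN = """\
ARRAY /dev/md/0 metadata=1.2 UUID=11111111:11111111:11111111:11111111 name=hostA:0
ARRAY /dev/md/1 metadata=1.2 UUID=22222222:22222222:22222222:22222222 name=hostA:1
ARRAY /dev/md/0 metadata=1.2 UUID=33333333:33333333:33333333:33333333 name=hostB:0
ARRAY /dev/md/1 metadata=1.2 UUID=44444444:44444444:44444444:44444444 name=hostB:1
"""

conflicts = find_conflicting_arrays(SAMPLE_SCAN)
for device, uuids in conflicts.items():
    print(device, "is claimed by", len(uuids), "different arrays")
```

On a real host the input would come from running the command itself (as root); the sketch deliberately works on captured text so that it can be tried without touching any disks.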
[15:44:15] so, I still don't see how this happened… two running servers, and then the drives were hot-swapped?
[15:44:25] oh? Without data loss?
[15:44:52] Yeah; right now we're running in degraded mode anyways - those drives are disabled because they're part of the wrong array.
[15:45:12] Perhaps during the move and only the more recent version of the OS notices and complains about the discrepancy?
[15:46:16] ... oddly enough, I don't see virt1009 disks in 1008's arrays. Dafu?
[15:49:26] maybe an old server was scrapped and something else put in its place
[15:49:42] I'm sure we did some of that
[15:51:58] But yes, I should be able to offline the drive, reconstruct it as part of the right array, and online it again.
[15:54:22] great!
[15:55:54] This should make all traces of the virt1008 array disappear, and make grub-probe happy.
[15:57:12] You may be interested to know that the UUID of the "real" virt1008 array does not match that of the drive on 1009 - it might just be a case of recuperated drives that were never properly reinitialized.
[15:59:27] Once the drives are feeling more at home… do we need to force grub to reinstall itself?
[15:59:42] andrewbogott: It's best to.
[15:59:42] Or do you think grub is fine, just alerting us to the issue?
[16:00:02] You said the raid is running in degraded mode? Why doesn't icinga show a drive as having failed then?
[16:00:17] I'm pretty sure grub has refused to do /anything/ because of the issue - the previous bootloader is probably still there and working, but I don't know if I'd chance it.
[16:01:00] andrewbogott: They're not failed, they're just *gone*. They are holes in the array since the last boot and not drives meant to be part of it but dead.
[16:01:44] andrewbogott: Holey sheets. We're luckier than we thought - there's indeed /dev/sdh that comes from the wrong array but /dev/sdf is actually dead *too*
[16:01:51] It was DOA so never part of the array.
[16:02:35] Yeay redundancy.
[16:03:01] I can fix sdh, but sdf will need a new drive I think.
[16:04:49] [4409276.037509] scsi 6:0:5:0: [sdf] Unhandled error code
[16:04:50] [4409276.037515] scsi 6:0:5:0: [sdf] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[16:04:56] Doesn't look good.
[16:12:11] That failed drive shouldn't affect the fix though.
[16:16:59] andrewbogott: sdh is now part of the "correct" array and in recovery mode. sdf still quite dead.
[16:36:25] Wikimedia Labs / deployment-prep (beta): Beta not picking up merged change - https://bugzilla.wikimedia.org/73659#c1 (Antoine "hashar" Musso (WMF)) NEW>RESO/DUP The beta cluster Jenkins slave (deployment-bastion) ends up having all its executing slots locked by the Gearman plugin. The only way t...
[16:59:06] (CR) Legoktm: [C: 2] Add support for creating tarballs of skins [labs/tools/extdist] - https://gerrit.wikimedia.org/r/174468 (owner: Legoktm)
[16:59:15] legoktm: want me to merge the ops change?
[16:59:20] yuvipanda: no, not yet
[16:59:23] legoktm: ok
[16:59:29] legoktm: I'll be around for an hour or so more
[16:59:33] yuvipanda: we're gonna do that in an hour
[16:59:35] ok :D
[16:59:40] heh
[16:59:53] Wikimedia Labs / Infrastructure: puppet labsstatus not reported when using role::puppet::self - https://bugzilla.wikimedia.org/63296#c7 (Antoine "hashar" Musso (WMF)) NEW>RESO/WON The reason I filed this bug was to use OpenstackManager as a dashboard of puppet run. Over the last few weeks, Yuvi...
[17:03:07] andrewbogott: 8913 created to track sdf; you want me to cc?
[17:03:19] Coren: yes please
[17:04:13] {{done}}
[17:06:09] matanya, that's great, thanks :)
[17:07:23] andrewbogott: FYI, grub-update worked now that there are no phantom arrays. We're still in a tenuous position because we have two drives' worth of lost redundancy but that's fixable.
[17:08:04] So, an RT for chris to replace the drives?
[17:10:55] yuvipanda: how do I force a puppet run?
[17:11:03] legoktm: 'sudo puppet agent -tv'
[17:11:19] ty
[17:20:50] !log fab disabled puppet on fab instance, going to set up redirect
[17:20:50] fab is not a valid project.
[17:20:55] !log phabricator disabled puppet on fab instance, going to set up redirect
[17:21:00] Logged the message, Master
[17:21:11] !log phabricator (disregard previous message) disabled puppet on fab2 instance, going to set up redirect
[17:21:12] Logged the message, Master
[18:27:33] (PS1) Legoktm: Fix format string [labs/tools/extdist] - https://gerrit.wikimedia.org/r/174752
[18:28:00] (CR) Legoktm: [C: 2] Fix format string [labs/tools/extdist] - https://gerrit.wikimedia.org/r/174752 (owner: Legoktm)
[18:28:14] (CR) Legoktm: [V: 2] Fix format string [labs/tools/extdist] - https://gerrit.wikimedia.org/r/174752 (owner: Legoktm)
[18:34:55] (PS1) Ori.livneh: add passwords::phabricator [labs/private] - https://gerrit.wikimedia.org/r/174758
[18:35:27] (CR) Ori.livneh: [C: 2 V: 2] add passwords::phabricator [labs/private] - https://gerrit.wikimedia.org/r/174758 (owner: Ori.livneh)
[18:38:34] !log extdist moved all extensions to /srv/dist/extensions, enabled skin dist support and other fun things
[18:38:35] Logged the message, Master
[18:42:22] (PS1) Ori.livneh: add labs mwdeploy_rsa [labs/private] - https://gerrit.wikimedia.org/r/174761
[18:42:38] (CR) Ori.livneh: [C: 2 V: 2] add labs mwdeploy_rsa [labs/private] - https://gerrit.wikimedia.org/r/174761 (owner: Ori.livneh)
[18:50:07] andrewbogott: is there a reason icmp pings seem disabled by default inside labs?
[18:50:16] I see tools has a specific security rule allowing it...
[18:50:37] legoktm: can you add that to extdist too? port range -1 to -1, source 10.0.0.0/8, protocol icmp
[18:50:37] yuvipanda: I think it was used for puppet freshness tests back in the day, but no longer.
[18:50:47] So -- not a good reason that I know of
[18:50:49] andrewbogott: right, but shinken uses ping to test reachability
[18:50:55] andrewbogott: is there a way for us to turn it on by default?
[18:51:19] In what sense is it off? Blocked by security groups?
[18:51:22] andrewbogott: yup
[18:51:53] !log extdist added port range -1 to -1, source 10.0.0.0/8, protocol icmp to security roles
[18:51:56] Logged the message, Master
[18:51:59] Hm, I thought that the default security group for a new project allowed it
[18:52:01] yuvipanda: done, and ping works now.
[18:52:10] heh
[18:52:11] ** RECOVERY alert - extdist2/ is **
[18:52:26] legoktm: :)
[18:52:41] andrewbogott: hmm, extdist was a while back, I guess
[18:53:51] yuvipanda: looking at a random selection of projects (well, two) and it's enabled in the default group for both
[18:53:55] hmm
[18:53:58] wasn't on ext
[18:53:59] dist
[18:54:05] I guess it just skipped it or somesuch
[18:54:11] I'll just enable it if I run into other projects where it's disabled
[18:55:18] yuvipanda: so the alerts will just say "PROBLEM" and it's up to me to figure out what's broken? :P
[18:55:32] legoktm: nope, if an individual 'service' breaks they will say what it is
[18:55:41] legoktm: only the host notification is like this
[18:55:46] I'll fix shortly
[18:55:50] ok
[18:56:16] legoktm: but email delivery is kind of shit atm
[18:56:18] legoktm: http://shinken-test.wmflabs.org/host/extdist2
[18:56:26] legoktm: you've puppet failures on ext2
[18:56:33] yeah
[18:56:38] the biglogs thing
[18:56:43] legoktm: oh, yeah, right
[18:56:47] yuvipanda: it wants a login?
[18:56:48] ldap?
[18:56:53] legoktm: no, 'guest/guest'
[18:57:08] such secure, much wow
[18:57:36] lolol
[18:57:55] legoktm: I couldn't find a way to disable auth
[18:58:10] legoktm: this is temporary, of course :) eventually I'll have to write an auth plugin
[18:58:12] maybe even SUL
[18:58:47] legoktm: on the plus side, it's no longer just betacluster machines having puppet problems ;)
[18:58:52] legoktm: can you remove the biglogs role?
[18:58:58] >.<
[18:58:59] yeah
[18:59:05] legoktm: remove and force ar un?
[18:59:06] *a run
[19:01:43] !log extdist removed biglogs role from extdist2 and now puppet is happy
[19:01:45] Logged the message, Master
[19:02:07] yuvipanda: ^
[19:02:15] legoktm: yayyway
[19:04:12] yuvipanda: how often does shinken check/re-check?
[19:04:17] legoktm: every 5mins
[19:04:22] legoktm: but it looks at data for last 10min
[19:04:28] also service check emails are busted.
[19:04:29] let me look
[19:07:13] (PS1) Legoktm: test_skin_repo_list passes now [labs/tools/extdist] - https://gerrit.wikimedia.org/r/174766
[19:08:04] (CR) Legoktm: [C: 2] test_skin_repo_list passes now [labs/tools/extdist] - https://gerrit.wikimedia.org/r/174766 (owner: Legoktm)
[19:15:42] legoktm: I'm going to mess around with extdist1.eqiad.wmflabs for something.
[19:15:45] unrelated to extdist
[19:15:51] sure
[19:39:16] (PS1) Legoktm: Make tarballs a ton smaller by not including .git [labs/tools/extdist] - https://gerrit.wikimedia.org/r/174777
[19:39:33] (CR) Legoktm: [C: 2] Make tarballs a ton smaller by not including .git [labs/tools/extdist] - https://gerrit.wikimedia.org/r/174777 (owner: Legoktm)
[19:44:56] !screen
[19:44:56] script /dev/null
[19:51:46] legoktm: help me test something?
[19:51:49] legoktm: remove the icmp rule again?
[19:52:11] ok
[19:53:04] !log extdist removed the icmp security rule thing for yuvipanda
[19:53:06] Logged the message, Master
[19:53:14] cool, now let's see if shinken notices
[20:01:24] legoktm: add that rule again?
:)
[20:01:41] yuvipanda: can I just make you an admin of extdist? I thought you were...
[20:01:51] legoktm: i am, but lazy to re-login
[20:02:09] legoktm: but if you're busy I can do that
[20:02:17] !log re-added icmp rule
[20:02:18] re-added is not a valid project.
[20:02:24] !log extdist re-added icmp rule
[20:02:25] Logged the message, Master
[20:02:41] yuvipanda: I just got an email without the hearts
[20:02:55] fuck yeah :)
[20:02:57] woot
[20:03:07] it's just... slow?
[20:07:13] sure quiddity :)
[20:13:19] !log extdist force re-generated all tarballs to get rid of .git directories
[20:13:21] Logged the message, Master
[20:29:50] Coren: what do you think about showing dots in webservice (re)start until the webservice has started?
[20:30:35] annika_: I see no issue with the concept, though that's very 1980's. :-)
[20:30:44] tch tch :)
[20:31:06] andrewbogott: md0 rebuilt, md1 is rebuilding.
[20:32:03] Coren: what happens then? Does the raid rebalance onto the new drives?
[20:32:37] andrewbogott: It's raid 10; the gone drives just meant we had some stripes with no mirror.
[20:33:01] andrewbogott: In practice, read performance was impacted too and should recover somewhat.
[20:33:16] Cool.
[20:33:28] I have one more remaining post-upgrade issue and then we're happily icehoused.
[20:33:31] andrewbogott: We still have one drive missing from the array; but that'll be returned once Chris has an opportunity to physically intervene.
[20:33:37] Coren: you already do it when stopping/restarting
[20:33:39] (I guess the bad drive thing isn't technically an icehouse issue)
[20:34:14] andrewbogott: No, but it's one I'm really glad the upgrade surfaced - we were in danger of losing virt1009 entirely at any time.
[20:35:47] I'm still confused how a virt1008 drive ended up in virt1009 and yet was never properly integrated in the array.
[20:36:27] !ping
[20:36:27] !pong
[21:07:04] andrewbogott: ping. you were looking earlier to be notified of any/all changes for a particular host.
did you find that?
[21:08:06] yuvipanda: can I get a little more context?
[21:08:26] andrewbogott: a few days ago, you were looking for a way to have icinga notify you if there were any alerts at all for the virt* machines
[21:08:30] andrewbogott: did you find a way?
[21:08:38] Ah, that. No, I didn't.
[21:08:50] ah, yeah, looks like there's no way.
[21:08:50] sigh
[21:19:31] yuvipanda: I'm receiving many 100s of bounces addressed to you.
[21:19:40] Over 1000 already
[21:19:40] bounces?!
[21:19:41] why
[21:19:50] why am *I* not getting them? I thought it was shinken being dead
[21:19:52] let me stop them now
[21:20:15] yuvipanda: Because they are bounces back to root. You /should/ be getting them in your opsness.
[21:20:16] Coren: ok, should've stopped now.
[21:21:04] Coren: hmm, I'm not getting the originals nor the bounces...
[21:21:07] Coren: can you forward me one?
[21:21:12] Coren: and sorry for flooding inboxes.
[21:21:21] I need to figure out a better way to test email delivery
[21:22:29] yuvipanda: Sent.
[21:22:54] I'm not sure what you tried, but I can tell you it didn't work right. :-)
[21:23:21] Coren: yeah, *that* one was just me testing on a 'wrong' email address (gmail.org, remnant of replacing wikimedia.org).
[21:23:23] sorry about that :)
[21:23:42] service state change emails still not being delivered, and I think I know why, but I'm hoping I'm wrong
[21:23:46] and verifying now
[21:26:40] sigh, ok
[21:26:45] I'll have to fix shinkengen
[21:26:56] hmm, or not
[21:29:21] Coren: I'm trying to debug an email delivery problem. I was expecting about 200 emails from shinken, but only got... 4.
[21:29:36] Coren: any idea where I could / should be looking? it's just using cmdline 'mail' to send emails
[21:50:06] yuvipanda: You may want to look at the /var/log/exim4/ logs
[21:50:38] Coren: yeah, it wasn't an exim issue. as I suspected it was a shinken issue and I might have a fix.
[22:00:54] yuvipanda: uh, are these alerts for real?
** PROBLEM alert - extdist1/Puppet staleness is CRITICAL **
[22:00:58] legoktm: nope
[22:01:03] legoktm: just testing
[22:01:05] ok
[22:01:08] legoktm: AND YAY EMAIL NOTIFICATION WORKS
[22:02:01] Yuvi; bounce flood again.
[22:02:05] bah
[22:02:11] what, to whom?
[22:02:12] hundreds, for guest@guest
[22:02:19] oh.
[22:02:42] hmm, I'll find a way to null that contact out.
[22:02:59] but email delivery at least seems to work for everyone else
[22:03:27] Coren: hmm, if it was bouncing for guest@guest it would've been flooding for a month now...
[22:03:48] And yet, there they go.
[22:03:53] * yuvipanda tries to stop that as well
[23:02:26] Coren: load on the only available webgrid is rising. is there any reason not to make the other available for new jobs, too?
[23:36:59] It is, but it runs trusty so you have to start webservices manually (webservice script doesn't yet understand asking for -lrelease=trusty)