[00:35:21] hi, is anyone around?
[05:44:30] anyone around?
[05:45:42] can't ssh into dptypes.pmtpa.wmflabs, but http is fine?
[05:52:03] andrewbogott: ^
[05:52:13] andrewbogott: is hung on the instance's ssh
[05:52:17] i got through bastion just fine
[05:57:05] YuviPanda: what project is that?
[05:57:11] andrewbogott: design
[05:57:18] andrewbogott: although bd808 tells me he can't get to *anything* from bastion
[05:57:37] andrewbogott: which I can confirm
[05:57:51] I can't even get to bastion now
[05:58:36] I can. What are you seeing?
[05:58:37] yeah bastion1 not answering for me anymore.
[05:59:24] andrewbogott: http://p.defau.lt/?w1V1oqFwej8XsSLXzSa_wg
[05:59:32] andrewbogott: https://dpaste.de/Egdj
[05:59:52] before that it was https://dpaste.de/v27k
[06:00:04] does the bastion host resolve in dns?
[06:00:41] andrewbogott: yes
[06:01:13] andrewbogott, YuviPanda: switching my ssh relay from bastion1 to bastion2 seems to make things better.
[06:01:22] ah hmmn
[06:02:19] andrewbogott: bd808 indeed, seems to work for me too
[06:02:48] andrewbogott: bastion1 missing from dns for me http://p.defau.lt/?IDh0rqWQGQxLyLY5lmZ8uA
[06:08:46] Well… the dns thing I can't see or explain.
[06:08:58] But, bastion1 is maybe oom or something, I'm going to reboot it
[06:15:47] bd808, YuviPanda, better?
[06:16:18] btw, bd808, there is no bastion1.wmflabs.org dns entry, just bastion.wmflabs.org
[06:16:50] andrewbogott: I think I have an alias in my ssh_config
[06:16:55] That confused me
[06:17:22] andrewbogott: yeah, works now. just a OOM then :)
[06:21:51] andrewbogott: works for me too. Thanks!
[08:06:29] andrewbogott: can you try accessing chicken.wmflabs.org?
[08:06:37] andrewbogott: doesn't resolve for me
[08:06:41] andrewbogott: is using dynamicproxy
[08:06:45] YuviPanda: I'm pretty swamped at the moment, sorry
[08:06:50] andrewbogott: ok
[08:08:24] andrewbogott: just a fyi, instance-proxy also does not resolve.
Not just me, but for two other people as well (prtksxna and vbamba).
[08:08:27] do look into it when you can
[08:15:09] en.wikipedia.beta.wmflabs.org is dog slow
[08:15:18] whoever can look into that
[08:16:00] gry: it's not resolving for me at all
[08:16:03] like other parts of wmflabs
[08:18:27] wmflabs.org unreachable for me too
[08:18:41] hi YuviPanda
[08:18:46] hi spagewmf
[08:18:48] spagewmf: yup.
[08:18:55] spagewmf: we should all poke andrewbogott again :)
[08:19:03] http://www.downforeveryoneorjustme.com/en.wikipedia.beta.wmflabs.org says down
[08:19:11] looks like a DNS outage
[08:19:17] tools is up though
[08:19:24] we're working on it, don't panic
[08:20:08] andrewbogott: ah, ok :)
[08:20:10] * YuviPanda unpanics
[08:20:14] hmm, I should be on ops.
[08:21:23] andrewbogott: thanks man YuviPanda plus the bastion{,2,3}.wmflabs.org gateways all unreachable
[08:22:55] spagewmf: I see some work on it being done in -ops. I guess things will be back on soon :)
[08:23:30] better now?
[08:23:50] andrewbogott: \o/ thanks! yup :)
[08:24:35] should be back on virt0
[08:24:39] running on virt1000
[10:28:34] !log tools Reset error status of task@tools-exec-09 ("can't get password entry for user 'local-voxelbot'"); "getent passwd local-voxelbot" works on tools-exec-09, possibly a glitch
[10:28:37] Logged the message, Master
[10:30:43] !log tools tools-login: rm -f /var/log/exim4/paniclog (OOM)
[10:30:44] Logged the message, Master
[10:42:12] !log tools tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list", cf. [[bugzilla:61583]])
[10:42:13] Logged the message, Master
[13:06:45] scfc_de: just started getting the following for a self hosted puppet instance
[13:07:18] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::labs::instance for wikidata-builder1.pmtpa.wmflabs on node wikidata-builder1.pmtpa.wmflabs
[13:09:05] any ideas ?
:<
[13:10:05] addshore: I think Andrew and Coren introduced that new class. Have you tried rebasing your Puppet changes on a new pull from operations/puppet?
[13:10:22] no! good point and will do! :) cheers!
[13:11:07] scfc_de beat me to it. Yeah, pull and you'll be full of happies. (We cleaned ::base from the instance-specific stuff and put them in that new class)
[13:11:20] :)
[13:11:50] * addshore is looking forward to the big move
[13:11:55] hope you both are too ;p
[13:12:18] I'm very much looking forward to it being complete. :-)
[13:15:12] Coren, working already?
[13:15:49] Not really, I'm in morning coffee mode. Do you have happy fun news? :-)
[13:15:58] um… sort of?
[13:16:11] It still doesn't work, but I have more information about why it doesn't.
[13:16:21] And I fixed one serious problem, which turned out to not be the only problem :(
[13:18:08] yesterday the metadata was lying to instances and telling them the wrong instance ID, making cert-signing impossible...
[13:18:32] Today, that's fixed. But instances still don't know their domain. hostname -d returns ""
[13:18:58] The raw ubuntu image doesn't have that problem somehow.
[13:19:56] Do you have a broken image accessible?
[13:22:01] sure...
[13:22:26] well, not that you can log into, but you can see the log :/
[13:22:44] verbose3 has a good log, I 'set -x' the firstboot script
[13:23:24] Why not log into? You didn't stuff the new server key in there?
[13:25:53] I did...
[13:25:59] Or, at least, I tried to
[13:35:36] Coren, ryan turned on the firewall for virt1000 today, so it's moderately possible that that's interfering somehow. I don't know why that wouldn't have broken virt0 at the same time though
[13:35:52] Yeah, I don't think it's related.
[13:36:02] Where did you add the key? I'm not seeing it in postinst.sh?
[13:36:39] Oh, um… labsconsole takes a list of scripts to inject. So that's local to virt1000
[13:36:57] Ah.
[13:37:02] I suspect that the instance is getting stuck somehow before it gets to that script...
[13:37:10] Dunno… the network seems generally busted
[13:37:12] Ima make a new image in a bit to test some things.
[13:37:37] OK. Having just done it a couple of times I'm happy to do so...
[13:38:07] Oh, okay. Watch for a changeset to postinst.sh shortly then. :-)
[13:38:12] 'k
[13:40:29] Oh, well, I guess if we don't have ldap then the root key doesn't help, right? Because no login accounts are enabled.
[13:44:06] Root can get in even if ldap is broken since it's a local account.
[13:44:16] https://gerrit.wikimedia.org/r/114458
[13:45:14] I don't think the new-install key helps since palladium doesn't have access to instances. But I'll just add your key.
[13:49:38] You can reach the instances from iron, at least.
[13:50:22] hm, ok.
[13:53:32] ok! Coren, this doesn't have your packaging change yet, but try root@208.80.155.129
[13:53:44] works for me, should for you too
[13:54:26] It does.
[13:54:43] * Coren investigates.
[13:55:15] Hmm
[13:56:47] Salt is installed in the image?
[13:57:18] * Damianz wonders how viable it is to do that adding button to force run puppet thing... could be awesome for migration time
[13:57:58] * Damianz notes he should fix the whole group authentication feature request he raised like years ago which would make that simple to authorize in salt-api from ldap
[14:11:26] andrewbogott: From what I can tell, we got a few kinks in the boot order that trample the domain name (which is correctly gotten from dhcp)
[14:11:41] where does hostname -d get the domain? resolv.conf?
[14:11:49] * Damianz_ pokes freenode with a sharp stick
[14:15:19] If it were a boot order thing, wouldn't the system know its domain name by now?
[14:15:47] andrewbogott: kernel.
[14:17:26] andrewbogott: I think it got trampled. I'm looking into it now.
[14:19:47] Coren, often on startup the instances complain about not being able to see /dev/sda2.
They hang for a while, then get on with things.
[14:19:54] Is that a related issue, do you suppose?
[14:21:12] andrewbogott: No. There is some apparent confusion with block device names that prevents swapon from working, but that's fixable.
[14:21:20] and wouldn't be the problem.
[14:21:23] ok
[14:21:38] I'm looking into subtle differences in the bootstrap atm.
[14:24:01] Amongst other strangeness, our current working image doesn't actually seem to be using resolvconf from dhclient.
[14:36:23] Coren, there's an updated version of vm-builder available, I'm going to rebuild with that, just to see...
[14:36:47] andrewbogott_: I doubt that's our current issue, but it won't harm for sure.
[14:41:08] andrewbogott: Did you fiddle with resolv.conf on newtest2?
[14:41:28] Coren: I didn't. It's crazy, right?
[14:45:16] So yeah, now I know exactly /what/ goes wrong, just not quite yet why.
[14:45:57] progress!
[14:46:51] So, right now, when the system boots it gets all it needs from DHCP, and it's correctly in the lease, but for some reason something then tramples all over the (correct) resolv.conf with a broken one that mentions pmtpa.
[14:48:34] hm, that sure sounds similar to https://gerrit.wikimedia.org/r/#/c/113092/
[14:49:31] Ah-ha!
[14:49:38] * Coren investigates further.
[14:50:15] Need to reboot newtest2. okay with you?
[14:50:28] sure, go ahead.
[14:50:48] Note that that patch clears resolv.conf.d/original but not resolv.conf itself. At the time ryan was sure that was sufficient but I don't remember why anymore
[14:51:44] ... that's kind of backwards actually -- /original is the backup of what resolvconf tramples.
[14:53:02] Also, I just realized we're doing something really wrong in postinst.sh. Can't use redirection with chroot
[14:55:03] huh, it occurs to me that I don't know how to empty a file without redirects. rm and touch?
[14:57:14] andrewbogott: truncate would work.
But I'm just eschewing the chroots entirely since we're only playing with files and not running things.
[14:57:37] https://gerrit.wikimedia.org/r/114470
[14:58:26] Also, the files don't need to be present and empty; removing them works just fine.
[14:58:29] (I just double checked)
[14:58:48] ok.
[14:59:05] Make an image with this and we will have lots of joy.
[14:59:06] I'm in process of building a new image, so I'll remove those files by hand when it finishes.
[14:59:15] To verify...
[15:55:05] andrewbogott: No new news?
[15:55:21] Coren: It's better!
[15:55:34] Point me at one?
[15:55:37] Somehow the new image doesn't include puppet… probably a side-effect of me upgrading somehow...
[15:55:52] So it doesn't actually come up properly, but it does know its hostname and such.
[15:56:49] Should still allow you to apt-get puppet and run it, no?
[15:59:50] yep, I'm doing that now, seeing what it needs.
[16:01:24] Ok, this also complains about sudo-ldap.
[16:01:33] So… either we add a password, or install it ahead of time.
[18:15:50] scfc_de, http://ganglia.wmflabs.org/latest/?c=tools&h=tools-db&m=load_one&r=hour&s=by%20name&hc=4&mc=2 tools-db seems to have a high load
[18:17:35] Coren, ^
[18:17:55] Coren: I can't log into tools-db.pmtpa.wmflabs, either from bastion3 or with HBA from tools-login.
[18:18:09] * Coren checks.
[18:19:20] Yeah, the box is suffering from overload.
[18:20:55] Coren: I get "Permission denied (publickey).".
[18:20:58] linkwatcher is really punishing the poor DB
[18:21:15] scfc_de: Unrelated. That's a leftover from the NFS outage that'll clean up next reboot.
[18:21:27] Coren, Are you going to shut down link watcher?
[18:21:42] Cyberpower678: No, why would I?
[18:21:58] Because my spambot slowed down, like a lot.
[18:23:13] Executed in 4.98 seconds.
[18:23:21] I'm not going to start deciding whether tool X is "more important" than tool Y.
We're resource starved a bit until the move is over; we'll just have to keep a stiff upper lip and soldier on.
[18:23:44] I'm not going to complain about it.
[18:33:31] Coren: any eta on getting tools.wmflabs migrated? I'll be glad to move my stuff
[19:17:11] Betacommand: Soon. We're gunning for ~2 weeks.
[19:17:46] I've noticed a few OOM cron emails recently
[19:21:09] Betacommand: that's because of people running heavy jobs on tools-login
[19:26:49] valhallasw: any way to find out who and trout them?
[19:29:31] scfc_de and Coren are doing a good job doing that for you
[19:30:18] Iz whack-a-mole though.
[19:31:37] maybe build a crontab scanner that lists all commands that are not jsub? :p
[19:31:46] Coren: public shaming is sometimes effective :)
[19:32:40] valhallasw: I've got all but one script that use jsub. and that one only runs for about 15 seconds once a day
[19:35:18] and the only reason I do that is I need the emailed output from that cron job :P
[19:40:22] Betacommand: that *should* be possible with SGE (it's also possible to request a shell, for instance)
[19:40:30] -sync y makes jsub wait
[19:41:51] valhallasw: I decided it's just easier to run the single script that just emails a few files, and that prints the results
[19:42:20] I've had to fight sge on several issues
[19:43:17] qrsh should do the job
[19:54:42] Nemo_bis: if he's the only maintainer, what is your question?
[19:54:51] valhallasw: For crontab scanner, there's https://bugzilla.wikimedia.org/54720 (and I use that in my poor-mans-icinga). But I also want to avoid too many false positives and neither block people who know what they are doing from doing that.
[19:55:17] * Nemo_bis has no questions
[19:57:17] "error: can't open output file "/dev/stdout": No such file or directory" - interesting :-).
[19:57:47] valhallasw: Did you get the SGE error mail as well?
[19:57:51] scfc_de: oh, I wouldn't block people, but just sending people an e-mail 'You have some non-jsub lines in your crontab.
Please consider migrating them to SGE' might already help
[19:58:20] scfc_de: no? I /was/ playing around to get output to stdout
[19:58:35] but -o /dev/stdout did not work :-)
[19:58:47] qrsh /did/ work
[19:59:31] Oh, I didn't mean block = removing access, but block = a wrapper around crontab that would require each line to have "jsub" in it, for example.
[19:59:59] valhallasw: "GE 6.2u5: Job 2634942 failed"; don't know why SGE didn't mail the submitter?
[20:00:50] scfc_de: not sure, it also didn't return. The command I used was jsub -sync y -o /dev/stdout -e /dev/stderr /home/valhallasw/test.sh
[20:15:11] !log tools tools-login: Disabled crontab for local-chobot and left a message at [[:ko:사용자토론:ChongDae#Running bots on tools-login, etc.]]
[20:15:13] Logged the message, Master
[20:26:30] scfc_de: is there a particular reason why I should use jsub instead of qsub provided I'm sending it ready-made shell scripts w/options?
[20:26:55] just curious, don't tend to use jsub
[20:29:26] Nettrom: the main difference is jsub puts all output in a single file (jobname.out) instead of jobname.out.XXXXXX
[20:29:53] and I think jsub specifies the required memory at 256M if it's not given on the command line
[20:30:03] otherwise it's just a thin qsub wrapper
[20:30:52] valhallasw: I redirect stdout and stderr in my scripts and specify vmem and running time, so it sounds like I won't gain much :)
[20:31:13] (tell SGE to redirect stdout and stderr, that is)
[20:50:59] Nettrom: Nope, if you are comfortable with qsub you can use it directly. Jsub is mostly just a more-newbie-friendly wrapper that provides useful defaults.
[20:58:13] Nettrom: One other difference is that jsub sets the mail address (-M) to $username@tools.wmflabs.org; the default qsub address doesn't work.
[21:16:37] scfc_de: that's a nice point to know, I'll make a note about that, thanks!
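
(Editor's sketch, not part of the log.) The jsub/qsub exchange above names three defaults that jsub is said to add on top of qsub: a single output file, 256M of memory when none is requested, and a mail address of $username@tools.wmflabs.org. A rough Python illustration of that idea follows. The helper name build_qsub_command, the use of the h_vmem resource for the memory limit, and the separate .err file are assumptions made for illustration; this is not jsub's actual source.

```python
import getpass
import os


def build_qsub_command(script, job_name=None, user=None, mem="256M"):
    """Return a qsub argv approximating the jsub defaults described in the log.

    Hypothetical helper for illustration only; the real jsub may differ.
    """
    user = user or getpass.getuser()
    # Default the job name to the script's base name without its extension.
    job_name = job_name or os.path.splitext(os.path.basename(script))[0]
    return [
        "qsub",
        "-N", job_name,
        # One fixed output file instead of jobname.out.XXXXXX per run.
        "-o", f"{job_name}.out",
        # Assumption: errors go to a matching .err file.
        "-e", f"{job_name}.err",
        # Assumption: the 256M default is expressed as an h_vmem request.
        "-l", f"h_vmem={mem}",
        # Mail address per the log; plain qsub's default reportedly fails.
        "-M", f"{user}@tools.wmflabs.org",
        script,
    ]
```

Someone already passing their own -o, -e, -l and -M options to qsub, as Nettrom describes, would get little from such a wrapper beyond the mail address default.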
[21:23:46] !log tools tools-login: Disabled crontab for local-chobot and left a message at [[User talk:Reza#Running bots on tools-login, etc.]] ([[:fa:بحث_کاربر:Reza1615]] is write-protected)
[21:23:48] Logged the message, Master
[21:24:11] Eh, that should have been local-rezabot; I'll fix it at /SAL.
[21:25:41] Alchimista: sup.
[21:27:29] Alchimista: protip: you can pass username and password to nickserv using the server password instead of a message to nickserv
[21:33:04] Everyone seems to be having connection issues today O_o
[21:42:21] !ping
[21:42:21] !pong
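
(Editor's sketch, not part of the log.) On the 14:53 to 14:58 exchange about postinst.sh: a redirection written on a chroot command line is opened by the calling shell before chroot runs, so the target path resolves outside the chroot. The approach described in the log is to skip chroot and operate on the files under the image root directly, either truncating or removing them. A minimal Python illustration of that approach follows; the function name clear_in_image and the paths in the usage note are hypothetical.

```python
import os


def clear_in_image(image_root, rel_paths, remove=True):
    """Empty or delete files inside a mounted image tree without chroot.

    Hypothetical helper: each path is resolved under image_root, so no
    shell redirection (and no chroot) is involved.
    """
    touched = []
    for rel in rel_paths:
        # Strip the leading slash so the path stays under image_root.
        target = os.path.join(image_root, rel.lstrip("/"))
        if not os.path.lexists(target):
            continue
        if remove or os.path.islink(target):
            # Unlink symlinks instead of truncating through them: an
            # absolute link inside the image would point at the host.
            os.remove(target)
        else:
            os.truncate(target, 0)
        touched.append(target)
    return touched
```

A call such as clear_in_image("/mnt/image", ["/etc/resolv.conf"]) would remove the stale file from an image assumed to be mounted at /mnt/image, which matches the observation in the log that removing the files works as well as emptying them.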