[08:40:13] hi. Do we have UTF-8 support enabled by default in the tools DB?
[09:32:32] !log integration deleting trusty instance slave1003, no real time to play with Trusty right now. Will recreate it as a Precise instance to add a new slave
[09:32:34] Logged the message, Master
[09:59:48] petan: let's see whether the robot answers when I say ping?
[12:29:25] !log integration created integration-slave1003 (Ubuntu Precise) and switched it to use local puppet master.
[12:29:27] Logged the message, Master
[13:02:15] Coren, MySQL Workbench just lost its connection to the tools-db server
[13:02:16] I can't reconnect.
[13:03:02] It's up again
[13:03:09] So nevermind
[13:03:17] Cyberpower678: I see no outage; must have been network issues.
[13:03:55] Considering everything else worked including this IRC session, it couldn't have been on my end. But regardless, it works again.
[13:08:08] a930913: it's been more than 6 days since the last incident! :)
[13:14:59] YuviPanda: I was updating it around 1400 UTC, but seeing as I'm going to sleep now :p
[13:18:42] a930913: :P
[15:27:32] Wikimedia Labs / Infrastructure: tools-exec-cyberbot does not resolve - https://bugzilla.wikimedia.org/65948 (Tim Landscheidt) NEW p:Unprio s:major a:None tools-exec-cyberbot currently does not resolve to an IP: | scfc@tools-login:~$ nslookup tools-exec-cyberbot | Server: 10.68.16....
[15:51:11] ok, so back to my labs issues
[15:51:21] how can I ssh into this instance? https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000025a.eqiad.wmflabs
[15:52:19] dMaggot: ssh to the bastion, then from there ssh to that instance
[15:52:25] what do I do with an instance whose puppet status is "failed"?
[15:52:32] valhallasw: what's the DNS of that instance
[15:52:48] bastion.wmflabs.org, I think
[15:52:49] not 100% sure
[15:52:54] @instance I-0000025a.eqiad.wmflabs
[15:52:58] valhallasw: err, I mean, what's the host name of that instance
[15:52:58] the instance you linked to is wlm-apache1.eqiad.wmflabs
[15:53:02] @labs-instance I-0000025a.eqiad.wmflabs
[15:53:23] @labs-info I-0000025a
[15:53:31] mhm
[15:53:41] maybe you need bastion-eqiad.wmflabs.org
[15:53:53] @labs-info bastion
[15:53:57] It says 'wlm-apache1' right up top of the page you linked?
[15:53:57] @labs-info bastion1
[15:54:07] Maybe I misunderstand the question
[15:54:40] my problem is that I get "ssh: Could not resolve hostname wlm-apache1.eqiad.wmflabs: Name or service not known" when I try to ssh from bastion
[15:55:12] this lab instance predates eqiad, if that's of any help
[15:55:22] yes
[15:55:26] I remember the old hostname was something like wlm-apache1.wmpta.wmflabs or something
[15:55:27] dMaggot: .wmflabs is probably a 'helper' hostname in your ssh config
[15:55:36] that's probably not on the bastion?
[15:55:46] I think it's gone then
[15:55:51] gone...
[15:55:55] all instances on old cluster were wiped
[15:56:04] not wiped, just shut down
[15:56:14] dMaggot: it looks to me like the wlmjudging instances are mothballed.
[15:56:22] Which is to say -- off, and not working without some intervention.
[15:56:32] Probably salvageable, possibly not.
[15:56:59] andrewbogott: huh... ok, it's not super urgent to get it right now but I would like to see if there's a way to rescue some files there at some point
[15:57:47] dMaggot: I can see about reviving it.
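valhallasw's point about `.wmflabs` being a 'helper' hostname refers to client-side ssh configuration. A minimal sketch of such a stanza, assuming OpenSSH; `yourusername` is a placeholder, and `bastion.wmflabs.org` is the hostname quoted in the log:

```
# ~/.ssh/config (sketch): route *.eqiad.wmflabs instances through the bastion.
# "yourusername" is a placeholder, not taken from the log.
Host *.eqiad.wmflabs
    User yourusername
    ProxyCommand ssh -W %h:%p yourusername@bastion.wmflabs.org
```

A stanza like this only exists on the machine where it is configured, which is why the shortcut can work from a local laptop but not once you are already on the bastion.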
[15:58:03] andrewbogott: ok, that would be great
[15:58:17] andrewbogott: as I said, nothing you need to do right now so let me know when is a good time for you
[15:59:49] so my other question is what to do with instance https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000003be.eqiad.wmflabs whose puppet status is failed
[16:00:38] dMaggot: you can log in and see why it's failing, or you can just burn it down and start a new one
[16:01:22] andrewbogott: trying to ssh into it hangs (nothing happens)
[16:01:41] dMaggot: are you able to ssh into other (non-bastion) labs boxes?
[16:02:18] andrewbogott: the only lab instance I used to log into was that wlm-apache1 that is now gone
[16:03:04] you can also view the syslog from wikitech
[16:03:11] But there's no cost to just trashing that box and trying again
[16:03:47] andrewbogott: I'm afraid I'll take the same steps and get the same results
[16:04:09] if so then you'll still have learned something :)
[16:04:17] And it's only one step, right?
[16:04:24] Or was the instance up and running originally?
[16:04:25] "ssh wikiviajesve-jurytool.eqiad.wmflabs" times out for me as well. What are the security groups for that project?
[16:05:08] Oh, finally got "ssh_exchange_identification: Connection closed by remote host". Don't know if that is from ProxyCommand, though.
[16:06:20] andrewbogott: When you've got some time, could you please look into https://bugzilla.wikimedia.org/65948 (tools-exec-cyberbot not resolving)?
[16:06:50] andrewbogott: no, I created it from scratch
[16:08:12] On bastion, I get "ssh: connect to host wikiviajesve-jurytool.eqiad.wmflabs port 22: Connection timed out". So it looks like the network's blocking it.
[16:09:18] scfc_de: yes, maybe remind me again in 20 minutes :)
[16:09:27] k
[16:09:47] dMaggot: tell me your project name again?
[16:10:12] Wikiviajesve https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikiviajesve
[16:10:18] instance is https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000003be.eqiad.wmflabs
[16:11:56] yeah, scfc_de is right -- your security group doesn't allow ssh access.
[16:12:04] that… surprises me.
[16:12:06] Unless you changed it?
[16:12:47] anyway, dMaggot, try now?
[16:13:00] andrewbogott: works now
[16:13:13] scfc_de: sorry, didn't realize you were adding info to my issue, thanks!
[16:13:33] dMaggot, did you alter the security group? (If not I have a bug that I need to log)
[16:14:01] andrewbogott: huh, I don't think I did, but then again I'm not sure what I did
[16:14:18] andrewbogott: can you change the sec group in the configure page?
[16:14:43] dMaggot: no, that would be here: https://wikitech.wikimedia.org/wiki/Special:NovaSecurityGroup
[16:15:07] andrewbogott: I have never seen that page
[16:15:46] dMaggot: Meanwhile… I'm not having much luck reviving wlm-apache1. I can salvage files off it though, do you know specifically what you need from there?
[16:17:09] andrewbogott: /var/www? or whatever the web server was serving... now, that would be mostly filled with symlinks to another folder that is the one I care about
[16:18:15] hm
[16:18:20] ok, I'll see what I can do.
[16:23:06] andrewbogott: I got this from puppet http://pastebin.com/xs8NDy2Z
[16:23:11] andrewbogott: is there something I need to do?
[16:23:35] dMaggot: that's my fault! Just a second...
[16:24:05] dMaggot: count to 100 and try again… should be fixed or at least different
[16:24:14] andrewbogott: I can count really fast...
[16:24:25] I didn't have shared volumes turned on. So puppet was upset about not being able to find an NFS volume that wasn't there.
[16:25:07] dMaggot: I think the refresh happens every 2 mins
[16:25:31] andrewbogott: Speaking of, did we ever figure out a way to expose those values to puppet so that the manifest stops trying?
[16:25:54] No, but I don't think I thought about it very hard
[16:26:53] I mean, the warnings are harmless but noisy.
[16:27:07] OTOH, there is no "volume cost" anymore so it's not clear that it still makes sense to have that configurable at all.
[16:27:26] Yeah, I think probably we should just remove that config setting and set it to always on
[16:27:31] s/that/those/
[16:27:47] At least for /data/project. I suppose there might be some reasons one might want to have a not-shared /home?
[16:31:45] I'm not sure… I can think of plenty of reasons why having it not shared would cause unpleasant surprises
[16:31:54] nice to have everything backed up.
[16:32:08] True.
[16:32:33] Coren, should it be easy to mount /data/project on one of the virt hosts? And if so can you feed me the mount string off the top of your head?
[16:32:33] andrewbogott: still not working
[16:32:37] for project wikiviajesve
[16:32:41] dMaggot: what now?
[16:32:46] andrewbogott: same error
[16:33:30] dMaggot: ok, give me a minute and I'll look
[16:33:38] might be that it just needs a few more minutes to catch up
[16:33:44] andrewbogott: ok
[16:33:57] Easyish. mount -tnfs4 -o rw,noatime,hard,bg,proto=tcp,port=0 labstore.svc.eqiad.wmnet:/project//project /data/project
[16:35:07] andrewbogott: In general though, there should already be an entry in the fstab so "mount /data/project" should Just Work™
[16:35:41] Coren: This is on virt1001, not on a VM
[16:36:02] Ah. Then the ACLs wouldn't let you; we don't add the virt* boxen to the list.
[16:36:12] ok, that's what I was wondering.
[16:36:20] So, hm, where should I put these salvaged files...
[16:36:46] You can always scp them to labstore
[16:37:07] oh, good point
[16:37:09] that's easy!
[16:37:54] To /srv/project//project/
[16:39:38] yep, doing now
[16:39:39] thx
[16:40:04] andrewbogott: different errors http://pastebin.com/KyxrDJJJ
[16:45:23] dMaggot: I get a clean puppet run on that box now.
[16:45:30] lemme check one other thing...
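Coren's mount string maps onto an fstab entry roughly like the following. A sketch only: `<project>` marks the project name that is elided as `//` in the quoted command, and the options mirror the ones quoted:

```
# /etc/fstab (sketch) -- same options as the mount command quoted above;
# <project> is a placeholder for the elided project name
labstore.svc.eqiad.wmnet:/project/<project>/project  /data/project  nfs4  rw,noatime,hard,bg,proto=tcp,port=0  0  0
```

With such an entry in place, a bare `mount /data/project` looks up the device and options from fstab, which is what makes it "Just Work" on instances.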
[16:45:50] andrewbogott: the wiki page now shows puppet status OK
[16:46:33] yeah, looks like it's totally happy now.
[16:46:51] andrewbogott: thanks!
[16:47:10] andrewbogott: now... I need to make a public web server out of it
[16:47:14] dMaggot: on that instance, if you look in /data/project/wlm-apache1 you will find a raw copy of the /var/www dir from the old wlmjudging box
[16:47:30] andrewbogott: oh thanks
[16:47:34] I wouldn't advise trying to copy those files in bulk but at least you should have the info you need.
[16:47:53] Please verify right now that every file you need is there -- I want to clean up all my mountpoints and it's a bit tedious to set them up again.
[16:48:27] andrewbogott: yes, all good
[16:48:34] cool. OK, enjoy :)
[16:57:47] scfc_de: I'm looking at your bug now but haven't learned much yet. That box has a perfectly reasonable host entry in ldap. No idea why pdns ignores it
[16:58:17] scfc_de: but I also can't log in to it… can you?
[16:59:18] andrewbogott: I can, but from tools-login via host-based auth. I didn't try directly from bastion.
[16:59:42] scfc_de: ok, I'll try that in a minute. Meanwhile… what's up with puppet on that box?
[17:01:04] andrewbogott: On tools-exec-cyberbot? "puppetd -tv" = multiple complaints about not being able to resolve the hostname.
[17:01:14] ah, right, you said that already :)
[17:01:21] scfc_de: attempted to sudo? :P
[17:01:43] thanks for the reminder :)
[17:02:39] YuviPanda: Eh, yes?! Though even that ("sudo -i") says "sudo: unable to resolve host tools-exec-cyberbot" (probably looks up hostname to check permissions for sudo).
[17:03:14] scfc_de: nah, I just got a failed-sudo warning email, so joking around :)
[17:04:40] YuviPanda: Ah! Well, I get these, too :-).
[17:04:53] scfc_de: indeed, hence the ':P'
[17:05:03] scfc_de: obligatory http://xkcd.com/838/ :)
[17:06:07] YuviPanda: lol, ops gets mail for all of them
[17:08:31] mutante: hah, yeah. I got a stern talking-to from Leslie a few years ago
[17:09:27] YuviPanda: haha :)
[17:10:05] scfc_de: https://dpaste.de/yCAh <- any thoughts?
[17:13:06] andrewbogott: I don't see any significant difference. Would restarting pdns be very disruptive?
[17:13:28] scfc_de: it would shock me if that made a difference, but it's harmless. I'll try in a moment.
[17:14:04] (Thinking negative caching.)
[17:14:17] (Though that wouldn't explain the false negative in the first place.)
[17:16:16] scfc_de: Actually, first I'm going to bump dnsmasq on that host
[17:17:09] @translate en de why do we get these german sudo e-mails? :P | scfc_de
[17:17:43] : tools-exec-cyberbot : May 30 17:01:30 : scfc : Hostname tools-exec-cyberbot kann nicht aufgelöst werden ["Hostname tools-exec-cyberbot cannot be resolved"]
[17:19:34] petan: Because I have set LANG to de_DE.UTF-8 :-).
[17:22:48] scfc_de: that doesn't seem to have done anything…
[17:22:53] any idea if/when dns worked for that box?
[17:27:33] andrewbogott: No. I think I logged in after the eqiad re-setup, but I'm not sure. /var/log/puppet.log is stamped May 25th, so I assume before that it worked fine, but no hard evidence.
[17:28:04] * and doesn't have those errors.
[17:28:16] d'you know how hard it would be to rebuild that box?
[17:28:23] Might be the better part of valor since I don't know how to proceed
[17:29:16] Wikimedia Labs / Infrastructure: tools-exec-cyberbot does not resolve - https://bugzilla.wikimedia.org/65948#c1 (Andrew Bogott) I don't know what's going on here. I've verified that that instance has a perfectly reasonable host entry in ldap. I restarted pdns and that did nothing. If I can figure ou...
[17:29:26] Coren built that for Cyberpower678; so I don't know what needed to be done to get it back in SGE. Coren, can you chime in?
[17:30:08] scfc_de: I'm prepping for the NFS server maintenance, but I can take a quick look afterwards (~1h from now)
[17:31:08] Works for me.
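About the "sudo: unable to resolve host" symptom: sudo resolves the local hostname before evaluating its policy, so a common stopgap while DNS is broken is a local /etc/hosts entry. A sketch only, using the Debian/Ubuntu convention of 127.0.1.1 for the machine's own name (the address is an assumption, not taken from the log):

```
# /etc/hosts (sketch): let local lookups of the instance's own name succeed
# while the real pdns/dnsmasq issue is investigated. 127.0.1.1 is the usual
# Debian/Ubuntu self-hostname convention, assumed here.
127.0.1.1   tools-exec-cyberbot.eqiad.wmflabs tools-exec-cyberbot
```

This silences sudo and puppet's resolution complaints on the box itself; it does nothing for other hosts trying to resolve the name.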
[17:31:28] scfc_de: meanwhile I'm going to go eat some curry and see if inspiration strikes. Sorry I don't have a quick fix.
[17:31:51] andrewbogott: Rebuilding a grid node is generally simple, the only caveats have to do with host key exchange via /data/project/.system/store
[17:32:35] Coren: Ok. Maybe later on you can explain what it means for a grid node to have a special-purpose name like that one
[17:33:08] andrewbogott: Bon appetit!
[17:33:14] andrewbogott: The name is meaningless and for human consumption only; the actual functionality lives in the gridengine queue definitions.
[17:33:40] Coren: OK, so presumably that would have to be altered to support a replacement node...
[17:34:07] anyway, back soon...
[17:45:56] scfc_de, as long as none of my bots need restarting, there's no hurry, however, if it's not done in the next few hours, bot tasks will start piling up from the crontab.
[18:16:45] petan: howdy. around? I'm wondering about managing large numbers of keys in wmbot
[18:17:03] petan: the salt folks have thousands of keys and would like to manage it in a saner way than talking with the bot
[18:37:59] I seem to be unable to connect to labs.
[18:37:59] to tools or all of labs?
[18:37:59] I guess it's all on the nfs server now
[18:37:59] and that's likely down
[18:37:59] yay single points of failure
[18:37:59] sigh
[18:37:59] maybe we should try glusterfs
[18:37:59] * Ryan_Lane ducks
[18:37:59] Great. I've got a borked edit counter.
[18:37:59] Coren: ^^ that's what this is, right?
[18:37:59] same here
[18:37:59] And I can't fix it. :/
[18:37:59] can't login on any instance
[18:37:59] Scheduled maintenance.
[18:37:59] As announced some bit ago.
[18:37:59] And repeated some hours ago.
[18:37:59] :-)
[18:37:59] Coren: icinga bot :)
[18:37:59] And will need to be rolled back because happy fun switch was not configured for 802.3ad
[18:37:59] * Nemo_bis doesn't see anything about login becoming impossible, in the announcements
[18:37:59] home directories are on nfs
[18:37:59] Nemo_bis: That's a predictable side effect of your /home not being available.
[18:37:59] Though I might have wanted to mention that specifically - existing sessions wouldn't be affected though.
[18:37:59] Might be predictable but it's not obvious
[18:37:59] Anyway, no big deal :)
[18:37:59] my existing sessions hang
[18:37:59] Well, for me even screen stopped working
[18:37:59] (can't exit nor enter any screen)
[18:38:00] That'd depend on what you were trying to do -- anything that touches your home will politely wait until it comes back.
[18:38:00] NFS should be coming back gradually starting now.
[18:38:46] To be fair, you promised to confirm or cancel the maintenance at least three days in advance, not one hour :-) (though people would have been just as "surprised" then :-)).
[18:39:49] scfc_de: Oh, hm. I forgot that I said /confirm/ or cancel; I simply defaulted to nothing changed.
[18:41:55] I think anything not requiring user reaction and lasting only minutes is okay anyhow.
[18:42:28] -login seems to be having more trouble coming back, though -dev is back.
[18:43:10] Ah, and seemingly I was just not quite patient enough as -login is back for me.
[18:43:25] yay
[18:43:43] Coren, I got the announcement, but not the repeat.
[18:43:44] !log deployment-prep Updated scap to c4204dd
[18:43:46] Logged the message, Master
[18:44:24] Coren: of course all my bots died
[18:45:17] ... how? The only thing this has done is stall filesystem access on the remote filesystems; no errors were returned since those are all hard mounts.
[18:46:49] My webservice died
[18:47:08] Coren: the bots failed to read the file system and died
[18:47:35] happens every time
[18:47:41] More likely my webservices on many of my tools died.
[18:48:19] Which tools?
[18:48:29] supercount and xtools
[18:48:31] ... what. None of my bots or webservices die, though they of course stall during the outage.
[18:48:50] But thanks to hedonil, they autorecover
[18:49:10] Perhaps some timeouts that are not recoverable? I.e.: DB connection timeouts, code doesn't know how to reconnect, dies?
[18:49:34] Coren, I was in the process of updating them, so perhaps.
[18:50:16] Because nothing that isn't disk bound would even notice the outage, and anything trying to read or write from an NFS partition would just sleep until it came back.
[18:50:38] But if, I guess, some network connection timed out while the tool was waiting for the disk...
[18:51:07] that might cause a problem if the code doesn't have handling to recover from the connection closing I suppose.
[18:51:13] Job scheduling stuck?
[18:51:52] Cyberpower678: Yeah, nothing would be started during the outage; but the scheduler would just do so as soon as possible.
[18:52:10] in 10 min: https://www.mediawiki.org/wiki/Architecture_meetings/RFC_review_2014-05-30 discussion of extension registration RfC in #wikimedia-office
[18:52:12] It's preventing my webservice from rebooting.
[18:52:14] Coren: Cyberpower678 meant "qstat -j 1247350" which shows both webgrids being overloaded. I assume that's the rebound.
[18:52:15] Coren: https://tools.wmflabs.org/paste/view/22cb57b7
[18:53:25] Still waiting to start up...
[18:53:38] Betacommand: d-wxrws--x 8 tools.betacommand-dev tools.betacommand-dev 4096 May 29 20:20 /data/project/betacommand-dev
[18:53:38] Missing read there. "Amusing". Fix with: chmod +r .
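Coren's `d-wxrws--x` diagnosis and `chmod +r .` fix can be reproduced locally. A small sketch (directory and file names here are made up): with only the execute bit, a directory can still be traversed by exact filename, but not listed, which is why "nothing that didn't do an 'ls' would have noticed":

```shell
# Demonstrate a directory missing its read bit, like the log's
# d-wxrws--x on /data/project/betacommand-dev. Names are made up.
mkdir -p demo-dir
echo hello > demo-dir/known-file
chmod 0311 demo-dir            # -wx for owner: write/traverse, no read bit
cat demo-dir/known-file        # still works: traversal only needs 'x'
ls demo-dir 2>/dev/null || echo "listing fails without read (as non-root)"
chmod +r demo-dir              # the fix from the log: restore the read bit
ls demo-dir                    # now listing succeeds
```

Tab completion breaking (as Betacommand noticed) follows from the same rule: completion has to list the directory, which needs the read bit.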
[18:54:00] Coren, There was an error collecting ganglia data (127.0.0.1:8654): fsockopen error: Connection refused
[18:54:22] Cyberpower678: Unsurprising, ganglia would have been stalled too.
[18:54:43] So how long until everything is up again?
[18:54:49] Cyberpower678: But yeah, as the living webservices are catching up, load is unusually high so the scheduler is waiting for room. That may take a couple more minutes.
[18:54:56] Coren: Grr
[18:55:18] Cyberpower678: Ganglia's downtime isn't related to this.
[18:55:33] scfc_de, then what is it related to?
[18:55:33] scfc_de: Oh? Just a coincidence in time?
[18:56:37] hedonil, a bug to work out in the altered webservice.
[18:56:43] It starts with lots of memory free after a reboot, and then clogs up until it fails (same for weeks). hashar wants to split the aggregator and the web app to two different instances for more room to wiggle. But it's not related to NFS stalls.
[18:56:47] Cyberpower678: heh. Should have disabled webwatcher during maintenance
[18:56:54] Coren: I haven't made any changes to permissions
[18:56:57] Cyberpower678: no bug
[18:57:18] Coren: I know you did
[18:57:24] hedonil, the webwatcher shouldn't try to requeue the task when it's still in status qw
[18:57:42] Betacommand: It may have been my fault when I tweaked the permissions yesterday; nothing that didn't do an 'ls' would have noticed; if you know the filename, 'x' is enough to traverse directories.
[18:57:51] Cyberpower678: it did what it was supposed to do - restart on outage
[18:58:13] Cyberpower678: it has -once flag, so it's a thing of grid engine
[18:58:21] Yes, and it restarted it. But it's constantly trying to restart something that's waiting to start
[18:58:22] Load on -webgrid-01 is going back down to normal levels.
[18:58:58] Cyberpower678: if the world breaks, webwatcher can't fix :P
[18:59:05] Coren: however tab completion is broken
[18:59:07] at least not in v.1.0
[18:59:41] Cyberpower678: xtools back again
[18:59:44] Betacommand: Ah, yes, that's implicitly the same as an ls. I've added the read permission for user now.
[18:59:57] Coren: thanks
[19:00:27] Betacommand: Like I said, that was probably my fault when I tweaked the permissions yesterday; I wouldn't have noticed because root.
[19:00:31] Sorry 'bout that.
[19:00:58] Coren: still getting cannot open directory .: Permission denied
[19:02:10] Cyberpower678: but if you don't trust webwatcher, you can switch back to default lighttpd config - and restart every couple of days by hand :P
[19:02:35] Betacommand: Hm. Probably because I fixed the permission on the wrong directory. :-) Try now?
[19:03:01] This is neither useful nor accurate.
[19:03:13] Coren: seems to be working
[19:03:36] Extension registration conversation happening now in #wikimedia-office
[19:03:44] aw
[19:13:41] how can I make my instance have a public IP?
[19:14:45] dMaggot: You need to allocate an IP from the address management system, on the sidebar, but that needs us to give you quota for it. What do you need it for?
[19:15:20] If it's only web, then you shouldn't use a public IP but use the proxy instead. Gives you https for free too.
[19:20:03] Coren: yes, it's only for web
[19:22:04] dMaggot: you can set that up with https://wikitech.wikimedia.org/wiki/Special:NovaProxy
[19:22:16] it will direct http and https to your instance.
[19:22:31] But you have to have port 80 open in the security rules… I'll do that for you now.
[19:24:04] dMaggot: ok, firewall is fixed so you should be able to set up a proxy now.
[19:24:47] andrewbogott: it works! that was easy :)
[19:25:11] Coren: I have funny things in my lighty access.log
[19:25:16] /enwp10/cgi-bin/list2.fcgi?run=yes&projecta=Children_and_young_adult_literature ....
[19:25:34] this is a logentry from tool enwp10 ...
[19:26:22] Coren: or is this a new feature? distributed logging ;-)
[19:41:18] hedonil is talking about xtools; other access.logs look fine.
[19:41:59] YuviPanda: Could you look into Redis to see what enwp10 maps to?
[19:42:05] (And why.)
[19:42:23] hedonil: That seems odd to me; it's entirely possible that wires got crossed because of latency while a webservice switched ports for another reason.
[19:42:43] tools.enwp10 has no webservice, so probably the Redis entry just wasn't removed?
[19:43:14] Coren: yeah, funny though
[19:43:26] scfc_de: That's theoretically not possible because the lifetime of the entry is directed by a particular file descriptor being kept open which doesn't survive the process dying.
[19:44:01] scfc_de: Though I suppose a bug is always possible.
[19:44:24] hedonil: Do you see a new request for /enwp10/ in your logs just now?
[19:44:44] Coren: yep, still going on
[19:44:58] Well, if I knew what Redis commands to use ... :-)
[19:45:02] 10.68.16.4 tools.wmflabs.org - [30/May/2014:19:44:27 +0000] "GET /enwp10/cgi-bin/list2.fcgi?run=yes&pr
[19:45:13] I've restarted the enwp10 webservice which has then replaced the entry.
[19:45:54] It's mostly harmless if that happens, thanks to the path being passed unedited - but I'll keep an eye out for this happening again. That might be an oddity caused by a Redis restart, or something like that.
[19:46:05] Coren: seems to be fixed now
[19:46:30] Yes, starting the webservice will have overwritten the key-value pair with new, correct values.
[19:50:15] yeah, shouldn't be happening
[19:50:28] scfc_de: proxylistener has logs
[19:50:47] scfc_de: /var/log/proxylistener
[19:51:04] scfc_de: so you can use that to check
[19:51:06] * YuviPanda goes off
[19:58:16] I see that, if I wanted to have PHP5 in my instance I can install it through apt-get
[19:58:26] but is there a way to do that through the configure page?
[19:58:34] I tried adding that role that says php5-mysql
[19:58:49] I don't think it made a difference (or do I need to wait?)
[19:59:27] and I probably mean php cli, it's probably already able to serve PHP
[20:15:20] hi. how can I easily transfer a file (an image) from my local computer to tools lab?
[20:15:54] rohit-dua: scp works
[20:16:40] nosy: well I never used scp... thank you for letting me know.
[20:17:10] rohit-dua: scp somefile user:remotemachine:/tmp/
[20:17:24] if you are on Windows, WinSCP
[20:17:59] eh, that first : should be an @, sorry
[20:18:31] mutante: does it work over ssh? as I don't enter my credentials...
[20:18:40] andrewbogott: What exactly is dnsmasq used for? A central proxy in front of pdns, i.e. instance => dnsmasq => pdns?
[20:18:59] rohit-dua: yes
[20:19:15] scfc_de`: It's used by OpenStack to keep track of dns between labs instances.
[20:19:34] I only partially grasp the relationship between dnsmasq and pdns, I may be mistaking one bit for the other.
[20:19:43] rohit-dua: based on ssh https://en.wikipedia.org/wiki/Secure_copy
[20:20:01] scfc_de: sorry, colloquy crashed and I lost my scrollback
[20:20:59] andrewbogott: Said nothing more than: "What exactly is dnsmasq used for? A central proxy in front of pdns, i.e. instance => dnsmasq => pdns?" :-)
[20:21:08] and you got my answer?
[20:22:24] andrewbogott: Nope.
[20:22:39] scfc_de`: dnsmasq is what OpenStack uses to keep track of local instance dns.
[20:22:57] So for instance a standalone instance name (without the domain)
[20:22:57] andrewbogott: fyi, switching virt0 over to admin yaml.. not expecting anything to happen except that then you can finally connect as non-root if you want
[20:22:57] I think that's updated live by OpenStack in dnsmasq
[20:23:03] but I don't understand quite how the bits fit together.
[20:23:14] mutante: ok, thanks for the warning
[20:25:16] andrewbogott: it finished without puppet error and created all the accounts. try this if you like: ssh andrew@virt0 then sudo -s afterwards
[20:25:31] andrewbogott: a) I got disconnected as well -- so that confused me. b) That sounds even more confusing, as I would have expected the LDAP stuff to be authoritative. c) But I just noticed: "nslookup tools-exec-cyberbot.eqiad.wmflabs" now resolves!
[20:25:47] scfc_de`: c) ok...
[20:27:34] mutante: i did this> scp robot_bub.png rohit-dua@tools-login.wmflabs.org:~ how do i get it into the tool folder.. ie. the folder i get into after become toolname
[20:28:33] rohit-dua: after you become toolname, and then type "pwd", what does it say
[20:28:57] you just need to replace ~ with the path
[20:29:00] andrewbogott: Is 10.68.16.1 a real machine or a network alias?
[20:29:09] rohit-dua: Use /data/project/YOURTOOL/...
[20:29:35] thank you.. i got it. :))
[20:29:49] scfc_de`: network card 4 in VLAN 1102 on labnet1001
[20:30:44] scfc_de`: I kind of thing that's a router
[20:30:46] *think
[20:30:57] scfc_de`: why do you ask?
[20:32:03] andrewbogott: it's on labnet1001 itself
[20:32:19] Oh, that makes sense, inasmuch as labnet1001 acts as the router for instances
[20:32:21] or is it.. ehmm
[20:32:39] andrewbogott: Because that's what the instances query for DNS. So I'm interested where that host gets its information from.
[20:33:10] Oh, I see. Yeah, I suspect that's dnsmasq running on labnet1001
[20:34:49] Is that "that dnsmasq" connected with OpenStack, or just a caching one?
[20:35:48] looks like just caching
[20:40:07] mutante: so I can log in as myself on virt0 but not virt1000, is that what you'd expect?
[20:40:40] andrewbogott: yes, virt1000 is happening while we speak, about to merge
[20:40:43] ok
[20:40:52] wanted to see it work on a single one first
[20:40:55] scfc_de: a fairly painless explanation, here: http://docs.openstack.org/grizzly/openstack-compute/admin/content/dnsmasq.html
[20:41:34] scfc_de: if I read that right, it sounds like ldap/pdns is canonical and dnsmasq just caches.
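The scp thread above (mutante's corrected syntax plus the `/data/project/YOURTOOL` destination) condenses to the following sketch. The user, host, filename, and path layout appear in the conversation; `mytool` is a hypothetical tool name standing in for whatever `become toolname` gives you:

```shell
# Build the scp destination the way the log describes: user@host, then the
# tool's directory under /data/project. "mytool" is a placeholder.
LOCAL_FILE="robot_bub.png"
REMOTE="rohit-dua@tools-login.wmflabs.org"
TOOL="mytool"
DEST="${REMOTE}:/data/project/${TOOL}/"
echo "scp ${LOCAL_FILE} ${DEST}"     # the command you would run
# scp "${LOCAL_FILE}" "${DEST}"      # uncomment to perform the actual copy
```

Copying straight into `/data/project/<tool>/` avoids the second hop of first landing the file in your home directory and then moving it after `become toolname`.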
[20:41:46] Wikimedia Labs / Infrastructure: tools-exec-cyberbot does not resolve - https://bugzilla.wikimedia.org/65948#c2 (Tim Landscheidt) NEW>RESO/FIX Now it works again: | scfc@tools-login:~$ nslookup tools-exec-cyberbot | Server: 10.68.16.1 | Address: 10.68.16.1#53 | Non-authoritative...
[20:41:51] Except I'm not sure how that works for unqualified names? Since I don't think they're in ldap
[20:42:36] Ah, okay, so the instances get their IP per DHCP. Totally forgot about that.
[20:46:04] andrewbogott: AFAIK, dnsmasq qualifies names before it queries with what it believes is the default configured domain.
[20:46:31] Coren: Ah, ok, so that would mean that pdns really is always canonical.
[20:46:32] scfc_de: all I can guess is that restarting pdns 'fixed' it but there was a 60-min cache entry in dnsmasq? That's just a guess.
[20:47:03] remembering when we still used pdns in production
[20:47:11] Yeah, dnsmasq does some caching, and is really crappy at it. Also it's poorly reentrant and crumbles under the lightest of loads.
[20:47:16] we had to restart after like 30% of the changes
[20:47:26] since we switched to gdnsd.. never
[20:47:58] andrewbogott: So the lesson is "patience"? :-)
[20:48:17] maybe? I really don't know what happened, just that restarting pdns is the only thing I changed
[21:45:44] !log deployment-prep Restarted uwsgi on deployment-graphite
[21:45:46] Logged the message, Master