[00:02:32] wow, thanks for all those speedy patches andrewbogott_afk [=
[00:44:38] !log deployment-prep Rebooting deployment-solr, jetty (or java?) is FUBAR
[00:44:43] Logged the message, Master
[09:12:01] !log tools Added aude to lolrrit-wm maintainers group
[09:12:03] Logged the message, Master
[09:16:58] yay!
[12:40:00] !ping
[12:40:00] !pong
[14:40:20] I wanted to run a script on beta today to update CirrusSearch but can't connect to deployment-bastion any more. is there a new place to connect?
[15:11:47] Change on mediawiki a page Wikimedia Labs/Tool Labs/List of Toolserver Tools was modified, changed by Coet link https://www.mediawiki.org/w/index.php?diff=820044 edit summary: +BotReversor
[15:27:35] Change on mediawiki a page Wikimedia Labs/Tool Labs/List of Toolserver Tools was modified, changed by Coet link https://www.mediawiki.org/w/index.php?diff=820051 edit summary: /* Active Tools on the Toolserver */
[16:09:35] Hello
[16:09:52] I have a tool account now
[16:10:02] How can I run an IRCbot on labs?
[16:41:50] Change on mediawiki a page Wikimedia Labs/Tool Labs/List of Toolserver Tools was modified, changed by Coet link https://www.mediawiki.org/w/index.php?diff=820083 edit summary: /* Active Tools on the Toolserver */
[17:17:35] Coren: do you know if anyone is managing /shared/pywikipedia/ ? How often it is updated (if it is)
[17:22:46] alchimista: I think that's Magnus Manske
[17:27:24] HexaCore: Still looking for a bot?
[18:15:39] Coren: In?
[18:18:50] multichill: Just back from lunch. What's up?
[18:30:12] can any sysadmin of Labs complete https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Coet , please?
[18:35:14] coet|cawiki: should be all set now.
[18:35:24] thanks
[18:39:29] Ryan_Lane: I have two questions about account management… first, is it still the case that there's a special procedure needed for giving labs accounts to pre-labs SVN users?
[18:49:50] ehm, I lost projectadmin access to deployment-prep and I need it to fix an instance (or recreate it):)
[18:59:56] MaxSem, do you want me to give you admin or just make the change for you?
[19:00:27] andrewbogott, I think I could use access for myself
[19:01:30] MaxSem: ok, done
[19:01:34] thanks!
[19:23:13] Noooo! Not MaxSem! The project is dooooooomed! :-)
[19:24:08] IM IN UR PROJECTS STEALING UR PRESHEZ CERTS
[19:31:09] MaxSem: you mean keys?
[19:31:10] :)
[19:31:16] the certs are public!
[19:46:30] How to run bot via tools wmf labs ?
[19:47:23] Kolega2357: That's a very open-ended question. It depends greatly on the kind of bot, and how it's written.
[19:47:55] Coren do you know to run IRC bot via wmf labs?
[19:48:25] Kolega2357: If it's a continuously running bot, it's generally as simple as doing a 'jstart ' from the tool account.
[19:48:29] !toolsdoc
[19:48:29] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[19:48:45] That URL ^^ points to lots of documentation on how to do that.
[20:19:52] can someone kill reza's python process, it's eating >90% CPU on tools-login
[20:20:21] anyone around know how to connect to beta? I'm getting rejected when I `ssh deployment-bastion.pmtpa.wmflabs`.
[20:21:47] manybubbles: me too: cmcmahon@bastion1:~$ ssh deployment-bastion
[20:21:47] ssh: Could not resolve hostname deployment-bastion: Name or service not known
[20:22:24] lbenedix: Tsk, tsk. Bad reza
[20:22:42] python test.py
[20:22:48] good name for a script ;)
[20:23:11] Actually, the bigger problem is elph. I've warned him before.
[20:24:06] running jobs on the grid thingy is really easy...
[20:24:17] I know it is.
[20:25:07] lbenedix: Oops. Sorry, I misaimed the message at you. :-)
[20:25:18] manybubbles: I don't know why DNS would not resolve deployment-bastion from bastion.wmflabs.org, but that seems to be my experience
[20:27:18] chrismcmahon: If I go directly to the ip it works....
[20:27:43] chrismcmahon: who do I complain about this to? was it hashar?
[20:29:18] I'm not sure if hashar would know, Coren, any ideas on DNS failing for deployment-* hosts?
[20:29:50] Is it failing or are you just using the wrong implied fqdn?
[20:30:12] Hm. Bastions should work.
[20:33:35] Coren: it isn't resolving deployment-bastion from tampa's bastion
[20:36:46] Ryan_Lane: Around? I've got a question about AuthLDAPBindPassword in labs
[20:38:37] !log deployment-prep updating search indexes in labs
[20:38:43] Logged the message, Master
[20:39:12] !log logstash ElasticSearch running from deb package
[20:39:14] Logged the message, Master
[20:39:37] !log logstash Logstash running via upstart script
[20:39:39] Logged the message, Master
[20:41:09] bd808: cool.
[20:41:31] manybubbles: It's getting close to be sort of useful
[20:42:04] I need to find the ldap bind password so I can secure the interface
[20:43:11] bd808: which interface?
[20:43:22] you shouldn't do LDAP auth from within Labs
[20:43:22] Ryan_Lane: logstash
[20:43:32] ideally we'd have OpenID for this
[20:43:43] otherwise you're accepting people's passwords
[20:43:55] hm. let me ask the platform team about openid
[20:44:06] Ryan_Lane: Oh. Umm… ok.
[20:44:06] maybe I can put the provider on wikitech
[20:45:00] I was just trying to put "AuthBasicProvider ldap" on the apache so the log data that's flowing in there now isn't wide open
[20:45:19] you have private logs flowing into labs?
[20:45:33] * Ryan_Lane gets his paddle out
[20:45:50] Ryan_Lane: I have whatever ori-l started sending me via udp2log
[20:45:52] sending private data into labs? that's a paddlin'
[20:46:16] I'm not sure where it's from exactly
[20:46:16] please stop that stream, if the data is private
[20:47:21] Ryan_Lane: I'll double check with Ori.
[20:47:35] * bd808 doesn't want a beating
[20:47:53] :D
[20:54:16] Coren: multichill insists that i should be able to connect to tools-db
[21:15:54] jeremyb: Sorry, was in a meeting. I can haz conteckts?
[21:18:11] Coren: see http://tools.wmflabs.org/wm-bot/logs/index.php?start=11%2F14%2F2013&end=11%2F14%2F2013&display=%23wikimedia-wikidata
[21:18:20] multichill: look up :)
[21:21:05] Ah, I see what he means. Once upon a time, /user/ accounts had access to tools-db. Not anymore.
[21:22:05] multichill
[21:23:19] He may have meant using the nara tool account.
[21:23:55] nah, i asked
[21:24:26] 14 20:29:56 < jeremyb> as a tool or a user?
[21:24:26] 14 20:30:03 < multichill> either
[21:44:14] !log logstash Added basic auth protection for logstash interface. Will replace with openid when available.
[21:44:17] Logged the message, Master
[21:48:59] bd808: what's feeding into there now?
[21:49:03] logstash that is
[21:49:27] jeremyb: Ori pointed the log2udp stream from beta at it
[21:49:45] cool
[22:22:20] does beta have proper job runners?
[22:33:34] manybubbles: yes
[22:33:46] deployment-jobrunner08.pmtpa.wmflabs
[22:33:52] and there is a videoscaler as well
[22:33:53] hashar: oh hi -
[22:34:04] I looked there and it wasn't running anything at the time
[22:34:08] let me check again
[22:34:13] might be broken :-(
[22:34:41] hashar: did you know that I can't ssh into deployment-jobrunner8 unless I do it by ip?
[22:34:53] some of the things in deployment- are like that now....
[22:34:59] ahh
[22:35:07] so I am not alone
[22:35:22] can't access anything, so I guess something is broken in labs for deployment-prep project
[22:35:41] do you know who to bother about that?
[22:35:54] maybe (if I'm lucky) the two problems are related
[22:36:01] yeah no dns entry
[22:36:02] grmblbl
[22:36:37] Coren: andrewbogott: we lost the DNS entry of instances in the deployment-prep project
[22:36:55] * andrewbogott looks
[22:36:56] actually it looks like deployment-jobrunner8 doesn't let me in even via ip
[22:37:04] "lost"?
[22:37:06] hashar@bastion1:~$ dig +short deployment-bastion.pmtpa.wmflabs
[22:37:06] hashar@bastion1:~$
[22:37:29] though it works for instance in another project (i.e. integration: integration-pbuilder.pmtpa.wmflabs resolves )
[22:37:48] are the public addresses all still working?
[22:38:03] looking
[22:38:07] hashar: Actually, you seem to have lost the /instance/
[22:38:13] oh really
[22:38:25] is one of the bare metal box dead ? :/
[22:39:03] andrewbogott: public addresses works
[22:39:12] Coren: I'm logged into deployment-bastion right now. just via ip
[22:39:22] deployment-jobrunner8 though, no good
[22:39:31] Ah, the instance isn't named 'deployment-bastion'
[22:40:00] And I see deployment-jobrunner08, not deployment-jobrunner8
[22:40:39] Which instance is deployment-bastion meant to be?
[22:40:42] deployment-bastion.pmtpa.wmflabs
[22:40:59] let me find the id
[22:41:08] i-00000390
[22:41:09] Ah, I see it.
[22:41:11] Coren: 10.4.0.58
[22:41:36] deployment-jobrunner08.pmtpa.wmflab doesn't let me in either.
[22:41:55] same issue with deployment-apache32.pmtpa.wmflabs i-0000031b 10.4.0.187
[22:42:03] err wrong ip
[22:42:09] ping deployment-jobrunner08
[22:42:09] PING deployment-jobrunner08.pmtpa.wmflabs (10.4.1.30) 56(84) bytes of data.
[22:42:10] 64 bytes from deployment-jobrunner08.pmtpa.wmflabs (10.4.1.30): icmp_req=1 ttl=64 time=2.11 ms
[22:42:27] And I can connect to it.
[22:42:55] yeah it closes the connection after the motd :/
[22:43:00] Connection to deployment-jobrunner08.pmtpa.wmflabs closed.
[22:43:11] hashar: Ah; that's a different kettle of fish.
[22:44:12] No home on jobrunner08.
[22:44:14] (!)
[22:44:24] That box hasn't been rebooted since the NFS move! :-)
[22:44:29] ohhh
[22:44:46] and I thought I rebooted them all grblblb
[22:44:57] mind rebooting it ?
[22:45:05] 22:44:07 up 90 days, 14:11, 1 user, load average: 0.07, 0.09, 0.07
[22:45:06] or I can click in the interface and hope
[22:45:07] Sure.
[22:45:20] Rebooting.
[22:45:23] thx
[22:46:17] manybubbles: so hmm, the jobs were not being processed :-]
[22:46:32] Coren: the next issue is some instances that lack an entry in DNS :/
[22:46:42] The issue with the names missing from DNS is a different thing though. andrewbogott, do you have an idea where to look?
[22:46:56] I'm looking now… doesn't make much sense to me though
[22:47:10] deployment-apache32 i-0000031a ACTIVE 10.4.0.166
[22:47:10] deployment-apache33 i-0000031b ACTIVE 10.4.0.187
[22:47:11] deployment-bastion i-00000390 ACTIVE 10.4.0.58
[22:47:16] they are in LDAP aren't they ?
[22:47:23] yes, they are.
[22:47:26] maybe the DNS <--> LDAP connection died
[22:47:32] and the entry simply expired
[22:47:36] I just restarted both, we'll see what happens in a minute or two.
[22:47:40] Coren: oh
[22:47:47] I've been messing with LDAP
[22:47:55] I thought I checked pdns after I did, though
[22:47:57] * Coren blames Ryan_Lane
[22:48:01] hashar@bastion1:~$ dig +short deployment-apache33.pmtpa.wmflabs @labs-ns1.wikimedia.org
[22:48:02] 10.4.0.187
[22:48:03] hashar@bastion1:~$ dig +short deployment-apache33.pmtpa.wmflabs @labs-ns0.wikimedia.org
[22:48:03] if you restart LDAP, you need to restart pdns as well
[22:48:03] hashar@bastion1:~$
[22:48:08] so labs-ns1 got them, not labs-ns0
[22:48:23] Ryan_Lane: yep, I restarted services opendj and pdns, in that order.
[22:48:44] can't you make the LDAP service to always restart DNS as well when it is restarted?
[22:49:56] additionally, bastion could use a second nameserver:
[22:49:57] bastion1:~$ grep nameserver /etc/resolv.conf
[22:49:57] nameserver 10.4.0.1
[22:49:59] $
[22:50:00] :(
[22:51:02] hashar: restarting one could also break the other
[22:51:05] pdns kind of sucks
[22:51:55] do we have a way to monitor each labs-ns### to make sure they serve queries?
[22:52:12] or it only impacted a few entries and thus it is barely monitor able ?
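(Editor's note) The monitoring question above, whether each labs-ns server is actually serving the same answers, could be approached by asking every nameserver for the same set of hostnames and flagging any host where the answers differ. The following is a minimal sketch of that idea only. The function name `find_disagreements` and the `lookup` callable are hypothetical, and the demo uses canned answers copied from the dig output pasted in the channel rather than live DNS queries.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical type: a function that asks one nameserver about one hostname
# and returns the set of addresses it answered with (empty set = no answer).
Lookup = Callable[[str, str], FrozenSet[str]]


def find_disagreements(
    hostnames: Iterable[str],
    nameservers: Iterable[str],
    lookup: Lookup,
) -> Dict[str, Dict[str, FrozenSet[str]]]:
    """Return {hostname: {nameserver: answer}} for hosts whose answers differ."""
    servers = list(nameservers)
    problems: Dict[str, Dict[str, FrozenSet[str]]] = {}
    for host in hostnames:
        answers = {ns: lookup(host, ns) for ns in servers}
        # More than one distinct answer means the servers are out of sync.
        if len(set(answers.values())) > 1:
            problems[host] = answers
    return problems


if __name__ == "__main__":
    # Canned answers mirroring the dig output in the log; a real check would
    # replace fake_lookup with an actual DNS query against each server.
    canned = {
        ("deployment-apache33.pmtpa.wmflabs", "labs-ns1.wikimedia.org"):
            frozenset({"10.4.0.187"}),
        ("deployment-apache33.pmtpa.wmflabs", "labs-ns0.wikimedia.org"):
            frozenset(),
    }

    def fake_lookup(host: str, ns: str) -> FrozenSet[str]:
        return canned.get((host, ns), frozenset())

    report = find_disagreements(
        ["deployment-apache33.pmtpa.wmflabs"],
        ["labs-ns0.wikimedia.org", "labs-ns1.wikimedia.org"],
        fake_lookup,
    )
    for host, answers in report.items():
        for ns, addrs in sorted(answers.items()):
            print(host, ns, sorted(addrs) or "no answer")
```

Comparing the servers against each other, rather than against a fixed expected address, would catch the "only a few entries are affected" case raised above, provided the list of hostnames checked is broad enough to include the affected ones.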
[22:53:21] manybubbles: deployment-jobrunner08.pmtpa.wmflabs is back up
[22:53:36] hashar: yeah, I'm on it but it isn't doing anything :(
[22:53:44] so it is broken somehow
[22:53:46] upgrading packages
[22:54:04] !log deployment-prep upgrading packages on -jobrunner08
[22:54:05] top told me so
[22:54:09] Logged the message, Master
[22:54:32] manybubbles: we also have a full debug log somewhere under /home/wikipedia/logs
[22:54:43] which show the runjobs.log or something similar
[22:55:21] hashar: missing!
[22:56:40] will rerun puppet
[22:57:43] andrewbogott: have you rebooted pdns / ldap?
[22:58:22] manybubbles: the machine has some puppet class applied, maybe a change in operations/puppet broke it for labs
[22:58:24] :(
[22:58:28] hashar: yes, on virt0.
[22:58:46] still have no response from: dig +short deployment-bastion.pmtpa.wmflabs @labs-ns0.wikimedia.org
[22:59:12] nor on 10.4.0.1 (same machine maybe)
[23:02:19] Ryan_Lane: are you looking at this? and/or do you have thoughts?
[23:02:21] hm
[23:02:25] I get responses for bastion
[23:02:39] I don't for anything else
[23:02:40] Most entries are working fine, just not deployment-login
[23:02:57] oh
[23:03:00] we don't have a bastion1 :D
[23:03:03] ignore me
[23:03:13] weird
[23:03:20] and for that one: $ dig +short deployment-apache33.pmtpa.wmflabs @labs-ns0.wikimedia.org
[23:04:05] are labs-ns0.wikimedia.org and labs-ns1.wikimedia.org virt0 and virt1000, respectively?
[23:04:11] yep
[23:05:33] manybubbles|away: bah puppet is broken on deployment-jobrunner08 : err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find node 'i-000004ff.pmtpa.wmflabs'; cannot compile :-]
[23:05:38] there's no entry for it
[23:05:48] there's no deployment-bastion.wmflabs.org
[23:06:00] ah
[23:06:01] hahaha
[23:06:19] works for me
[23:06:26] dig @labs-ns0.wikimedia.org deployment-bastion.pmtpa.wmflabs
[23:06:30] dig @labs-ns1.wikimedia.org deployment-bastion.pmtpa.wmflabs
[23:08:13] labs-ns1 gives me 10.4.0.58 as expected
[23:08:20] Ryan_Lane: I don't understand… I'm logged into bastion-restricted1, trying to ping deployment-bastion
[23:08:22] shouldn't that work?
[23:08:30] nope
[23:08:37] the nxdomain is cached
[23:08:48] you need to kill -HUP dnsmasq on virt2
[23:09:14] ok, back up...
[23:09:26] \O/
[23:09:29] why would this have stopped working? Or do you think it never worked?
[23:09:35] make sure to write that down somewhere on wikitech :-]
[23:09:42] it's because I was making LDAP changes yesterday
[23:09:49] and I restarted LDAP
[23:09:54] which causes pdns to lose connection
[23:09:57] and it doesn't reconnect
[23:10:04] because pdns actively hates people
[23:10:17] Ah, and restarting pdns would help, except in the meantime it's cached a bunch of brokenness?
[23:10:37] yep. for the length of the negative dns cache
[23:10:49] we should likely have the negative ttl set low
[23:10:52] no clue what it's set to
[23:10:57] So what does kicking dnsmasq on virt2 do? That's a part of the story I don't understand.
[23:11:06] dnsmasq is a recursor
[23:11:13] so it caches the lookup
[23:11:33] that's the negative cache I'm talking about
[23:11:50] and it's on virt2 because...
[23:11:56] it's the network node
[23:12:12] dnsmasq does dns recursing, and dhcp
[23:12:38] heading bed, have a good DNS session!
[23:13:05] technically all dnsmasq is doing as a recursor is sending the request forward to our normal recursor
[23:13:17] but it still caches requests as they pass through
[23:13:39] gotta run. back in a bit
[23:13:48] ok. I HUPped those two processes, no results so far
[23:14:09] you may also need to purge the cache for the record on our primary recursor
[23:14:13] which may still be pdns
[23:14:17] paravoid: ^^ ??
[23:14:22] I have to go... back in a while
[23:22:52] Ryan_Lane: Do I need to restart a box when I add someone to a project and give them project admin?
[23:32:40] TParis: you shouldn't.
[23:32:56] And, indeed, project admin doesn't affect what people can do /on/ a box anyway… unless you're doing something fancy with sudo rules.
[23:35:05] ok
[23:35:07] thanks
[23:40:01] umm, andrewbogott, I restarted the instance and apache isn't finding my files for some reason
[23:40:05] Why would it do that
[23:40:49] Um… I'm not great at debugging apache problems, but probably it was running with a cached config and restarting Apache caused it to pick up a different config
[23:41:18] uh ohh, I don't know
[23:41:50] wonder if it's a permissions issue
[23:41:52] probably I think
[23:43:08] what's up?
[23:43:33] utrs.wmflabs.org
[23:43:55] sorry, andrewbogott: what's the DNS issue?
[23:44:32] A few entries got cached as nxdomain
[23:44:39] and now I'm totally unable to get things going again
[23:45:01] Ryan suggested that i HUP dnsmasq on virt2, which doesn't seem to've done anything
[23:45:01] which domains from which hosts?
[23:45:12] for example deployment-bastion.pmtpa.wmflabs
[23:45:25] from?
[23:45:45] from everywhere, but, e.g. bastion-restricted1
[23:46:30] ok, looking
[23:46:46] thanks
[23:50:27] labs-ns0 responds NXDOMAIN
[23:51:06] Which is presumably cached, since the host is in ldap and ns1 can see it
[23:51:21] (and Ryan was monkeying with ldap yesterday which would cause… things like that.)
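(Editor's note) The negative caching discussed above, where a recursor keeps answering "no such name" after the origin has been fixed, until its cache is flushed or the negative TTL runs out, can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of how dnsmasq or pdns are implemented: the class `CachingRecursor`, its parameters, and the TTL values are all hypothetical, and `flush()` merely stands in for the cache purge that the HUP was meant to trigger.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachingRecursor:
    """Toy recursor that caches both positive and negative answers."""

    def __init__(
        self,
        upstream: Callable[[str], Optional[str]],
        negative_ttl: float = 60.0,   # hypothetical value, in seconds
        positive_ttl: float = 300.0,  # hypothetical value, in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream
        self._negative_ttl = negative_ttl
        self._positive_ttl = positive_ttl
        self._clock = clock
        # name -> (answer or None for "no such name", expiry time)
        self._cache: Dict[str, Tuple[Optional[str], float]] = {}

    def resolve(self, name: str) -> Optional[str]:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None:
            answer, expires_at = cached
            if now < expires_at:
                # Served from cache, even if the origin has changed since.
                return answer
        answer = self._upstream(name)
        ttl = self._positive_ttl if answer is not None else self._negative_ttl
        self._cache[name] = (answer, now + ttl)
        return answer

    def flush(self) -> None:
        """Drop every cached answer, positive and negative."""
        self._cache.clear()


if __name__ == "__main__":
    origin: Dict[str, str] = {}  # the authoritative data, initially broken
    recursor = CachingRecursor(origin.get)
    name = "deployment-bastion.pmtpa.wmflabs"

    print("origin broken:", recursor.resolve(name))
    origin[name] = "10.4.0.58"  # origin repaired
    print("origin fixed, cache stale:", recursor.resolve(name))
    recursor.flush()
    print("after flush:", recursor.resolve(name))
```

The model also shows why a short negative TTL limits the damage: with a small `negative_ttl` the stale "no such name" answer would age out on its own shortly after the origin is repaired, without anyone having to flush the cache by hand.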
[23:51:59] before you purge the cache, you need to fix the origin
[23:52:08] i.e. labs-ns0 == virt0
[23:52:32] and it's not pdns not reconnecting to LDAP
[23:52:36] it's probably the LDAP replication being broken
[23:52:38] andrewbogott: Got it fixed, you were right
[23:52:43] TParis: cool
[23:53:11] paravoid: ok… now you've said lots of things that I don't quite follow.
[23:53:19] pdns is picking up the ldap entries /on/ virt0, right?
[23:53:33] sorry
[23:53:35] so
[23:53:37] I see that the host entry is correct on ldap, on virt0… I think?
[23:53:38] we have two NS
[23:53:42] labs-ns0 == virt0
[23:53:46] and labs-ns1 == virt1000
[23:53:51] yep
[23:54:01] root@virt2:~# dig +short deployment-bastion.pmtpa.wmflabs @labs-ns0.wikimedia.org
[23:54:04] root@virt2:~# dig +short deployment-bastion.pmtpa.wmflabs @labs-ns1.wikimedia.org
[23:54:07] 10.4.0.58
[23:54:09] so labs-ns0 doesn't have it
[23:54:13] yep
[23:54:22] virt2 runs a dnsmasq which would cache responses coming from either of these two
[23:54:40] so *after* we fix the origin (labs-ns0/1) we'll also need to purge virt2's recursor cache
[23:54:57] OK.
[23:55:06] but before we fix the cache, we need to fix the original problem
[23:55:12] So that would mean… restarting pdns on virt0 and then kicking dnsmasq on virt2?
[23:55:45] sorry, should let you finish :)
[23:56:11] one of the recurring issues is that pdns doesn't reconnect to ldap if e.g. ldap restarts
[23:56:26] so then you need to restart pdns on virt0 and of course kick dnsmasq on virt2 to purge caches, as you said
[23:56:31] that doesn't seem to be the case here though :(
[23:56:39] I restarted it and made no effect
[23:56:53] Yep, I've already tried, twice
[23:57:00] now, if I remember correctly, pdns on each of (virt0, virt1000) uses LDAP for domains
[23:57:03] a *local* LDAP
[23:57:12] and then the two LDAPs replicate with each other
[23:57:43] so my guess is, replication is broken
[23:58:08] I've done a local ldapsearch on virt0 and verified that the host entry is present.
[23:58:28] I just did the same and verified the opposite
[23:58:32] unless I did it wrong
[23:58:34] oh!
[23:58:35] what did you do?
[23:58:37] * andrewbogott checks again
[23:58:49] what did you run exactly?
[23:58:59] ldapsearch -x -D cn=proxyagent,ou=profile,dc=wikimedia,dc=org -W -b "ou=hosts,dc=wikimedia,dc=org"
[23:59:22] I ran ldapsearch -x 'associatedDomain=deployment-bastion.pmtpa.wmflabs' aRecord