[00:00:10] chasemp: the merge conflict is probably still causing problems on wikimetrics, but I think what you did with git killed my commit there anyway [00:00:19] I tried grepping in git log and couldn't find it [00:00:32] sorry man I didn't remove anything? [00:00:35] check stash? [00:00:35] so I'd have to fix that manually anyway I think, for which I saved the passwords file [00:00:43] my first move would be a git stash i [00:00:46] nono, it's fine, I can fix it manually [00:01:08] oh, no, nothing would be in the stash, we just keep a local commit as the HEAD with whatever custom changes [00:01:17] I should've told you, that's my bad [00:01:37] but my point is, chasemp: you can do whatever you like on that server and I'll restore it back to normal [00:01:40] ok [00:01:45] so reset / pull / whatever [00:02:47] I'm in your server now dude [00:03:11] I know, I'm just following along with git status [00:03:24] all up in your server [00:03:31] lol [00:03:47] I'm stalking you, man, I'm not sure which one of us will get uncomfortable first [00:05:06] ok try to ssh and sudo now? [00:07:19] YuviPanda btw: https://phabricator.wikimedia.org/T120891 [00:07:42] works chasemp [00:07:49] thank you very much [00:07:54] I do not understand limn1 tbh atm [00:08:07] it's behind but pull says updated [00:08:10] I don't get it [00:09:44] chasemp: you tried a git pull --rebase? [00:10:14] milimetric: I gotta run in like 2 but try root@limn1.analytics.eqiad.wmflabs I dropped you in for the moment to do your thing [00:10:35] is this a thing that needs to figured out tonight? [00:11:12] chasemp: i'll poke at it, no worries [00:11:21] I'll grab someone else to help, have a good night [00:11:51] I'll check back in a few when I can but also you should be able to fix and run puppet as root for the moment [00:11:55] thanks [00:16:25] chasemp: it used to be horribly written, I rewrote it :D [00:17:51] chasemp: we should make it easy for people to drop their own keys into root [00:24:05] YuviPanda: limn1 has: Error 400 on SERVER: Could not find resource 'Exec[compile puppet.conf]' for relationship on 'Class[Puppetmaster::Ssl]' on node limn1.analytics.eqiad.wmflabs [00:24:30] the operations/puppet repo is up to date [00:24:53] 6Labs: 'virt1' entry at markmonitor? - https://phabricator.wikimedia.org/T102689#1864257 (10RobH) I asked Doni to provide the listing, since it seems impossible they are unable to change something and cannot tell why they cannot change it. Since this means someone has to grep every single MM domain for the info... [00:25:02] milimetric: are you applying any roles via the default site.pp? [00:28:04] YuviPanda: the only thing in the custom commit is "import 'passwords.pp'" [00:28:07] (in site.pp) [00:28:13] hmmm [00:28:47] git diff HEAD^ is basically that, the manifests/misc/limn1.pp file and a bunch of custom templates [00:29:25] so other self hosted puppetmasters work fine and this doesn't... [00:29:31] I'm not particularly sure what's going wrong here [00:29:41] well they don't work fine until puppet is run again [00:29:53] yah but that particular error has been going on for weeks [00:29:54] but when chase logged in and pulled with root, puppet ran on wikimetrics1 [00:30:02] oh really? it's still the same one? 
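The "killed" local commit discussed at the top of this log is usually still recoverable; a hedged sketch of how one might find and re-apply it on the instance's puppet checkout (the repo path and the `production` branch name come from later in this log; the grep term and placeholder SHA are hypothetical):

```bash
# On the instance's operations/puppet checkout (path as referenced later in this log).
cd /var/lib/git/operations/puppet
# Anything that was ever HEAD is still reachable from the reflog, even after a reset or merge.
git reflog --date=iso | head -40
git log --all --oneline --grep='passwords'      # grep term is just an example
# Re-apply the local commit on top of upstream, then keep rebasing over it on future pulls.
git cherry-pick <sha-from-reflog>
git pull --rebase origin production
```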
[00:30:10] never mind then, just ignore it, hasn't caused problems yet [00:30:17] sorry, thought it was something new [00:30:25] * milimetric out [00:30:26] :) [00:30:29] enjoy your night [00:30:42] milimetric: I just came to the office :D [00:30:45] chasemp: don't worry, everything's fine [00:30:46] lol [00:30:54] :D [00:30:54] enjoy your... breakfast then? [00:31:02] I did indeed have breakfast just now [01:14:40] YuviPanda, hey [01:14:46] I can't get into deployment-salt [01:14:49] or deployment-restbase01 [01:15:31] andrewbogott, ^ [01:15:54] am looking [01:15:58] at deployment-salt [01:16:12] YuviPanda: did you get chase’s earlier email about the ldap cert mismatch? [01:17:04] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class puppet::self::geoip for deployment-salt.deployment-prep.eqiad.wmflabs on node deployment-salt.deployment-prep.eqiad.wmflabs [01:17:06] Warning: Not using cache on failed catalog [01:17:08] Error: Could not retrieve catalog; skipping run [01:17:44] that’s not what I expected [01:18:08] oh, I guess broken puppet = not updating ldap = no logins [01:18:29] YuviPanda: so probably if you change /etc/ldap.conf you can get Krenair access if you don’t want to fix puppet yourself [01:18:36] yeah ok [01:18:59] andrewbogott: where do I find the new password? [01:19:34] should all be the same, just change ldap-eqiad to ldap-labs.eqiad and ldap-codfw to ldap-labs.codfw [01:19:47] ah ok [01:20:06] andrewbogott: ldap-labs.eqiad.wikimedia.org? [01:20:08] or .eqiad.wmnet? [01:20:16] moritz aliased the the old server names but of course the cert check doesn’t like that [01:20:24] ldap-labs.eqiad.wikimedia.org [01:20:32] just add -labs. to the existing names :) [01:22:37] andrewbogott: but existingone is ldap-eqiad.wikimedia.org [01:22:40] a dash not . [01:22:42] * YuviPanda tries [01:22:51] YuviPanda: i'm seeing this error on wikimetrics staging instance in labs - sudo: ldap_start_tls_s(): Connect error something going on? [01:22:58] oh i see ldap things on top [01:23:03] s/ldap-eqiad.wikimedia.org/ldap-labs.eqiad.wikimedia.org/g [01:23:06] andrewbogott: ok! [01:23:23] Krenair: try deployment-salt now [01:23:38] madhuvishy: probably the same change for you. Or better yet update your local puppet repo [01:23:41] I'm in [01:23:56] andrewbogott: okay [01:24:14] But I get the same error as madhuvishy [01:24:17] when trying to sudo [01:24:18] madhuvishy: sorry, I emailed the list about this but I think my email was confusing [01:24:37] and a password prompt [01:24:49] hm, sudo’s ldap server might be configured someplace else [01:24:58] andrewbogott: hmmm but i cannot git pull [01:25:03] uri ldap://ldap-eqiad.wikimedia.org:389 ldap://ldap-codfw.wikimedia.org:389 [01:25:03] andrewbogott: chasemp we need to get https://gerrit.wikimedia.org/r/#/c/222857/ back in somehow. allows people to add their own root keys... [01:25:56] madhuvishy: what instance and project? [01:26:04] i will get out of your way - let me know when you're done fixing the stuff you're already working on, my stuff can wait. [01:26:15] (wikimetrics-staging1.analytics) [01:26:34] Krenair: do you want me to fix any other project? [01:26:48] I don't think this is unbroken fully yet YuviPanda [01:27:23] Krenair: did you switch the uri in ldap.conf too? 
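The ldap.conf change being described is a plain hostname substitution; a minimal sketch, assuming the files that need it on a given instance are /etc/ldap.conf (also read by sudo-ldap/nss here) and /etc/ldap/ldap.conf:

```bash
# Swap the old LDAP server names for the new -labs ones, keeping .bak copies.
# Which files actually need the change varies per instance; the two paths below
# are assumptions based on the discussion above.
sudo sed -i.bak \
  -e 's/ldap-eqiad\.wikimedia\.org/ldap-labs.eqiad.wikimedia.org/g' \
  -e 's/ldap-codfw\.wikimedia\.org/ldap-labs.codfw.wikimedia.org/g' \
  /etc/ldap.conf /etc/ldap/ldap.conf
```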
[01:27:51] I can't, YuviPanda [01:27:56] You need root to do that [01:28:00] I can't sudo [01:28:00] oh [01:28:03] haha [01:28:05] ofco [01:28:17] I can't log in as root myself because puppet is broken, despite the key being in place on deployment-puppetmaster [01:28:25] right [01:28:27] let me do it [01:28:30] basically [01:28:35] everything is broken [01:28:40] what else is new [01:28:52] I bet puppet's been broken for weeks :) [01:28:56] and people notice when things like this happen [01:29:02] madhuvishy: you have a local patch that can’t be merged with upstream. May I set it aside for now? [01:29:20] andrewbogott: hmmm, i'd have to ask milimetric [01:29:27] * andrewbogott does it anyway [01:30:15] yay, puppet works now...I bet that ldap change you did fixed it YuviPanda [01:30:22] oh, nope, wtf? [01:30:31] Krenair: really? that should be completely unrelated to anything puppet touches [01:30:32] How did it update the root keys with my exact text and then fail a manual run? [01:31:18] Krenair: it reverted it back [01:31:23] to previous ldap.yaml settings [01:31:29] urgh [01:33:22] Krenair: boooom http://tools.wmflabs.org/watroles/role/puppet::self::geoip [01:33:30] why what AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA [01:33:33] why would someone do that [01:34:39] YuviPanda: mine’s more broken I bet [01:36:13] andrewbogott: https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=abb50762-93d0-4c9d-8853-adbbb6b56e00&project=deployment-prep&region=eqiad [01:36:16] can't configure pages too [01:37:24] Krenair: I removed it with ldapvi on terbium [01:37:35] hm, that’s interesting [01:37:44] YuviPanda: are those pages broken for everything? Or just that one instance? [01:37:52] andrewbogott: not sure. that's the first one I checked [01:37:56] Krenair: ok, should work now. [01:37:59] It's broken for a bunch of hosts [01:37:59] Krenair: puppet run updated everything [01:38:00] maybe all [01:38:12] someone else was complaining about it on wikidata-query ones earlier [01:38:14] also WHY WOULD SOMEONE INSTALL PUPPET::SELF::GEOIP ONTO A SALT MASTER AAAAA [01:38:43] Krenair: can you check deployment-saltmaster now? [01:39:32] madhuvishy: any idea where passwords::wikimetrics is supposed to come from? [01:39:38] Notice: /Stage[main]/Exim4/Service[exim4]/ensure: ensure changed 'stopped' to 'running' [01:39:39] It’s referenced in puppet but not defined anyplace. [01:39:47] Is... that supposed to be present on deployment-salt? [01:39:51] it completed successfully YuviPanda [01:39:51] "# passwords::wikimetrics is not a real class checked in to any repository." [01:39:56] how was that supposed to ever do anything? [01:40:10] andrewbogott: ha ha i have no idea [01:40:29] andrewbogott: I'm pretty sure that it comes from the commit you removed :D [01:40:33] andrewbogott, YuviPanda: Dec 08 23:15:27 interestingly enough https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=6978ff87-d776-45fd-9a2e-4412d3926add&project=wikidata-query&region=eqiad claims that host (wdq-beta) does not exist [01:40:41] wikimetrics and limn are done in a way where puppet will fail if you uncherrypick patches [01:40:48] YuviPanda: except shouldn’t my removed commit have also removed the /reference/ to it?
[01:40:48] Dec 08 23:15:32 also https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=8978fa1c-c37a-4b69-bef8-6bf049d56aac&project=wikidata-query&region=eqiad for db01 [01:40:49] etc. [01:40:50] I've told milimetric that if they die they die and we can not do much [01:40:56] andrewbogott: no, the reference is in ops/puppet :D [01:41:11] * andrewbogott is getting a bit stabby [01:41:13] I think at least [01:41:30] andrewbogott: IMO, we can't be held responsible for those 3 instances. they've been warned enough times. [01:41:39] yeah that still looks broken [01:41:45] we can probably add root keys and let analytics figure it out. [01:42:10] yeah, the configure issue seems more important anyway :| [01:42:23] Krenair: any other deployment-prep instance that needs rescuing? [01:42:28] andrewbogott: I'm sorry - there's a task to move away from self hosted puppet master - but it's super low priority and we've not gotten to it. but yeah, if i can get sudo back, we'll figure it out [01:42:37] I'm not aware of any others yet YuviPanda [01:42:41] Krenair: ok! [01:42:45] thanks [01:43:14] I'm going to try and fix the labs/private patch I showed earlier, making it easier for people to add their own root keys to their projects. [01:43:48] (03PS1) 10Yuvipanda: Revert "Revert "Allow addition of more root keys via hiera"" [labs/private] - 10https://gerrit.wikimedia.org/r/257808 [01:46:02] I can't log into labs bastion any more - known issue with ldap or the like? [01:46:03] madhuvishy: “My infrastructure could become unreachable at any minute” should not be a super low priority [01:46:26] andrewbogott: well we care very little about wikimetrics itself [01:46:29] andrewbogott: madhuvishy every time we change labs stuff one of us spends a few hours on those instances, so you're basically just externalizing costs to us :( [01:46:37] madhuvishy: I can delete the project for you in that case [01:46:39] it’s easy! [01:46:49] andrewbogott: well, i wish [01:46:59] andrewbogott: madhuvishy so I think I can declare that this is the last time we'll help, since we've been saying this for years :) and if it's that super low priority I guess it'll have to die. [01:47:12] gwicke: which bastion? [01:47:26] andrewbogott: ugh, DNS down? [01:47:28] man watching this puppet apply is like a walk down memory lane [01:47:38] andrewbogott: no, no, not DNS down, I just mistyped [01:47:40] no panicking [01:47:43] YuviPanda: yeah that makes sense. [01:48:07] madhuvishy: where's the task so I can declare that on the task? [01:48:36] YuviPanda: i'll make one. also this can wait. i'll check with milimetric too. [01:48:37] gwicke I see you have successfully logged in? [01:49:11] https://phabricator.wikimedia.org/T101763 [01:49:13] madhuvishy: ^ [01:49:16] nevermind, I just fixed it on my end [01:49:54] had changed keys, but somehow my config didn't pick up the right one [01:50:25] sorry for the noise ;/ [01:50:41] madhuvishy: https://phabricator.wikimedia.org/T101763#1864407 [01:51:13] gwicke: np :) [01:53:31] YuviPanda: ah this. okay, thanks. should i make one to fix this? [01:53:43] madhuvishy: what do you mean by 'this'? :) [01:54:35] 6Labs: Give madhuvishy and milimetric root on limn and wikimetrics instances - https://phabricator.wikimedia.org/T120900#1864414 (10yuvipanda) 3NEW [01:54:36] madhuvishy: ^ [01:54:40] YuviPanda: ah i thought i still couldn't sudo [01:54:44] i think i can now [01:54:50] madhuvishy: \o/ ok.
[01:55:12] madhuvishy: your /var/lib/git/operations/puppet is in a horrible state [01:55:17] but for a few minutes you should have access [01:55:26] I dont’ know what to make of all these submodule conflicts [01:55:31] until puppet reruns hmmm [01:57:29] madhuvishy: I mashed everything into your ‘do not merge’ patch [01:57:42] but that means that patch is probably even harder to reconcile with prod than before [01:58:13] andrewbogott: okay [01:58:23] git reflog probably has the older version :D [01:58:27] * YuviPanda <3s git reflog [01:58:32] andrewbogott: what do you think of the wikitech issue? [01:59:11] I’m not sure why that would break. I’ll have a look [01:59:32] andrewbogott: thanks! [02:00:11] YuviPanda: how recently have you seen that work? [02:01:25] (03PS2) 10Yuvipanda: Revert "Revert "Allow addition of more root keys via hiera"" [labs/private] - 10https://gerrit.wikimedia.org/r/257808 [02:01:25] andrewbogott: hmm, yesterday? [02:08:12] 6Labs: Give madhuvishy and milimetric root on limn and wikimetrics instances - https://phabricator.wikimedia.org/T120900#1864429 (10yuvipanda) a:3yuvipanda [02:11:51] YuviPanda: does it work anywhere? or nowhere? [02:14:26] andrewbogott: so if you open up novainstance for deployment-prep [02:14:35] andrewbogott: the latest two instances (kafka03 and 04) created today work [02:14:39] all the old ones do not [02:14:44] ok [02:14:45] so I suspect it works for all new ones and none of the old ones [02:14:56] I’m tempted to wipe memcached [02:15:25] yeah [02:15:38] andrewbogott: have we done that before? [02:15:45] not today [02:15:51] andrewbogott: no I mean, ever? [02:15:56] sure [02:15:59] ah cool [02:16:02] it just breaks wikitech sessions [02:16:03] I wasn't sure how mc would react [02:16:08] yeah but that's ok [02:16:27] It’s a stab in the dark though [02:17:03] all of life is stabs in darks [02:17:43] didn’t help [02:18:48] andrewbogott: on NovaInstance, all of the ones without a working configure also have a missing link to the nova resource page... [02:18:57] where do those come from? [02:21:17] the nova resource pages are made by the OpenStackManager extension and updated when you change stuff about the instance [02:21:20] yeah, I see that too it’s surely related [02:21:32] what’s weird is that I see the nova call to verify the instance succeeding right before it declares no such instance [02:34:04] YuviPanda: can we update php on exec nodes? [02:34:17] liangent: to what version? [02:34:21] the trusty nodes already have 5.5 [02:34:27] YuviPanda: to match login host at least [02:34:35] liangent: yes, if you do '-l release=trusty' they match [02:34:46] liangent: that isn't default because too manyt ools rely on default being release=precise :( [02:35:25] YuviPanda: what's the breakage they got? [02:35:45] liangent: tools that don't work with 5.5 [02:35:56] liangent: I think it removes some ancient mysql interfaces... [02:37:11] exactly the opposite case than mine [02:37:27] after updating some libs I found they no longer working in 5.3 [02:37:33] for using $this inside closures [02:39:10] YuviPanda: another issue: job 166594 stuck with error message “can't get password entry for user "tools.liangent-php". 
Either the user does not exist or NIS error!” [02:39:24] YuviPanda / andrewbogott / madhuvishy: ok, I'm back, but I'm a bit lost about the above [02:39:29] where we left this is: wikimetrics is working fine [02:39:55] limn1 has a problem due to something that's actually broken in puppet, not related to our self-hosting at all [02:40:16] and it's not fair to say *every* time anything changes you spend *hours* on *each* of these instances [02:40:18] that's a bit dramatic [02:40:25] for the most part, I'm the one who spends those hours [02:40:37] I do ask for help here and there, and I appreciate the help [02:40:46] but it's certainly not hours [02:41:18] and most of the time not related to our custom puppet code on there, but to breaking changes in puppet [02:41:33] so, summary: everything's fine, nobody needs to pay attention to these instances [02:42:04] except puppet continues to be broken on limn1 and I'd love someone to fix that at some point. If it's not fixed, the next time I need to sudo on that machine I'll just migrate everything to a new box instead [02:42:22] but that's this error: "Error 400 on SERVER: Could not find resource 'Exec[compile puppet.conf]' for relationship on 'Class[Puppetmaster::Ssl]' on node limn1.analytics.eqiad.wmflabs" and nothing we're doing [02:45:12] milimetric: when users ask us to help with access because they can’t ssh… should I not help and tell them ‘everything is fine’? [02:46:29] andrewbogott: yeah, in this case, because we know the risks associated with limn1 and wikimetrics1, I'd rather you tell us we're screwed instead of taking on extra work that you didn't sign up for [02:46:46] the way this is set up is my fault (well, not really, but I inherited the puppet setup) [02:47:08] that doesn’t sound the same as ‘everything is fine’ but ok :) [02:47:54] everything is "normal" i guess [02:48:59] milimetric: 'breaking changes in puppet' - we expect puppet to run and not have months of catching up to do. [02:49:19] milimetric: 'the next time I need to sudo no that machine I will migrate everything to a new box' sounds good to me :) [02:49:43] milimetric: I filed that task after the last time I helped look at it. I agree that you're the one spending most of that time, but you *are* blocked on us every time it breaks badly. [02:50:26] YuviPanda: yeah, but that's technical debt that my managers didn't choose to pay down, so I'm not worried about it [02:50:32] if they want to prioritize it, they will [02:51:04] YuviPanda: but puppet should be enabled to run automatically and just git pull --rebase on those boxes [02:51:08] milimetric: it is. [02:51:11] I login every once in a while to make sure that's happening [02:51:31] right, so it's only when there are breaking changes, but I never let it slip for months like that [02:51:37] milimetric: and the patches on top break rebases now and then and when andrewbogott got puppet to run again today it was applying a *very long list of chagnes* from a long tiem ago. [02:51:37] 'cause I see the shinken alerts [02:51:51] hm [02:52:06] ok, so we can do something different then [02:52:17] anyway, I think we all agree here. If analytics considers this super low priority, it's even lesser of a priority for us and we should explicitly ignore issues caused by puppet breakage in this kind of situation. 
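The update routine milimetric describes for those self-hosted puppetmaster boxes (keep the local commit on top, rebase over upstream, then run puppet) looks roughly like this; a sketch, not the exact script those instances run, with the repo path and branch name taken from elsewhere in this log:

```bash
cd /var/lib/git/operations/puppet
git fetch origin
# Replay the instance's local commit(s) on top of current upstream production.
git rebase origin/production    # if this stops on a conflict, resolve and 'git rebase --continue'
sudo puppet agent --test
```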
[02:52:19] we can just shut down limn1 and wikimetrics1 and tell people what's wrong so they can pay down the debt [02:52:27] it would kill all dashboards, piwik on labs, and wikimetrics [02:52:38] indeed, and then maybe it won't be low priority :) [02:52:52] or maybe they really are low priority and nobody would miss them [02:52:54] right :) I personally think it'll become medium - high pretty quickly :) [02:53:09] indeed, and if I were you that's what I would have done but you are a nicer person than me :D [02:53:43] ok, thx for the help, but don't spend any more time on it. I'll send a notice that this is what we're planning on doing [02:54:03] milimetric: I've filed a task to make sure additional people can get added to the root key on projects in the analytics project, and then I can consider ourselves truly out of your way and then you can deal with it as you see fit :) [02:54:10] and this is useful for other projects too anywya [02:55:45] (03PS3) 10Yuvipanda: Revert "Revert "Allow addition of more root keys via hiera"" [labs/private] - 10https://gerrit.wikimedia.org/r/257808 [02:55:55] I think ^ will work [02:57:30] 6Labs: ldap queries failing when looking up domain entries - https://phabricator.wikimedia.org/T120904#1864497 (10Andrew) 3NEW a:3Andrew [03:00:24] (03CR) 10Yuvipanda: [C: 032 V: 032] "Let' see how this goes!" [labs/private] - 10https://gerrit.wikimedia.org/r/257808 (owner: 10Yuvipanda) [03:01:55] Krenair: hey! want to help test ^? [03:01:59] ok [03:02:03] Krenair: we should add your key to root on deployment-prep [03:02:38] YuviPanda: Tsk, Krenair has more than enough to fix in VE right now. ;-) [03:07:01] James_F: tch tch :P [03:07:16] * James_F grins. Go for it. Krenair is awesome. [03:07:25] :) [03:07:35] Krenair: in https://wikitech.wikimedia.org/wiki/Hiera:Tools [03:07:40] you can see how valhallasw has his key added [03:08:06] yeah, I did something similar in deployment-prep [03:08:28] Krenair: you made it a list [03:08:31] Krenair: it should be a dictionary I think [03:08:39] Krenair: with the keys being your name and the value being the key [03:08:51] oh but it's supposed to be a hash [03:08:52] right [03:09:32] (03PS1) 10Yuvipanda: Move template to right place [labs/private] - 10https://gerrit.wikimedia.org/r/257819 [03:09:45] (03CR) 10Yuvipanda: [C: 032 V: 032] Move template to right place [labs/private] - 10https://gerrit.wikimedia.org/r/257819 (owner: 10Yuvipanda) [03:10:38] oops: "Could not retrieve catalog from remote server: Error 400 on SERVER: values(): Requires hash to work with" [03:11:08] heh [03:11:46] https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=222055&oldid=222051 [03:11:55] shouldn't that have fixed it? [03:13:30] Krenair: is that one straight line broken up by mw rather than two lines? [03:13:36] Krenair: also hiera is cached for like 30s or something [03:14:32] YuviPanda, done, it worked [03:14:44] it added my key to /etc/ssh/userkeys/root, login works [03:14:48] Krenair: \o/ awesome [03:15:28] Krenair: can you email releng@ or qa@ (or whatever ml is responsible for betacluster) to add more people to that list so they can login in cases of ldap / ssh failure? 
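The Hiera: page structure Yuvi describes above is a hash (person name → public key), not a list; a hypothetical sketch of what the wikitech Hiera page entry would look like. The lookup key name below is an assumption, since the actual key used by the labs/private patch isn't shown in this log, and the key material is placeholder text:

```yaml
# Hiera:Deployment-prep — hypothetical key name; one entry per person.
"ssh::extra_root_keys":
  krenair: "ssh-rsa AAAAB3... krenair@example"
  valhallasw: "ssh-rsa AAAAB3... valhallasw@example"
```

Giving it as a list instead of a hash is what produces the "values(): Requires hash to work with" compile error quoted below.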
[03:15:31] that's much easier than having to local hack it in on the deployment-prep puppetmaster [03:15:35] yeah [03:15:38] definitely [03:17:02] releng@ is the private team list which I'm not on [03:17:52] yep, it's qa@ according to https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep [03:19:35] 6Labs: ldap queries failing when looking up domain entries - https://phabricator.wikimedia.org/T120904#1864512 (10Andrew) Here is the (or at least one) record that we're failing to load: # eqiad, hosts, wikimedia.org dn: dc=eqiad,ou=hosts,dc=wikimedia,dc=org objectClass: domainrelatedobject objectClass: dnsdoma... [03:20:18] 6Labs: ldap queries failing when looking up domain entries - https://phabricator.wikimedia.org/T120904#1864513 (10Andrew) I suspect that wikitech is acl'd out of everything under ou=hosts,dc=wikimedia,dc=org [03:20:41] andrewbogott: would that explain it working for new instances? [03:20:54] I don’t know [03:21:58] It really should fail for everything since it can’t load eqiad.wmflabs [03:23:14] YuviPanda: I don’t see ‘instance id’ for anything in deployment-prep [03:23:22] tell me again where it’s working? [03:23:26] andrewbogott: kafka03 and 04 [03:23:41] what project is that? [03:24:14] andrewbogott: deployment-prep [03:24:19] andrewbogott: in deployment-kafka03 and 04 [03:24:23] to be more exact :) [03:24:30] andrewbogott: I don't see them now [03:24:39] andrewbogott: i wonder if the memcached clear cleared them out [03:24:41] yeah maybe they were cached somehow [03:24:43] probably [03:25:01] So I think I’m going to dump this on Moritz and go to bed. I don’t much know how to debug the server-side stuff [03:25:11] that seems fair [03:25:36] 6Labs: ldap queries failing when looking up domain entries - https://phabricator.wikimedia.org/T120904#1864517 (10Andrew) a:5Andrew>3MoritzMuehlenhoff [03:25:50] I don’t like it being broken but no one is screaming at the moment [03:28:38] yeah. [03:28:41] YuviPanda, oh, damn. Since I'm not subscribed it's been held for moderator approval [03:28:52] Krenair: 'tis ok, I'm sure someone will see it soon enogh [03:28:57] Krenair: <3 thanks for sending that out [03:29:07] greg-g, zeljkof ^ [04:03:50] Krenair: approved [04:04:07] ty [04:04:27] YuviPanda, https://lists.wikimedia.org/pipermail/qa/2015-December/002452.html [04:33:43] someone is killing NFS [04:40:05] not anymore, maybe [07:35:57] hey, I just made a new service group but it doesn't have replica.my.cnf [07:36:07] and I can't make databases [07:36:17] YuviPanda: Coren ^ [07:38:11] it's there now [07:38:13] :) [07:40:08] Amir1: how long did you wait? [07:40:20] YuviPanda: about ten min. [07:40:31] I restarted it for good measure [07:40:36] it might be running in 10min intervals anyway [07:40:59] thanks [07:41:27] I'll wait some more next time :) [07:41:40] ^ that's the motto of labs, right there :) [07:41:55] btw have you seen this? http://tools.wmflabs.org/wd-analyst/index.php I made it with your help [07:41:57] thanks [07:42:03] :))))) [07:42:49] \o/ [07:42:51] awsesome :D [07:42:54] * YuviPanda smiles very broadly :D [07:43:34] :)))) [07:54:08] 6Labs: ldap queries failing when looking up domain entries - https://phabricator.wikimedia.org/T120904#1864690 (10yuvipanda) p:5Triage>3Unbreak! Increasing priority since this makes new instance usage unusable. 
[08:12:53] (03PS2) 10Polybuildr: Add .npm-debug.log* to .gitignore [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/241919 [08:15:19] (03CR) 10Yuvipanda: [C: 032] Add .npm-debug.log* to .gitignore [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/241919 (owner: 10Polybuildr) [08:15:53] (03PS2) 10Yuvipanda: Add connections.yaml to .gitignore [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/241921 (owner: 10Polybuildr) [08:16:02] (03CR) 10Yuvipanda: [C: 032 V: 032] Add connections.yaml to .gitignore [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/241921 (owner: 10Polybuildr) [09:50:28] (03PS1) 10Alexandros Kosiaris: Remove an extra passwords::ldap::corp [labs/private] - 10https://gerrit.wikimedia.org/r/257846 [09:51:28] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Remove an extra passwords::ldap::corp [labs/private] - 10https://gerrit.wikimedia.org/r/257846 (owner: 10Alexandros Kosiaris) [10:04:11] 6Labs, 10DBA, 5Patch-For-Review: watchlist table not available on labs - https://phabricator.wikimedia.org/T59617#1864853 (10jcrespo) Because of the size and number of tables, this will take me some time (and it will require disrupting labs lag), but I will put it on my backlog and attend it as soon as I can. [10:32:10] 6Labs, 6TCB-Team, 7Tracking: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1864901 (10Addshore) 3NEW [10:36:35] Amir1: ! [10:37:29] Just wondering for http://tools.wmflabs.org/wd-analyst/index.php are all of the number pre computed and in some table somewhere? [10:37:38] or are they grabbed on the fly? [10:38:36] addshore: you can read the source :-) the table you're looking for is probably "s52781__wd_p" [10:38:53] * goes to read the code* [10:39:08] (/data/project/wd-analyst/public_html) [10:39:24] firs thing I notice if you are using some python dumps reader, you should use a php one, and then you can sue the php datamodel objects etc ;) [10:40:22] we should have a way to compile a php datamodel to python! [10:40:33] valhallasw: sofixit? :P [10:40:53] it shoudlnt be that hard right? [10:41:10] parse php to ast, convert ast to python ast, dump as python [10:41:20] the last is actually not even completely necessary [10:41:58] now if only I had a job that would pay me to implement such absurdities :P [10:42:06] if only! [10:42:30] instead, I'm forced to make pretty plots with matplotlib :> [10:44:43] 6Labs, 6TCB-Team, 7Tracking: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1864918 (10Addshore) [11:18:57] 6Labs: ldap queries failing when looking up domain entries - https://phabricator.wikimedia.org/T120904#1864954 (10fgiunchedi) "configure" of an existing instance seems to be also broken, likely the same root case [11:19:37] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1864956 (10jcrespo) [11:23:33] 6Labs, 6TCB-Team: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1864957 (10revi) [11:50:00] 10Tool-Labs-tools-Other, 10Wikimedia-General-or-Unknown: "Page Watchers" from article tool links is 404 - https://phabricator.wikimedia.org/T120927#1864978 (10Aklapper) (It's welcome to associate at least one [[ https://phabricator.wikimedia.org/project/query/active/ | project ]] with this task, otherwise nobo... 
[11:50:46] 10Tool-Labs-tools-Other, 10Wikimedia-General-or-Unknown: "Page Watchers" from article tool links is 404 - https://phabricator.wikimedia.org/T120927#1864986 (10Aklapper) CC'ing @MzMcBride. Either that tool needs to get fixed, or the link needs to be removed from the sidebar of en.wp, I guess? [13:20:56] 6Labs: ldap queries failing when looking up domain entries - https://phabricator.wikimedia.org/T120904#1865126 (10MoritzMuehlenhoff) It's unrelated to ACLs, the ACLs only restrict access to userPassword/shadowLastChanged. I can ldapsearch for dc=eqiad with my unprivileged user account just fine. [13:44:31] 6Labs: ldap queries failing when looking up domain entries - https://phabricator.wikimedia.org/T120904#1865160 (10Nemo_bis) [14:03:48] 6Labs: ldap queries failing when looking up domain entries - https://phabricator.wikimedia.org/T120904#1865181 (10MoritzMuehlenhoff) 5Open>3Resolved This is fixed with the merge of https://gerrit.wikimedia.org/r/#/c/257866/ (i.e. openldap now supports the dereference LDAP control which is apparently used by... [14:10:08] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1865206 (10Stigmj) There are six temporary import tasks running right now processing 6 months of pagecount data... [14:10:20] Yeah, that's me.. [14:10:38] someone noticed my jobs, did they? :) [14:12:45] YuviPanda, ping [14:12:55] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:08] Stigmj: jcrespo (jynus here on IRC) is our database overlord who keeps an eye on things. He's also very helpful when it comes to advice, so it might be good to discuss your requirements/use case with him :-) [14:13:33] White_Master: Yuvi is probably asleep. Do you need him specifically? If it's a more general question, others might be able to answer. [14:13:53] valhallasw, i need a labsadmin [14:14:21] valhallasw: yeah, I'll discuss this with him... thanks. [14:14:35] Stigmj, Could I ask for you to add a long sleep every now and then? That will help with the replication. Let's coordinate on the ticket [14:15:15] White_Master: In that case, please create a ticket on phab, in the #labs project. That will make sure it gets to the right people. [14:16:17] White_Master: https://phabricator.wikimedia.org/maniphest/task/create/?projects=labs [14:17:13] valhallasw, I have already sent the task. I just wanted to ask a question to a labsadmin :) [14:17:49] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 952756 bytes in 3.559 second response time [14:18:36] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1865222 (10Stigmj) I could insert a sleep some places, but do you have any recommendation as to how long this s... [14:24:16] jynus: I put in a 60 second sleep as a first test. [14:31:49] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1865258 (10jcrespo) @Stigmj tool-labs database management is based on the assumption that everywhere behaves re... 
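The throttling jcrespo asks for below is just a pause between import batches so the labsdb replica can catch up; a minimal sketch of the idea (file names and connection options are hypothetical; the database name is the one from the task above):

```bash
# Import one chunk of pagecount data at a time, sleeping a random 60-600 seconds
# between chunks. Host/connection options are omitted and will vary per setup.
for f in pagecounts-2015-*.sql.gz; do
    zcat "$f" | mysql --defaults-file="$HOME/replica.my.cnf" s52721__pagecount_stats_p
    sleep $(( 60 + RANDOM % 540 ))
done
```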
[14:35:13] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1865283 (10jcrespo) I have not commented it, but if this requires specific resources, because it could be usefu... [14:47:23] PROBLEM - Host tools-mailrelay-02 is DOWN: CRITICAL - Host Unreachable (10.68.17.61) [14:49:55] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:25] 6Labs, 7Tracking: Create a labs project for Wikimedia Venezuela' Engineering Technology Committee - https://phabricator.wikimedia.org/T119661#1865317 (10White-Master) Ping. And this task? [14:50:27] 6Labs, 10Tool-Labs: Move tools-mail to trusty - https://phabricator.wikimedia.org/T96299#1865322 (10coren) [14:50:29] 6Labs, 10Tool-Labs, 5Patch-For-Review: Provision and test tools-mailrelay-02 - https://phabricator.wikimedia.org/T97574#1865318 (10coren) 5Open>3Resolved a:3coren Made moot by the decision to skip a precise host entirely. [14:50:45] Hm. Some slowness on web. Checking. [14:51:12] not only web... bastion is not playing nice either.. [14:51:40] Looks like it either just recovered or things got slow for a bit. I got in easy. [14:51:47] * Coren looks at graphs. [14:53:17] got some errors from my grid-jobs: error reading input file: Stale NFS file handle [14:53:29] The timing is odd, though, because that happened pretty much exactly as I was doing some instance management. [14:53:49] ... stale NFS file handle? [14:54:13] Is it possible it was doing I/O on a file that got deleted on another node in the interim? [14:54:35] Otherwise, can you give the the job number? [14:54:48] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 952709 bytes in 4.026 second response time [14:54:55] they're gone now.. [14:54:57] (So I can check on the health of the node that gave you that message) [14:55:08] What tool was this? [14:55:32] but could have been one of these: 124128, 124131, 124135, 124137 or 124138 [14:56:14] i was in the process of shutting them down as the slowness occurred... [14:56:27] pagecount tool. [14:57:08] Hm. Does any of what you were doing possibly result in deleting files? LIke temporaries, or logs, etc? [14:57:33] well... I was editing the file on bastion.. [14:57:40] could have done something? [14:58:00] no deleting should hve occurred no.. [14:58:08] Hm, there are circumstances where it could have, but it's unlikely unless it's the actual running script you were editing. [14:58:13] the exact error output was: error reading input file: Stale NFS file handle [14:58:25] sorry,.. [14:58:56] /data/project/pagecount/process_specific.sh: error reading input file: Stale NFS file handle [14:59:20] it's a script called from the script in the job.. [14:59:23] And was that the file you were editing? [14:59:27] yeah. [15:00:23] but, is this an error from the script itself or an error calling the script? ... [15:00:46] Ah, then it's not a problem - that can happen, more likely if the system is being sluggish: at some point during your editing, as the file was saved, the actual on-disk file was replaced. Since the running script had the old file open, its reference to it got stale (because it no longer existed) [15:01:00] ok.. [15:01:07] no worries then.. [15:01:29] On a single-server setup, you never notice this because the file isn't actually deleted until the last open on it is gone. 
NFS can't guarantee that because the thing keeping the file open may well be on another server. [15:03:36] I'm still a little annoyed/concerned that both slowdowns in the last half hour seem to have coincided when I was creating a new instance. andrewbogott, do you think it's possibly related or was it really just an odd timing coincidence? [15:03:59] That doesn't seem very likely related to me. [15:07:23] Coren: what slowed down? [15:08:05] I wouldn’t expect it to be related unless something very strange is happening with instance creation [15:09:33] ah, I suppose ‘slowed down’ is those tool home-page warnings :( [15:10:22] 6Labs, 6operations, 10wikitech.wikimedia.org, 5Patch-For-Review, 7Wikimedia-log-errors: Job queue broken for labswiki (jobs for wikitech.wikimedia.org are not running) - https://phabricator.wikimedia.org/T117394#1865388 (10jcrespo) 5Open>3Resolved I can confirm this is fixed, last error has timestamp... [15:10:31] andrewbogott: I was on bastion and various commands were not responding, just hanging.. [15:11:10] andrewbogott: as "qstat".. just hanged.. Ctrl-C didn't respond timely either.. [15:11:35] andrewbogott: trying to open a text-file with "less" also hung... [15:11:54] andrewbogott: It is. It looked for a few minutes as though everything was cpu-starved. [15:12:04] Not NFS though? [15:12:26] andrewbogott: That was my first thought, but the graphs don't show any spike. [15:12:57] Well, they show a short /dip/ which you'd expect if things stopped doing I/O for a bit. [15:13:15] But what bugs me is that in both cases, the alert was within 2 minutes of my having created an instance. [15:13:18] do you think all of labs was io starved, or just a set of hosts that correspond to a labvirt box? [15:13:44] andrewbogott: That's a hard call. Lemme see if where my new instance landed matches. [15:13:55] Coren: I did a "top" when stuff started to work again.. https://pastee.org/rame4 [15:14:21] Stigmj: That just shows a pretty normal and happy instance. [15:14:33] PROBLEM - Puppet failure on tools-mail-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [15:14:51] andrewbogott: ^^ that's also a datapoint saying 'not NFS' as when nfs has issues the load spikes as many things contend for I/O [15:17:04] Coren: yeah.. but immediately before this, I had no response from some commands, but "ls -l" seemed to work https://pastee.org/kbbbn [15:17:34] Yeah, that pretty much excludes filesystem as a cause. [15:18:39] Ah, and qstat stalled. Even more fun. [15:20:42] Coren andrewbogott: is this feasiable / do you know how to do it? https://phabricator.wikimedia.org/T118494#1862957 [15:21:51] andrewbogott: Nope, I can't see any correlation where the instances were created. :-( [15:21:56] chasemp: So there are multiple things there... [15:22:07] shell access to labs, and tool membership are already revoked. [15:22:13] :) yeah I was thinking about the accounts.wmflabs.org thing [15:22:17] If tools maintain their own user lists internal to the tools... [15:22:23] that’s largely out of our hands. [15:22:25] yeah [15:22:43] I have never heard of/seen accounts.wmflabs.org before [15:22:48] 6Labs, 10Labs-Other-Projects, 10The-Wikipedia-Library: Create Cyberbot Project on Labs - https://phabricator.wikimedia.org/T112881#1865463 (10Sadads) [15:23:03] http://accounts.wmflabs.org/ me neither but steipp seems to point it out specifically etc [15:23:10] For the most part, we can't even know /what/ tools might have granted creds to that user. 
[15:23:10] so I was wondering if either of you had [15:23:27] I think accounts uses OAUTH though. [15:23:58] looks like not [15:24:21] andrewbogott: I'm not seeing any correlation other than time. :-( I'm thinking we want to keep an eye out next time we create an instance - if only to satisfy ourselves that this is not related? [15:24:42] Coren: want me to create one now just to see? [15:24:58] andrewbogott: I suppose that's a reasonable test. [15:25:39] chasemp: I don't think there is much we can do other than poke the maintainers and tell them that the user has had its accesses revoked and that they may consider doing the same. [15:29:20] well I punted a bit to get more opinions [15:29:21] https://phabricator.wikimedia.org/T118494#1865502 [15:29:21] but that's the way I understand the situation [15:29:21] andrewbogott: Did you just create an instance? [15:29:21] yes [15:29:22] Bastion has stalled. [15:29:22] And, recovered a bit, but very sluggish still. [15:31:18] There was definitely an effect, though nowhere near as dramatic as the previous one. [15:31:29] It took about 20s for me to start a simple 'top' though. [15:36:44] valhallasw, are you involved in toolsbeta? [15:38:07] Coren: can you figure out about virt host arrangement? [15:38:28] andrewbogott: Yeah, I'm trying to do that now. What is the instance you created? [15:38:54] andrewbogott: yes, sort of [15:39:11] Coren: create-test-110.testlabs.eqiad.wmflabs on labvirt1010 [15:39:16] andrewbogott: as in: I test puppet manifests there sometimes [15:39:29] valhallasw: toolsbeta-mail puppet says: [15:39:38] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Reading data from Toolsbeta/host/toolsbeta-mail failed: NoMethodError: undefined method `[]' for nil:NilClass at /etc/puppet/manifests/realm.pp:11 on node toolsbeta-mail.toolsbeta.eqiad.wmflabs [15:39:38] Warning: Not using cache on failed catalog [15:39:48] which results in it being unreachable to folks without a root key [15:40:15] oh, right, because ldap changed [15:41:22] valhallasw: any suggestions as to how to fix? [15:41:24] Or who to ask? [15:41:35] I can login to toolsbeta-puppetmaster3., though, so I'll just do a git reset --hard there [15:41:50] (for the ops-puppet repo) [15:42:28] andrewbogott: No visible correlation. 1010 doesn't have the bastions, nor anything that could clearly affect them. [15:42:56] huh [15:43:03] then… network? [15:43:58] * Coren grunts. [15:44:58] That seems to be a deeply unsatisfying guess to me. [15:45:25] andrewbogott: ok, reset /var/lib/git/operations/puppet to origin/production. That should fix it (hopefully...) [15:45:38] if not, I think we can just kill the host [15:45:57] andrewbogott: But I /do/ see a very visible dip in labnet1002 traffic matching when you created the instance though. Hmmm. [15:46:07] valhallasw: I’m pretty sure I already did that… it’s something broken in the node def, not the puppet repo [15:49:29] RECOVERY - Puppet failure on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:50:12] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Brclz was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=222656 edit summary: [15:52:42] hey yall, there's a single instance (that i've tried) that I can't log into anymore as of today. it was fine yesterday, and some other instances in its project still work fine [15:52:46] getting Permission denied (publickey). [15:53:43] ottomata: Define "yesterday"? 
[15:54:05] Also, what instance is this? :-) [15:54:22] uhm, when i last tried yesterday probably around 18 hours ago [15:54:35] deployment-eventlogging03 in the deployment-prep project [15:55:29] * Coren checks. [15:56:38] Ah, disabled puppet. [15:56:45] (03CR) 10ArthurPSmith: "Hi Ricordisamoa - not sure what you're suggesting - replace the get_entities query with SPARQL?" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 (owner: 10ArthurPSmith) [15:57:01] ottomata: Can I reenable puppet on that box? [15:58:13] yes please [15:58:28] Running puppet now. [15:59:29] ottomata: Should be fixed. Try? [15:59:58] yes thank youuuuu [15:59:59] sorry about that [16:00:18] No worries; I like when problems are that easy to track down and fix. :-) [16:24:32] 6Labs, 10Tool-Labs: Common http error response pages - https://phabricator.wikimedia.org/T89864#1865681 (10coren) 5Open>3declined a:3coren That used to be the case, but tool maintainers have requested that error messages from their web services be displayed untouched. It's still the case for 503 errors... [16:25:58] 6Labs, 10Tool-Labs: Create a set of 'template' tools in various languages with deploy scripts for toollabs - https://phabricator.wikimedia.org/T91059#1865697 (10coren) @yuvipanda: Do you count your k8s work as filling in that task? If so, we might want to merge and/or close this one? [16:28:28] 6Labs, 10Tool-Labs: Document how to turn shadow into master - https://phabricator.wikimedia.org/T91133#1865707 (10coren) 5Open>3Resolved a:3coren ... it became true about 3 minutes later. :-) [16:29:57] 6Labs, 10Tool-Labs: Have a 'undergoing scheduled maintenance' page for toollabs set up for scheduled maintenance - https://phabricator.wikimedia.org/T90595#1865711 (10coren) a:3coren This can be done fairly cleanly at the proxy level (though of course that will not catch the cases when the proxy //themselves... [16:30:49] 6Labs, 10Tool-Labs: Webservice start failing with duplicate port allocation from portgranter - https://phabricator.wikimedia.org/T93875#1865721 (10coren) 5Open>3Resolved a:3coren Long resolved. [16:31:56] 6Labs, 10Tool-Labs: Make jsub / qsub default to trusty instances - https://phabricator.wikimedia.org/T94792#1865727 (10coren) I think this is now a reasonable step; but we probably want a month's warning or so. I'm going to make an annoucement about this - the change itself is trivial enough. [16:33:11] andrewbogott: huh. weird, not sure what's going on then. That line is a hiera() line, and the error seems server-side [16:33:28] valhallasw`cloud: maybe just restart the puppetmaster? [16:33:34] yeah, good idea [16:34:41] 6Labs, 10Tool-Labs: Have a 'undergoing scheduled maintenance' page for toollabs set up for scheduled maintenance - https://phabricator.wikimedia.org/T90595#1865734 (10valhallasw) We already have a 'tool labs is down' hiera-based handler. We can probably just use that for this, assuming it's supposed to block/c... [16:34:52] ok, done. And now we wait ;-) [16:37:59] 6Labs, 10Tool-Labs: Make jsub / qsub default to trusty instances - https://phabricator.wikimedia.org/T94792#1865745 (10valhallasw) During the last discussion on this, @scfc suggested to skip trusty and go to jessie immediately. That limits the number of times users are forced to upgrade. On the other hand, ou... 
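Until the default release changes as discussed in that task, targeting trusty exec nodes is done per job with the flag mentioned earlier in this log; a sketch (job name and script are hypothetical):

```bash
# Run a grid job on a trusty exec node explicitly (the default is still precise):
jsub -l release=trusty -N myjob -mem 512m php ~/my_tool_script.php
# The same -l release=... resource request works when calling qsub directly.
```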
[16:38:59] 6Labs, 10Tool-Labs: Have a 'undergoing scheduled maintenance' page for toollabs set up for scheduled maintenance - https://phabricator.wikimedia.org/T90595#1865755 (10coren) @valhallasw: that'd do for tools, but won't cover the general proxy would it? [16:41:01] 6Labs, 10Tool-Labs: Have a 'undergoing scheduled maintenance' page for toollabs set up for scheduled maintenance - https://phabricator.wikimedia.org/T90595#1865764 (10valhallasw) This task has 'tool labs' in the title ;-) As for the rest of labs -- I think the central proxy is also an instance of dynamicproxy... [16:41:12] andrewbogott: so, I 've got VMs without in instance ID right now [16:41:31] just created it and it won't get an instance id and I can't connect to it via ssh [16:41:46] 6Labs, 10Tool-Labs: Make jsub / qsub default to trusty instances - https://phabricator.wikimedia.org/T94792#1865765 (10coren) Jessie is a no-starter while we rely on gridengine; which is going to be for a while still (k8s provides a superior alternative for many, but not all, scenarios and migration in that di... [16:42:16] Coren: oh, right, SGE :( [16:42:46] akosiaris: project, instance? [16:42:58] andrewbogott: etherpad, etherpad001 [16:43:06] but also packaging, sdfdsfdsfsdfdsfdsf [16:43:21] well a bit fewer sdfs [16:43:35] vm seems to be up [16:43:38] ip is pinging [16:43:43] 6Labs, 10Tool-Labs: Have a 'undergoing scheduled maintenance' page for labs set up for scheduled maintenance - https://phabricator.wikimedia.org/T90595#1865772 (10coren) [16:43:57] ssh listening... [16:44:00] so DNS fails ? [16:44:13] 6Labs, 10Tool-Labs: Have a 'undergoing scheduled maintenance' page for labs set up for scheduled maintenance - https://phabricator.wikimedia.org/T90595#1062829 (10coren) Fixed the title. Labs/labs/labs/labs. Don't let geeks name things. :-) [16:50:11] Core: still laggy? [16:50:21] Coren: ^ [16:50:59] andrewbogott: ah the sdfsdfsdf vm did get an instance ID now [16:51:33] did you do something or did it happen on its own ? [16:51:40] akosiaris: I think something is wrong with rabbitmq — communication between labs servers seems to be unreliable [16:51:48] sometimes I restart things and I get a sudden flood of messages [16:52:19] andrewbogott: Stigmj reports a bout of lag too - this is becoming a pattern. [16:52:21] * akosiaris refrains from commenting on what he thinks of rabbitmq [16:55:18] Coren: funny stuff.. trying to ping login.tools.wmflabs.org and the TTL is fine (when it starts answering), but the whole operation took 12 seconds.. This was during the lag I just experienced. https://pastee.org/kc286 [16:55:59] o_O that's from your home box? [16:56:05] Coren: any chance the lags are specific to things that hit ldap? sudo for example? [16:56:21] andrewbogott: It might be DNS [16:57:04] yep.. [16:57:04] andrewbogott: ... which relies on LDAP, doesn't it? [16:57:05] public or internal? [16:57:05] andrewbogott: either both? Stigmj's symptom sez public [16:57:05] public is ldap [16:57:05] internal is not [16:57:11] So LDAP then. [16:57:32] Which would also be coherent with creating instances being causal since that hits ldap too [16:57:53] yeah [16:57:55] failed to bind to LDAP server ldap://ldap-eqiad.wikimedia.org:389: Connect error: TLS: hostname does not match CN in peer certificate a lot of this [16:58:06] when trying to create instances just now [16:58:13] it was working about 1/2 hour ago (?) [16:58:19] yeah, new instances use a base image with the old ldap server name [16:58:25] wait, really? 
[16:58:28] that happens when trying to hit the old [16:58:39] apergos: I suspect that it was working AND you were getting those same errors. [16:58:52] That happens always until the first puppet run. [16:58:54] well maybe I missed them in console output [16:58:59] I believe that the actual problem is designate hitting ldap and timing out (breaking new instance creation) [16:59:08] moritzm: I think we must have hit another connection limit in openldap [16:59:11] (BTW, andrewbogott, I"ve got a pending merge to image creation if we're going to rebuild them) [16:59:21] let me have a look [16:59:56] andrewbogott: That'd be coherent with lots of things stalling in labs too - getent() lookups rely on ldap. [17:00:06] as does anything with keystone [17:00:07] yeah, DNS is definitely lagging [17:00:10] ok, that image came up. but the previous attempt just went out to lunch [17:00:20] anyways as long as I have an image I will ignore it [17:00:25] eh, wait. [17:00:40] what is supposed to be the external-facing DNS server? [17:00:46] labcontrol1001.wikimedia.org? [17:01:01] labs-ns0 and labs-ns1 [17:01:06] which are labcontrol1001 and labcontrol1002 [17:01:08] pdns backed by ldap [17:02:43] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Amitie 10g was created, changed by Amitie 10g link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Amitie_10g edit summary: Created page with "{{Tools Access Request |Justification=I want to test my tools and illustrate to everyone how them works |Completed=false |User Name=Amitie 10g }}" [17:03:08] we're well under the limits 1100/1400 out of 4096 [17:04:09] labcontrol1001 is 208.80.154.92, labcontrol1002.wikimedia.org doesn't exist? -ns0 is .94, -ns1 is .102, so there's something weird there [17:04:21] 6Labs, 6operations, 5Patch-For-Review: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1865865 (10faidon) [17:04:48] moritzm: I’m going to do something, can you watch and see if it spikes? [17:04:58] sure [17:05:32] well, of course everything is snappy now [17:05:53] do you have any historical data to see if there was an activity spike in the last hour or two? [17:05:55] numbers are mostly identical to 5 mins ago [17:07:07] laggity lagg.. [17:08:16] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Amitie 10g was modified, changed by Amitie 10g link https://wikitech.wikimedia.org/w/index.php?diff=222854 edit summary: [17:16:49] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1865912 (10Stigmj) I have put in some random sleeps (between 60 and 600 seconds) in between each time my import... [17:19:04] andrewbogott: one issue I have seen is nodes on old ldap can throw [17:19:05] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed when searching for node shinken-ircbot-testing.shinken.eqiad.wmflabs: LDAP Search failed [17:19:05] Warning: Not using cache on failed catalog [17:19:05] Error: Could not retrieve catalog; skipping run [17:19:18] which means they crap out on getting their new ldap settings [17:20:35] huh, but that's a server-side error? [17:21:28] so for some reason the puppetmaster fails searching in ldap [17:21:49] It all circles back to ldap. 
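A quick way to check whether the intermittent lag above is in the LDAP-backed public DNS path rather than the network or NFS; a sketch using the server names mentioned in the discussion (the ldapsearch assumes whatever bind credentials you would normally use, omitted here):

```bash
# Time the public (pdns-on-LDAP) resolvers directly.
time dig +short @labs-ns0.wikimedia.org login.tools.wmflabs.org
time dig +short @labs-ns1.wikimedia.org login.tools.wmflabs.org
# Time a simple query against the new LDAP servers themselves.
time ldapsearch -x -LLL -H ldap://ldap-labs.eqiad.wikimedia.org \
     -b dc=wikimedia,dc=org dc=eqiad dn
```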
[17:22:18] valhallasw`cloud: yeah I agree [17:22:47] RECOVERY - Host tools-shadow is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [17:23:10] skinken, you lie. [17:23:47] Coren: I'm going to guess a new host now has tools-shadow's old ip [17:24:11] valhallasw`cloud: Hm. Puppet on shinken is more recent than that. [17:25:57] Yeah, puppet is up to date - I'm not sure why shinken remains convinced that instance still exists. [17:26:24] dns cache? [17:26:38] not sure why that would happen, though [17:40:24] PROBLEM - Host tools-shadow is DOWN: CRITICAL - Host Unreachable (10.68.16.10) [17:41:56] !log wikimania-support Rebooting bd808-vagrant.wikimania-support.eqiad.wmflabs to see if that makes ssh magically work again [17:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikimania-support/SAL, Master [17:42:32] bd808: Hi, I got a question concerning mediawiki-vagrant [17:42:49] Luke081515: ask :) [17:42:54] I can't access the database at the instance, via the normal mysql command [17:43:19] user: root password: vagarnt [17:43:32] luke081515@wiki-rcm:~$ mysql -u root -p [17:43:32] -bash: mysql: command not found [17:43:51] RECOVERY - Host tools-mailrelay-02 is UP: PING OK - Packet loss = 0%, RTA = 1.86 ms [17:43:58] is this in a labs host using the LXC based role? [17:44:13] * Coren goes to grab a quick lunch. [17:44:43] Luke081515: if so you probably need to login to the LXC container with `vagrant ssh` first [17:50:08] chasemp: a reboot of bd808-vagrant.wikimania-support.eqiad.wmflabs didn't magically fix anything. I can see in the wikitech console that it rebooted, but I'm getting my ssh key denied. I know this is a self-hosted puppetmaster instance so something broken with that seems the likely issue. [17:50:25] RECOVERY - Host tools-shadow is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [17:50:29] bd808: I can't get in to fix it, so your call [17:50:45] PROBLEM - Host tools-mailrelay-02 is DOWN: CRITICAL - Host Unreachable (10.68.17.61) [17:50:47] so ssh as root is toast too? [17:51:44] for me at least yes [17:51:59] for me too [17:52:10] (and my key has been in root authorized for longer) [17:52:14] blerg. ok [17:52:20] bd808: I can get in. [17:52:34] \o/ Coren is teh powerful [17:52:54] chasemp: I'm guessing puppet is out of date enough that your key isn't there. :-) [17:53:03] figured [17:53:16] let me state again, the self hosted puppet master mechanism here just plan sucks :) [17:53:20] plain even [17:53:42] it is mostly awesome except when it blows up [17:53:54] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0] [17:54:27] Lemme see what I can do. [17:55:33] Huh. This is self-hosted puppetmaster? It doesn't look like puppet ran ever on that box. [17:56:46] bd808: Exiting; no certificate found and waitforcert is disabled [17:56:56] That box never had running puppet afaict. [17:57:26] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [17:57:36] it did I am pretty certain. I wonder if It got messed up by the work YuviPanda has been doing related to self-hosted [17:57:54] bd808: It might, if it's really /really/ old. [17:58:03] Coren: could you try your magic on services-deploy.services.eqiad.wmflabs [17:58:06] it certainly is [17:58:08] can't get in there no one knows where it came from :) [17:58:35] chasemp: I haz the powah! 
[17:59:58] bd808: nope, I didn't touch anything outside of tools and made no changes to the self code itself [18:00:00] That host builds iso images of mediawiki-vagrant for putting on usb sticks for hackathons [18:00:00] it has been around for probably close to 2 years [18:00:01] chasemp: Seeing what I can do. [18:00:02] bd808: Well, I'm in but there's no apparent way I can bring it up to date - puppet won't run at all in its current state. [18:00:13] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [0.0] [18:00:34] is there any way to hack my key into working on it? I can rebuild from scratch (and will) but it would be nice to grab a few files [18:00:35] bd808: Well, I could sign the cert, but it's suspicious that it isn't already. [18:01:28] bd808: I can probably hack something together for that if you give me a few. [18:01:42] I vote freeze puppet and add his key to root and count on nuking it [18:01:53] it's probably way old and there have been a ton of self master changes [18:01:53] I'm in no rush. chasemp was just wanting it crossed off the list of broken crap [18:02:22] we can leave it I guess as a "we don't care ldap is bad here" [18:02:24] chasemp: The issue is that puppet is completely broken anyways; so that's not an issue. [18:02:56] services-deploy has a very very out of date puppet.conf that prevents it from running. That's a reasonably easy fix. [18:03:16] 6Labs, 10Labs-Infrastructure, 6operations, 10ops-eqiad: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1866123 (10Cmjohnson) @coren: we no longer have the spar h800 we used in labstore1001 last month. [18:04:13] Oh boy - that thing hadn't had a puppet run in ages. Because: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::deployment::deployment_servers::labs for i-00000705.eqiad.wmflabs on node i-00000705.eqiad.wmflabs [18:05:03] RECOVERY - Puppet staleness on tools-exec-cyberbot is OK: OK: Less than 1.00% above the threshold [3600.0] [18:05:18] chasemp: That box won't work until manifest fixes. Do you know who manages it? [18:05:41] it's mine [18:05:53] RECOVERY - Puppet staleness on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [3600.0] [18:06:06] oh, not services-deploy sorry [18:06:20] bd808: btw, unrelated, but I added a way for people to add their own root keys to projects they manage. Krenair sent an email about it to qa@ I think [18:06:26] bd808: No, yours is completely gone. :-) [18:07:12] Coren: coolio. I'll build a replacement and then nuke that one (it is actually still doing it's job at the moment -- http://wikimania-vagrant.wmflabs.org/wiki/Main_Page) [18:09:49] bd808: try to log in as root to bd808-vagrant? [18:10:00] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [18:10:10] 6Labs: 'virt1' entry at markmonitor? - https://phabricator.wikimedia.org/T102689#1866178 (10Dzahn) >>! In T102689#1864257, @RobH wrote: > it is something we pay them for as part of their service. :thumbsup: [18:10:16] Coren: I''m in! thanks [18:12:52] PROBLEM - Host tools-shadow is DOWN: CRITICAL - Host Unreachable (10.68.16.10) [18:14:52] chasemp: What do you want me to do with services-deploy? It's going to be broken until someone who knows what manifests it's supposed to be running fixes them. [18:14:52] chasemp: At the very least now it's able to try to update itself. 
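A minimal sketch of the "freeze puppet and add his key to root" idea discussed above, assuming someone still has a working root shell on the broken instance; the key string is a placeholder, not a real key:
    puppet agent --disable 'frozen while rescuing files from this instance'
    echo 'ssh-rsa AAAA... placeholder-key-for-bd808' >> /root/.ssh/authorized_keys
    chmod 600 /root/.ssh/authorized_keys
    # copy whatever is needed off the box, then either rebuild it or run: puppet agent --enable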
[18:14:53] I say propose it for deletion [18:14:53] on https://etherpad.wikimedia.org/p/remaining-ldap [18:14:53] kk [18:14:53] no one w/ services seems to know what it is [18:14:54] chasemp: I looked at the SAL for that project. I think it was added at some point to power Trebuchet deploys inside the project [18:15:06] I bet it hasn't been used for that for a long time [18:15:09] maybe never used? [18:15:40] there was a sal entry from me about joining the project to debug trebuchet problems for them [18:16:24] Turns out the ldap migration will have shaken out a lot of abandoned cr... things. :-) [18:16:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:03] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:15] blah. ldap seems to have had another burp [18:17:27] moritzm: ^ [18:17:38] Very brief this time though - it looks as though they keep occurring but are getting less bad. [18:18:25] * Coren wonders if the instances that don't like the certificate hammer too hard on it - since we're reducing their number. [18:18:48] it's back up now [18:18:58] YuviPanda: Yeah, that one didn't last long. [18:19:08] Barely a minute [18:19:56] YuviPanda: can you look at 1 logstash01.logstash.eqiad.wmflabs. [18:20:01] I think unnecessary enc? [18:20:11] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed when searching for node logstash01.logstash.eqiad.wmflabs: Failed to find logstash01.logstash.eqiad.wmflabs via exec: Execution of '/usr/local/bin/ldap-yaml-enc.py logstash01.logstash.eqiad.wmflabs' returned 1: [18:20:11] Warning: Not using cache on failed catalog [18:20:11] Error: Could not retrieve catalog; skipping run [18:21:25] chasemp: yup looks like it [18:21:29] let me remove [18:21:33] hmm [18:21:47] actually, people can put their node/labs files in their locally checked out puppetmaster [18:21:51] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 953344 bytes in 4.955 second response time [18:21:55] bd808: what's the puppetmaster for logstash? [18:21:59] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 953399 bytes in 5.366 second response time [18:22:36] YuviPanda: that whole project is old dead crap [18:22:42] could be nuked from orbit [18:23:16] bd808: let's do it! [18:23:21] imma kill them all then [18:23:28] I last touched it over a year ago when trying to get a place for matanya to play with logstash [18:23:40] bd808: how 'bout 1 bd808-vagrant.wikimania-support.eqiad.wmflabs.? [18:23:56] hmm I need my phone [18:24:05] but I'm in bed and don't want to get up [18:24:11] since it's too cold outside blankets [18:24:14] hmm [18:24:30] chasemp: I'm going to drag a few things off of that host, build a replacement and then nuke bd808-vagrant [18:24:36] kk [18:24:43] chasemp, YuviPanda, Coren, at least one of the VMs on the etherpad (util-abogott) is definitely fully puppetized and happy, and has always been. [18:24:54] so if you are tracking things assign to me and yell if it's not gone in 2 days [18:24:57] So something is lurking and hitting nembus and I don’t know what it is [18:25:04] andrewbogott: any thoughts on why that would try to hit it? [18:25:05] ok [18:25:06] but if that’s true then /every/ labs instance should be hitting nembus... [18:25:07] ???
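A hedged way to reproduce the ENC failure quoted above is to run the classifier by hand on the puppetmaster; the script path and hostname come straight from the log, while the expected YAML shape is an assumption about what an exec-style node classifier returns:
    /usr/local/bin/ldap-yaml-enc.py logstash01.logstash.eqiad.wmflabs ; echo "exit status: $?"
    # a healthy run would print YAML shaped roughly like:
    #   classes:
    #     - role::some_role
    #   parameters:
    #     labsproject: logstash
    # an exit status of 1 with no output is what makes the master fail the catalog compile.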
[18:25:24] we may need a bigger tcpdump sample [18:26:14] yeah [18:26:31] might also be a caching thing, since I think it was slightly less than 24 hours when moritz ran his report [18:31:57] andrewbogott: Hm. Lemme take a look at it. [18:32:26] RECOVERY - Host tools-mailrelay-02 is UP: PING OK - Packet loss = 0%, RTA = 2.89 ms [18:35:43] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [18:35:44] 6Labs, 10MediaWiki-Vagrant, 15User-bd808: Create "mediawiki-vagrant" project - https://phabricator.wikimedia.org/T120982#1866308 (10bd808) 3NEW a:3bd808 [18:35:45] no, it was earlier in the day, maybe 6 hrs earlier than right now [18:35:45] 6Labs, 7Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#1866321 (10bd808) [18:35:45] 6Labs, 10MediaWiki-Vagrant, 15User-bd808: Create "mediawiki-vagrant" project - https://phabricator.wikimedia.org/T120982#1866319 (10bd808) 5Open>3Resolved https://wikitech.wikimedia.org/wiki/Nova_Resource:Mediawiki-vagrant [18:35:47] but surely, a current tcpdump won't hurt [18:35:47] moritzm: What I mean is, you ran the audit <24 hours after we switched over to the new ldap servers. Correct? [18:35:48] true that [18:35:48] just speculating [18:35:48] it was about 16-18 hrs after we switched the servers [18:35:50] chasemp: puppet's good on the quarry instances [18:38:23] andrewbogott: afaict, at this time the only ldap traffic from your test box is towards seaborgium and serpens (interestingly enough - both) [18:38:36] Right... [18:38:48] so I think that means that our etherpad list probably has a ton of false positives [18:39:00] moritzm: do you have time to re-run that audit? [18:39:18] (I do not relish reconciling new audit info with the existing etherpad) [18:39:58] I wonder why both. libldap being smart enough to RR between the two servers? [18:40:44] andrewbogott: I'll be mostly out soon, simply run "tcpdump -n dst port 389 -w ldap.dump" and "tcpdump -n dst port 686 -w ldaps.dump" on either host [18:40:49] moritzm: thought it would round-robin per query [18:40:54] which surprises and impresses me if true [18:41:00] moritzm: ok, thanks [18:41:11] * andrewbogott is trying to attend V.A.’s talk right now but this may be futile [18:41:51] I'm not sure on the lookup behaviour (whether RR or sticking with the first and only using the second as fallback) [18:41:53] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [18:42:33] moritzm: that seems important… if it’s using the second as a fallback then that would imply that we’re failing over constantly [18:43:07] andrewbogott: That's what I was wondering exactly. [18:51:57] hm… whatever tcpdump -n dst port 389 -w ldap.dump is producing is not a format I can read [18:53:17] that is tcpdump's custom format, you can open it in wireshark, e.g. [18:53:32] it has a statistics option which gathers all the endpoints [18:53:38] statistics mode [18:54:19] So simply run "tcpdump -n dst port 389 -w ldap.dump" and "tcpdump -n dst port 686 -w ldaps.dump" on either host and then :) [18:54:44] 'open it in wireshark' is the step. :-) [18:54:56] open the file in wireshark, go to the "Statistics" menu -> then "endpoints" [18:55:04] (I have wireshark 2, might be different in 1.x) [18:55:15] wireshark being something that is... [18:55:19] and there's an option to export to CSV [18:55:22] a local tool, or a unix shell tool?
[18:55:44] I feel like I can accomplish this with sed and grep, in any case [18:55:54] local tool for packet analysis, open source and should be available for macos [18:56:09] off for dinner [18:58:16] I am clearly not the right person for this job, but I will see about running a new report this evening. [19:00:58] i'm going to make my way to the office [19:01:05] and continue going through the etherpad [19:04:18] andrewbogott: I have no issues doing so myself if you want. [19:04:38] Coren: yeah, if you think you can reproduce moritz’s process, that would be great. [19:04:55] (if nothing else, because I need lunch) [19:05:05] andrewbogott: No probs. I think we should have a fairly wide window to be useful. 1h sounds good to you? [19:06:56] yeah, should be plenty [19:10:26] 6Labs, 6TCB-Team, 15User-bd808: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1866426 (10bd808) 5Open>3Resolved a:3bd808 https://wikitech.wikimedia.org/wiki/Nova_Resource:Lizenzhinweisgenerator [19:17:49] if we remove the to be deleted ones a reaudit should be fresh and valid as it wont show fixed VMs [19:18:11] but i am going to the dentist atm [19:20:32] Oh, dumbass. I was doing the dump on the /current/ ldap servers and wondering why they were growing so fast! [19:25:57] 10Quarry: Login to somenody's account - https://phabricator.wikimedia.org/T120988#1866469 (10IKhitron) 3NEW [19:26:06] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866476 (10IKhitron) [19:27:06] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866479 (10yuvipanda) How long ago did this happen? [19:28:05] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866480 (10IKhitron) A minute ago. [19:29:06] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866482 (10yuvipanda) Probably a blip from the redis session stuff I was just doing. I'll clear out everyone's sessions to make sure :) [19:30:18] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866490 (10IKhitron) Hope you are right. Otherwise it's security problem. :-) [19:31:38] has anyone tested new instance creation today? I spun up mwv-iso-builder.mediawiki-vagrant.eqiad.wmflabs about 50 minutes ago and although it has booted it will not accept my ssh key yet [19:31:47] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866492 (10yuvipanda) I've just cleared out everyone's sessions, which should clear up any other missing issues! Hopefully I won't have to mess with redis again for a while :) Can you confirm that your session has been logged out? [19:33:16] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866500 (10IKhitron) Indeed, it has. Thank you. [19:34:28] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [19:35:14] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866505 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Ok :) Closing it for now! :) Thanks for reporting! <3 [19:36:02] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866508 (10IKhitron) No problem. 
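A hedged sketch of the "sed and grep" route mentioned above, as an alternative to Wireshark's Statistics -> Endpoints dialog: read the capture back with tcpdump and count source addresses. (Aside: LDAPS conventionally listens on port 636, so the "dst port 686" in the quoted commands is probably a typo.)
    # the third field of tcpdump's one-line output is the source "ip.port";
    # strip the trailing port and count how often each source appears
    tcpdump -nn -r ldap.dump 2>/dev/null | awk '{print $3}' | sed 's/\.[0-9]*$//' | sort | uniq -c | sort -rn | head -20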
[19:51:55] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [19:53:11] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866567 (10Edgars2007) I'm everywhere :) [19:53:50] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866569 (10yuvipanda) As the wise poets from 'The Beatles' once said, 'here, there, everywheeeerreeeeeee' [19:54:34] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866576 (10IKhitron) You're everywhere, @Edgars2007, or anyone can use your account? :-) [20:18:35] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1866675 (10Edgars2007) >>! In T120988#1866576, @IKhitron wrote: > You're everywhere, @Edgars2007, or anyone can use your account? :-) I trust you :) [20:30:28] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [20:45:01] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [21:02:27] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Frisko was created, changed by Frisko link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Frisko edit summary: Created page with "{{Tools Access Request |Justification=Host the bot FriskoBot for use on Swedish Wikipedia. Today the bot is mostly used to update lists of maintenance categories, such as "Kat..." [21:36:28] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [21:46:55] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [22:00:01] 6Labs, 10Tool-Labs: Update Java 7 to Java 9 - https://phabricator.wikimedia.org/T121020#1867127 (10doctaxon) 3NEW [22:01:06] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [22:02:18] 6Labs, 10Tool-Labs: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1867149 (10doctaxon) [22:03:39] andrewbogott: YuviPanda ok so deal is [22:03:47] I got promethium a cert from labcontrol1001 [22:03:52] if I puppet agent --test on taht server [22:03:57] it contacts labcontrol1001 [22:04:21] and the master returns [22:04:22] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find node 'promethium.eqiad.wmnet'; cannot compile [22:04:27] chasemp: how did you get there? When I tried last night it wouldn’t do a cert exchange [22:04:31] this confuses me as I expected it to get the default node value [22:04:50] andrewbogott: so the way orphan clients work is tehy look for the puppet hostname [22:04:51] I think it might be an error from the ldap terminus [22:04:55] and attempt to request a cert [22:05:06] YuviPanda: yeah, I bet [22:05:10] I basically cheated for testing and wiped out the old cert stuff and put an entry in /etc/hosts [22:05:15] that says "puppet" is labcontrol1001 [22:05:35] it autosigned it and such which was a little alarming [22:05:40] chasemp: I had already changed puppet.conf to use labcontrol1001 as the master. Any idea why that didn’t work? 
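A hedged reconstruction of the bootstrap chasemp describes here, not a transcript of the exact commands; the IP is labcontrol1001's address from earlier in the log, and the ssl path assumes a stock Puppet 3 layout:
    rm -rf /var/lib/puppet/ssl                  # drop the old client cert state
    echo '208.80.154.92 puppet' >> /etc/hosts   # make the generic "puppet" name resolve to labcontrol1001
    puppet agent --test                         # requests a new cert (autosigned here), then trips over the missing node entry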
[22:05:40] but this is my next up mystery [22:05:54] so create a ldap entry for promethium.eqiad.wmnet [22:05:59] and see if that gets rid of the error [22:06:03] even if it returns empty stuff [22:06:22] good question I have never done it that way andrewbogott, I was under the impression puppet agent looked for teh generic puppet hostname and best practice was to have that resolve as intended [22:06:27] taht's jsut like, the way I know from years ago [22:06:29] sure [22:06:35] but labs hosts use a named puppetmaster [22:06:40] that’ll get changed as soon as we get a good puppet run [22:06:41] ok then not sure [22:06:44] 2 mysteries [22:06:44] huh [22:06:48] well, anyway, it’s progress :) [22:07:10] so we are thinking if the ldap entry is missing instead of falling through [22:07:12] it's just kicking back [22:07:21] right [22:07:22] I’ll work on the node definition thing. I’d rather we didn’t have to add an ldap entry for metal servers but I need to think about how to avoid that [22:07:29] agreed [22:07:35] maybe just pass on w/o one? [22:07:44] where is that defined to look in ldap? [22:07:44] once we get it puppetized it’ll rename the puppetmaster and then I can work on that issue again :) [22:08:03] I don’t know yet, need to look. It might be in the local puppet.conf on promethium [22:08:11] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1867173 (10doctaxon) 3NEW [22:08:50] hm, nope, not there [22:08:57] so I kind of stuck on that and I'm working through w/ a local repo what sticks against the labs role application atm [22:09:02] andrewbogott: chasemp one of the things I was thinking of was we write a proper real ENC that's backed by MySQL and integrate that into horizon [22:09:18] then we can do nice things with it and get rid of LDAP for things [22:09:26] yeah I'm good w/ that I find ldap for this to be kinda meh [22:09:42] so that's possibly one thing we can do that'll be much more useful than service groups IMO [22:10:00] and would allow us to not let LDAP touch horizon for now [22:10:42] also more man-power intensive tho [22:10:50] andrewbogott: quick question as an aside, do we still think ldap (new) is being overwhelmed? [22:10:53] it's causing dns issues etc? [22:11:07] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [22:11:11] chasemp: I don’t know. I suspect the problem is still there but haven’t seen it recently. [22:11:15] kk [22:11:58] bd808 was saying he had new instance issues around lunch or so fyi (I think) [22:12:38] back to the qeustion at hand, YuviPanda do you know where the logic is that tells puppet to look in ldap? [22:13:20] Looks like it’s not easy to bypass the ldap node classifier selectively. I’ll add a dummy entry for promethium [22:13:29] it’s in /etc/ldap.conf on labcontrol1001 [22:13:37] https://docs.puppetlabs.com/guides/ldap_nodes.html [22:14:33] ok so dummy entry that adds nothing but doesn't block etc [22:14:42] and that goes on our todo list to work around [22:16:03] chasemp: yeah, it's in puppet.conf in the master [22:16:38] ok I get it now, it helps when you look in the right place [22:17:04] chasemp: ok, I added an ldap record [22:17:11] now it’s failing on the metadata call which we talked about the other day. [22:17:15] I’ll write a puppet patch for that [22:17:19] chasemp: > modules/role/manifests/labs/puppetmaster.pp: 'node_terminus' => 'ldap', [22:18:19] andrewbogott: what project are you going to use? 
[22:18:26] testlabs for now [22:18:30] k [22:18:39] unless there’s somewhere more obvious [22:19:23] nope [22:22:51] brb (again) [22:24:28] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [22:24:52] ^is this dns flapping? [22:25:35] so are you all aware of deployment-kafka03 being broken? [22:25:58] chasemp told me even his root key doesn't work earlier [22:26:31] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1867266 (10doctaxon) [22:26:50] 6Labs, 10Tool-Labs: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1867270 (10doctaxon) [22:27:01] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1867272 (10Krenair) Urgently? Why? [22:27:09] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1867276 (10Krenair) p:5High>3Triage [22:27:24] I think we know but thcipriani was going to look at deployment/staging things for ldap changeover [22:27:39] 6Labs, 10Tool-Labs: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1867282 (10Krenair) p:5High>3Triage {{cn}} [22:27:40] and I don't know what they may or may not have done there config wise or who uses that vm [22:27:47] PROBLEM - SSH on tools-exec-1221 is CRITICAL: Server answer: [22:28:18] I took a look at staging, hadn't looked at the deployment ones. [22:28:26] [17:53:58] can you not get an interactive console from the actual host, chasemp? [22:29:04] I can't ssh to it, if there is some vnc trick I don't know it [22:29:48] andrewbogott: so I have a local setup ready to apply role::labs::instance there [22:30:05] mind if I go through w/ it to uncover parts that go boom while you work that out? [22:30:11] sure [22:30:28] yeah, deployment-kafka03 isn't reachable by salt, may need someone able to login with root on that one. [22:30:40] Coren: ^ you about to try this? [22:30:44] thcipriani, you cannot log in as root. [22:31:50] Krenair: it may be that some ppls root keys are there, we have seen that [22:31:53] depends on how long and how it's broken [22:32:12] andrewbogott: if you want to watch it's going to /root/puppet_run_1.txt [22:33:10] Dispenser: Do you know about the Cyberbot project? [22:33:31] Nope, been working Dab solver [22:35:34] It's a bot that goes through and replaces dead links with archive links in the Wayback Machine (when they exist). I'm not if it has any kind of bad archive detection though (detecting when the wayback machine has archived an error page instead of a content page). I was wondering if you had the code for Checklinks available anywhere, so that others could [22:35:34] learn from the work you did on that. [22:35:55] I'm not = I'm not sure [22:36:15] The most useful stuff is in Reflinks already [22:36:28] or even borrow some of the logic [22:37:10] Dispenser: Oh, I didn't realize that reflinks did that too [22:37:49] chasemp: I can. Gimme a sec. [22:37:56] I've told him this before. My Schedule is suppose to be GeoHack, Telnet Wikipedia, then fixing our reference problems [22:37:59] Coren: tx no hurry [22:38:15] Dispenser: Told who? [22:38:28] Cyberman was it [22:38:44] chasemp: No can root. Box hosed. 
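A sketch of the "real ENC backed by MySQL" idea floated above, purely illustrative: the enc_db database and node_roles table are invented names, and the contract assumed (FQDN argument in, YAML out, non-zero exit for an unknown node) is the same exec-style hook that ldap-yaml-enc.py fills today.
    #!/bin/bash
    # Hypothetical MySQL-backed ENC sketch; schema names are made up for the example.
    fqdn="$1"
    [ -n "$fqdn" ] || exit 1
    roles=$(mysql -N -B -e "SELECT role FROM node_roles WHERE fqdn = '${fqdn}'" enc_db) || exit 1
    [ -n "$roles" ] || exit 1
    echo "classes:"
    for r in $roles; do
      echo "  - ${r}"
    done
    echo "parameters:"
    echo "  labsproject: $(echo "$fqdn" | cut -d. -f2)"   # instance.project.eqiad.wmflabs naming convention
Wiring it in would presumably just be a matter of pointing node_terminus = exec and external_nodes at the script in the master's puppet.conf, in place of the current ldap terminus.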
[22:38:48] kk [22:38:50] oh, guess he went ahead anyway [22:39:05] https://en.wikipedia.org/wiki/User_talk:Dispenser#Revisiting_dead_site_detection [22:39:50] Anyway running a bit behind my goals, so Probably be doing Telnet Wikipedia so it ready for the 15th anniversary [22:40:42] BTW, would WMF be willing to get VoIP provider (If completed) [22:41:03] Dispenser: Regardless, it would be great if the code you used for checking wayback archives were available somewhere, so that people don't re-invent the wheel (and probably poorly). [22:41:41] Dispenser: no idea about that one :) [22:43:35] andrewbogott: chasemp: Just put a list of still-talks-to-old-ldap at the bottom of the etherpad [22:43:35] My first major project. I know I could do a better job with 24 TB. :-) The real value is in knowing edge those hundreds of hours of testing [22:44:19] Coren: andrewbogott YuviPanda funny suggestino, what would you say to agreeing we all get pinged on 'l-team' [22:44:37] chasemp: Not insane. [22:44:37] to avoid the constant 3 person autocomplete [22:44:50] * Coren adds the stalk [22:44:51] analytics seems to use a-team and it works well [22:45:02] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [22:45:05] what? [22:45:07] oh :) [22:45:17] ha sorry milimetric :p [22:45:18] * milimetric is proof positive [22:45:33] stealing yer good ideas mate [22:46:30] If I can remember wth that setting /is/ :-) [22:46:58] could also do lteam if there is a pref [22:47:03] the pinky reach and all :) [22:47:26] lteam seems nicer [22:47:39] chasemp, YuviPanda, sorry, irccloud is acting up again [22:48:47] chasemp: I'm stalking both now. [22:49:17] andrewbogott_: Only bit you missed from me maybe is "list of instances still talking to old ldap on etherpad at bottom" [22:50:08] Dispenser: I tried running Alabaster through Reflinks (it has two raw links as references and both are dead), but it said that no changes were needed. [22:50:50] It doesn't tag them unless they're in a database that hasn't been updated in a LONG TIME [22:51:16] I see [22:51:51] I'll get to, but its a really big project that WMF hasn't committed any resources to [22:55:36] Dispenser: Anyway, what do you think about the idea of posting your archive checking code somewhere public so that others can build off it? I'm not trying to pressure you, just letting you know it would probably be useful to other folks. [22:56:22] 6Labs, 10Tool-Labs: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1867509 (10Aklapper) Please elaborate why this is urgent for the general audience. [22:56:50] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1867513 (10Aklapper) Please elaborate why this is urgent for the general audience. [22:57:00] Can't. First project. Lifted some materials from tutorials and I don't have the copyright for that [22:57:14] remember to not run non OSS things on labs. [22:57:41] andrewbogott_: chasemp Coren sounds good! I'll try to figure out how to configure weechat for that [22:58:07] Dispenser: ah, understood :) [22:58:39] YuviPanda: Wouldn't that include downloading random shit off the internet? Like what a link checker does? 
[22:59:57] Dispenser: Perhaps I could help you rewrite it at some point [23:00:03] only if it stores it forever afterward [23:00:05] *afterwards [23:00:19] which is why an archival things similar to IA would have problems but a link *checker* will not [23:00:55] same way using IE to access tools is not a violation, but installing proprietary software in tools itself is [23:03:16] sorry I opened a can of worms. just pretend I wasn't here :) [23:04:05] * YuviPanda runs away from worms too [23:05:09] kaldari: The whole thing needs to be rewritten. The value, again, is in knowing those edge cases and having a way to work around it. Now if WMF provides the right incentives I *may* change my schedule, but its looking like I can start in February or March. Now I need go learn wordpress to build an e-commerce store. [23:14:47] chasemp, YuviPanda: https://gerrit.wikimedia.org/r/#/c/258051 [23:16:50] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1867587 (10doctaxon) Because php 5 won't be maintenanced next year any more. And why should we work with an old php version? [23:20:23] andrewbogott_: will this only set teh fact on vm's then? [23:20:33] what happens on a hw box like promethium? [23:21:03] chasemp: on promethium it’s set in hiera so we never hit the fact [23:21:12] if it wasn’t set in hiera it would error out [23:21:39] I guess I'm saying, it would be nice if the fact is still set for project on the box? [23:21:46] even tho we don't then go and use it for this [23:22:16] maybe it's inconsequential [23:22:16] how would that be useful? [23:22:21] 6Labs, 10Tool-Labs: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1867609 (10doctaxon) Because Java 7 isn't supported by Oracle any more, not public way. https://www.java.com/de/download/faq/java_7.xml Are security updates unsupported, too? [23:22:30] easy way to look at a promethium and see what project it's been allocated to [23:22:43] hm [23:22:51] I don’t think facts can look at hiera [23:22:54] also makes debugging easier [23:22:56] ok [23:22:57] so I don’t know how the fact would know [23:23:01] unless we added a whole other level [23:23:06] problem for another day then [23:23:08] I can accept that [23:23:16] like injecting into metadata, or hardcoding in ruby, or adding a lookup table that the fact looks in etc. [23:23:49] I don't like that realm isn't a fact etc etc just makes things more difficult but hey no need to muck this up [23:23:50] just asking [23:24:03] yeah... [23:24:14] facts are never as useful as I want them to be [23:24:20] since they can’t see each other or hiera [23:24:29] right we need to bridge that [23:24:40] right now it's a bizarre disconnect [23:24:44] anyways [23:25:04] Mostly it’s puppetlabs trying to solve the same problem over and over in disjoint ways [23:25:24] well we don't use mcollective so we kind of treat facts as second class citizens [23:25:28] like, there is a grain for realm [23:25:28] etc [23:25:35] imo [23:30:07] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1867622 (10Krenair) 5.5 should get security fixes through to 2017-08-28. [23:33:08] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1867633 (10doctaxon) What are the contraindications for an actual version? [23:33:45] chasemp: do you have other thoughts about how to get the right project into the fact? Or shall I merge as is? 
[23:34:21] nothing I want to get into atm if we don't have a clear way, seems ok to me man [23:37:39] mind if I try a puppet run on promethium to see how that affects things? [23:37:49] sure give me a sec tho [23:38:53] andrewbogott_: yep can enable and run [23:41:15] hm, the hiera lookup is failing it seems [23:42:15] andrewbogott_: it's using the labs master [23:42:22] and the labs master has no top level host lookup [23:42:35] it's a bit circular atm actually :) [23:42:36] - "labs/%{::labsproject}/host/%{::hostname}" [23:44:42] we can add "labs/hosts/%{::hostname}" [23:44:43] should be a noop for existing things assuming no one has defined anything whacky [23:45:09] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [23:45:10] So we have no solution to hosts with broken ssh? [23:45:59] chasemp: is that just a matter of creating that subdir, or does that have to be configured someplace as well? [23:46:03] Krenair: can you be more specific? [23:46:21] Krenair: we’ve all three been busy fixing such hosts all day [23:46:21] andrewbogott_: it needs to exist in teh lookup tree in modules/puppetmaster/files/labs.hiera.yaml [23:46:35] if you have one in particulary you care about I can look at it now [23:46:40] if no one can log into an instance as normal users or as root, there's no way to get in at all? [23:47:01] Krenair: that’s most likely true [23:47:11] but also it’s unusual for there not to be root keys [23:47:22] to be fair if you decide on a self hosted puppet master it requires more diligence and it's not something we can fix [23:47:23] some labs instances are older than chase, in which case he doesn’t have one [23:48:00] I figured you would have some way in without relying on ssh [23:49:00] not really [23:49:13] there are some desperate things we could do with VNC and such but none of that is likely to work in retrospect [23:49:21] techically kvm allows mounting console via vnc [23:49:47] yeah, and I can also stop an instance and mount its drive and mess with that [23:49:55] for a vm messed up from a self hosted puppet master the best I would say is to blow and recreate [23:50:09] oh, am I back here now? [23:50:28] * andrewbogott pleased to be self again [23:50:59] Krenair: but, again, I’m happy to look if you have a specific case in mind [23:51:21] thcipriani already deleted it instead [23:51:31] >!log delete deployment-kafka03 doesn't seem to be in-use yet and cannot be accessed via salt or ssh by root or anyone [23:51:32] I'm just surprised you don't have a way to do it [23:52:04] so andrewbogott you have to add the lookup path to the hiera yaml file there [23:52:13] if you add it at the top it'll work how you want I think [23:52:21] but I gotta run here pretty quick [23:52:30] I have a few things figured out elsewise [23:52:38] it looked like that one was an instance that was created yesterday, no roles or configuration as near as I could tell, I think deleting it was fine. [23:52:40] I think I got the sudo thing going but want to relook on the morrow [23:55:04] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1867758 (10EBernhardson) What version of php is currently exposed? Is it Zend or HHVM?
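A hedged sketch of sanity-checking a new hiera lookup path like "labs/hosts/%{::hostname}" from the shell before merging, using the standalone hiera CLI; the config path and the "classes" key are placeholders for illustration, not taken from this log:
    hiera -c /etc/puppet/hiera.yaml classes ::hostname=promethium ::labsproject=testlabs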