[00:29:11] 3Wikimedia Labs / 3Infrastructure: Database upgrade MariaDB 10: Lock wait timeouts / deadlocks in a row - 10https://bugzilla.wikimedia.org/68753#c5 (10Sean Pringle) (In reply to Tim Landscheidt from comment #2) > (In reply to Sean Pringle from comment #1) > > [...] > > > Today we are trialing READ-COMMITTED... [00:33:52] 3Wikimedia Labs / 3Infrastructure: Database upgrade MariaDB 10: Lock wait timeouts / deadlocks in a row - 10https://bugzilla.wikimedia.org/68753#c6 (10Sean Pringle) (In reply to metatron from comment #4) > Though I didn't change anything, the massive lock wait timeouts or deadlocks > disappeared. Right now... [00:37:22] 3Wikimedia Labs / 3wikitech-interface: Special:NovaRole - sort members alphabetically in Add member dialog - 10https://bugzilla.wikimedia.org/36516 (10Daniel Zahn) 5NEW>3RESO/FIX [00:39:52] 3Wikimedia Labs / 3Infrastructure: autostart / puppetize snmptrapd on labs nagios - 10https://bugzilla.wikimedia.org/36470#c1 (10Daniel Zahn) meanwhile this raises the questions: - does labs nagios still exist - will it exist again - if so, how is the puppet freshness monitoring - do we use snmp traps for it [00:46:53] YuviPanda: What be thy question? [00:47:01] Coren: 'tis ok, found the answer :) [00:47:04] https://bugzilla.wikimedia.org/show_bug.cgi?id=68818 btw [00:47:11] * YuviPanda clicks [00:47:25] That's what we discussed earlier with betacommand [00:47:43] Coren: ah, ok. I responded on the RT for graphite, btw - needs packaging of python-txstatsd. It should be a trivial repackage to build from precise to trusty, tho [00:47:58] Coren: reading through it now [00:48:02] YuviPanda: What was your issue? Is this something that should be documented? [00:48:41] Coren: no, it was mostly a brainfart. Was wondering if we have an easy way of sharing files between different instances in the same project, and then realized that /data/project exists [00:48:53] in the gap between (1) and (2) I had pinged you :) [00:49:00] Heh.
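The READ-COMMITTED trial mentioned in the bug 68753 updates above can be reproduced on any MariaDB 10 session. A minimal sketch, using stock MariaDB statements rather than the actual labsdb configuration:

```sql
-- Hedged sketch: switching to READ COMMITTED to reduce gap-lock
-- contention, as in the bug 68753 trial. Session level first:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Server-wide (applies to new connections only; existing sessions
-- keep their previous level):
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Verify the current session's level:
SELECT @@tx_isolation;
```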
Which is kinda the point. :-) [00:49:06] indeed [00:49:55] Coren: do you think you'll have some time this week to wrangle up the missing package? [00:50:44] YuviPanda: Possibly, though when the effing disks finish wiping on labstore1003 that immediately becomes my absolute priority. [00:50:49] Coren: hmm, also, do we have a 'query killer' in labsdb? [00:51:18] Coren: yeah, that's ok :) I suspect with wikimania travel coming up, I won't be able to properly focus on this until after wikimania anyway [00:51:53] YuviPanda: It's called Sean Pringle. :-) [00:52:06] Coren: heh :D and I guess it doesn't get called often... :) [00:52:10] considering he hasn't automated that [00:52:37] No, we're considering it but I keep to my philosophy of no hard limits unless they become absolutely required. [00:52:45] Coren: yeah, makes sense. [00:53:11] I do need to sit down and talk improvement with the catscan2 people though. [00:53:21] heh, hopefully at wikimania [00:53:23] * Coren should look up if any of them will be around the Hackathon [00:53:52] Coren: I'm working on something with the analytics / research folks now, should be able to show it off at wikimania! [00:54:00] Coren: will the labsdbs be back on ssds by then? [00:54:00] Coolness! [00:54:52] YuviPanda: I wouldn't count on it; Sean is likely going to concentrate first on converting the other databases. [00:55:02] Coren: hmm, right. shouldn't be too bad either way. [00:56:30] Coren: oh, also, by default I can't seem to run 'sudo -u ' [00:56:36] Coren: is there something I can nudge to make that happen? [00:56:42] right now I've to first 'sudo -s' and then 'sudo -u' [00:57:13] YuviPanda: The default rule allows sudo to root, not sudo to any. You can edit it though. [00:57:25] Coren: hmm, can I do that with Nova:Sudoers?
[00:57:28] * YuviPanda checks [00:58:15] Coren: Wikitech is *really* slow, 30s+ for any of the Nova pages :( [00:58:50] YuviPanda: I've got a host in beta with a django app that I'd like to setup outbound mail from. Should I install postfix or something to act as my MTA or is there one in labs that I can point at? [00:59:02] It's fast and snappy for me, even instance lists on tools which is as heavy as it gets. [00:59:18] bd808: you probably should just install your own thing. toollabs has an instance setup, but that's probably not accessible to you [00:59:28] coolio [01:01:16] Coren: hmm, I modified it on Special:Sudoers to have 'Allow running as:' ' All project members', and I still can't sudo -u [01:01:24] Coren: I'm trying to sudo -u as a system user tho [01:01:32] Coren: should I just drop a sudoers file? [01:01:50] Coren: woah, after ticking that box I can't sudo as *anyone* [01:01:59] including root [01:02:12] o_O [01:02:22] Coren: and unticking that lets me sudo as root [01:02:51] That ldap sudo thing is teh suk. Lemme go check for a sec. [01:03:20] Coren: ok [01:03:24] Coren: this was project 'quarry' [01:03:29] Coren: I can turn it back on if you want [01:06:19] YuviPanda: As far as I can tell, I see no reason why that rule wouldn't work. [01:06:28] * Coren boggles a little. [01:06:35] Coren: ugh. [01:06:48] Lemme go try to debug. What instance is this? [01:06:56] Coren: quarry-web-test [01:07:02] Coren: project: quarry [01:08:54] Coren: long rant posted to the bug [01:09:26] Neither meta nor wikitech should have labs docs [01:10:17] I agree as well, though I understand the "Ugh, not *another* wiki" reaction. [01:10:50] YuviPanda: ... I have no difficulty sudoing to random users. [01:10:59] YuviPanda: What's your issue, exactly?
[01:11:11] Coren: sudo -u quarry touch hi [01:11:14] Coren: that asks me for a password [01:11:38] marc@quarry-web-test:~$ sudo -u quarry id [01:11:38] uid=997(quarry) gid=997(quarry) groups=997(quarry) [01:12:12] Coren: I just tried the exact same thing, and got a password prompt :| [01:12:21] Coren: can you sudo -u yuvipanda -s, and then try it? [01:12:23] Log off and on? You might be missing a crucial group membership? [01:12:39] Coren: I have before, trying again [01:12:47] Coren: yup, still the same [01:12:57] Interesting. Your account seems to suck. :-) [01:13:04] YuviPanda: you are completely closing the ssh connection? [01:13:18] YuviPanda: Wait are you using mosh? [01:13:21] Coren: no, ssh [01:13:40] Coren: if you sudo to be me and try it, does it work? [01:13:57] No. I can't seem to figure out why. [01:14:00] Coren: I remember others running into the same issue with vagrant, and having to do a 'sudo -s' first before doing a sudo -u vagrant [01:14:30] Coren: maybe you're logging in with your root key and hence it works? [01:14:34] I do everything on my labs-vagrant hosts as root, but I'm lazy [01:14:42] sudo don't care 'bout no key [01:16:51] ah-ha! [01:17:15] There's an extra fun rule coming from puppet that %ops ALL=(ALL) NOPASSWD: ALL [01:17:21] aah [01:17:25] * Coren tests moar things. [01:17:32] that's probably why you can :) [01:17:39] and I'm not part of the ops group yet [01:17:42] That explains why it works for me, doesn't explain why it /doesn't/ work for you. [01:19:10] Coren: right [01:22:48] * Coren keeps investigating. [01:22:54] Coren: I think I put it fairly elegantly https://bugzilla.wikimedia.org/show_bug.cgi?id=68818#c8 [01:23:07] I've also added two comments around yours. [01:24:06] Coren: you might point out that by separating /creating a new wiki it makes cleaning up the disaster a lot easier [01:24:07] YuviPanda: Conceptual error in the sudoers interface - it clearly has no correct way to state ALL as the target from the web thingy.
(Bug) [01:24:34] Betacommand: I think that's clearly implied by "you're not going to clean the mess up by mixing in another set of docs" :-) [01:24:36] Coren: hmm, right. so there would need to be a wikitech patch that lets you add 'ALL' and not just 'All other project users' [01:25:05] YuviPanda: Yes, because 'project users' simply equates to '%project-whatever' (i.e., other members of that group) [01:25:10] ah [01:25:11] right [01:25:12] bah [01:25:28] YuviPanda: In the meantime, you can always add a file in /etc/sudoers.d [01:25:38] But that's sorta sucky. [01:26:23] Coren: yeah, that's what I'm doing now [01:26:36] Coren: can you file a bug with your findings? I'm going to crasassshhh now [01:27:20] YuviPanda: KK, will do. [01:29:29] <^demon|brb> I should finish interwiki searching. [01:29:37] <^demon|brb> So I can plug all the tech wikis into each other. [01:29:44] <^demon|brb> Then you can make as many as you want Coren! [01:29:46] <^demon|brb> :) [01:29:56] 3Wikimedia Labs / 3wikitech-interface: Sudoers interface should provide an option for ALL - 10https://bugzilla.wikimedia.org/68834 (10Marc A. Pelletier) 3NEW p:3Unprio s:3normal a:3None Currently, the sudoers per-project interface allows creation of sudo rules with individual users or "all project us... [01:30:08] 3Wikimedia Labs / 3wikitech-interface: Sudoers interface should provide an option for ALL - 10https://bugzilla.wikimedia.org/68834 (10Marc A. Pelletier) a:3Andrew Bogott [01:30:17] ^demon|brb: That'd make things _worse_ if that's on by default! :-) [01:30:42] <^demon|brb> You haven't even used the feature yet! [01:30:56] Right now, we're suffering from a needle-in-haystack problem. No, more precisely, we're suffering from a needle-in-haystackS problem. :-) [01:31:21] <^demon|brb> So you want another haystack :) [01:31:33] I'm proposing neatly sorted stacks of hay! [01:31:42] <^demon|brb> lol, sorted. [01:32:02] Well yeah. The problem that info is sorta-spread out and mixed up is real. 
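The /etc/sudoers.d workaround discussed above (pending bug 68834) could look like the following. The file name is illustrative, not what the quarry project actually deployed; plain sudoers syntax, unlike the web interface, can express ALL as the run-as target:

```
# /etc/sudoers.d/allow-run-as-any  (illustrative name; validate with
# `visudo -cf <file>` and install with mode 0440)
# Let members of the project group run commands as ANY user,
# including system users such as 'quarry':
%project-quarry ALL = (ALL) NOPASSWD: ALL
```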
[01:32:24] <^demon|brb> See I'd rather throw all the needles in one big haystack [01:32:42] <^demon|brb> But give yourself tools to look through the haystack effectively. [01:33:27] ^demon|brb: That's a Hard Problem. Google is throwing literally billions at it, and is far from perfect at the best of times. :-) [01:34:04] <^demon|brb> Anyway, I'm clearly losing the battle. [01:35:01] No, no - it's not a battle. If nothing else, having someone clearly offer the opposite perspective is a good thing. I just think that you overestimate how easy it is for the typical labs user to find their way around the documentation. [01:36:02] We're suffering from tl;dr in a desperate attempt to cover everything in a way that's findable, and much of what could be useful can only be found with difficulty when you know what you are looking for and can recognize context. [01:36:24] Cross-wiki search would be GRAND when you don't know where something is. [01:37:18] But when you have a fairly small focus and a well-defined audience, collecting everything together for them is sometimes the best solution. Couple that with the community dynamics that come from a wiki, and I'm sold. Doesn't mean everyone needs to agree. :-) [01:38:08] In fact, it's the WMF. If everyone agrees unanimously on something, then you should start running for the hills or something. :-P [01:38:29] <^demon|brb> :) [01:40:08] (Also, being able to do main page announcements, put the must reads in mainspace, etc, would make the community management aspects of my job much easier - and that also affects my opinion) :-) [01:42:12] <^demon|brb> So is this really moving labs to its own wiki or moving wikitech back off? [01:42:23] <^demon|brb> For me, it makes sense to have the labsconsoley features on the labs wiki. [01:44:36] To a point it makes sense, but I don't think there's any need to combine the two; interwiki links work well at worst.
I would really like Labs on its own so that it can be a simple, content-oriented wiki with a narrow focus. [01:44:52] I could be convinced either way, though. [01:45:01] <^demon|brb> See, *that* would really confuse me. [01:45:06] Would it? [01:45:21] <^demon|brb> I mean I guess not because I was in the room when we discussed it. [01:45:25] There's a secondary point though, which you might not be aware of: [01:45:38] <^demon|brb> But it's weird. I'd put all the labs stuff on the same wiki and keep production stuff separate. [01:46:00] <^demon|brb> (Also useful if we end up using some version of virtualization for prod services, which might need its own console) [01:46:20] revamping the labs interface is on our goals, and it's entirely possible we axe the openstackmanager extension entirely. [01:46:28] YES PLEASE LET US KILL OSM [01:46:57] <^demon|brb> Anyway, just thinking aloud. [01:47:08] Also, I'd *really* want to make the prod virtualization staff to be at arms' length from the labs stuff. Having both on Wikitech would confuse the hell out of things. [01:47:15] s/staff/stuff/ [01:48:36] ^demon|brb: So, likely, the technical aspects of wikitech will go away entirely and all that'd be left is an odd mix of labs and prod stuff. :-) [01:49:16] <^demon|brb> Coren: No, not both on wikitech, that wasn't my idea. [01:49:32] <^demon|brb> My idea was wikitech would have prod + prod virtualization, labswiki would have labs shit + labsconsole. [01:49:50] <^demon|brb> But if OSM goes away it doesn't matter, like you said. [01:50:24] ^demon|brb: That'd make sense to me; and I'd go that way were it not for the fact that EEEW another non-prod wiki to maintain. So if OSM goes away we are dancing. :-) [01:50:42] <^demon|brb> I'm not convinced it has to be so un-prod. [01:50:53] <^demon|brb> We could fix that problem, and I'd be more than willing to help doing so. [01:51:02] Because [bleep] SMW right now it's a mess. [01:51:09] <^demon|brb> We can fix that. 
[01:52:27] <^demon|brb> Hmm, I'm going to play with something. I'm not convinced we have to stay on $massivelyAncientSMWVersion [01:52:35] <^demon|brb> Well, I shall eat and then play with it. [01:52:35] We can, but I would wait until we figure out what we do with OSM first. [01:52:36] <^demon|brb> Hmm [01:52:47] ^demon|brb: The problem is composer. [01:53:13] <^demon|brb> Oh that's easily worked around at branch-creation time. [01:53:40] * ^demon|brb shall look at this some [01:53:52] <^demon|brb> The not-updating-labsconsole-ever annoys the crap out of me. [01:56:29] Well, we just /did/ upgrade wikitech with a couple of kinks. [01:57:30] <^demon|brb> To be fair, you guys are updating labs more often than we updated prod when we were still on SVN :) [01:57:38] <^demon|brb> Long as you do it every 9 months or less ;-) [02:03:50] FYI - this merged to puppet this evening: https://gerrit.wikimedia.org/r/#/c/149068/ [02:04:14] it will break varnishes that are older than the latest 3.0.5 packages in our repo 3.0.5-plus~x-wm7 [02:04:40] I believe I upgraded the betalabs varnish caches, but if there are other one-off instances out there running varnish and pulling from our puppet config somehow [02:04:52] the fix is to upgrade the varnish package [02:41:54] Yoohoo! [02:42:19] I'm suddenly having problems accessing the labsdb servers from my non-tools instance [02:42:34] I have the whole setup, with /etc/hosts file and iptables nat [02:42:39] did anything change lately? [02:42:46] lately being within the last month [03:04:31] Ok, I see there was a change [03:04:40] unfortunately I now have bigger problems [03:04:44] nslcd: failed to bind to LDAP server ldap://127.0.0.1/: Can't contact LDAP server: Transport endpoint is not connected [03:04:57] urgh [03:05:05] I cannot log onto that machine [03:05:22] it does not connect to LDAP after an apt upgrade :-( [03:05:40] andrewbogott_afk, any idea?
[03:14:13] hmm [03:14:24] maybe it just needed a puppet run [04:05:22] 3Wikimedia Labs / 3wikitech-interface: Sudoers interface should provide an option for ALL - 10https://bugzilla.wikimedia.org/68834#c1 (10Andrew Bogott) What is an example of a user who is a member of ALL yet not a member of 'all project users'? [09:05:50] !log deployment-prep Beta scap script broken since 6:30am UTC https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [09:05:52] Logged the message, Master [09:22:26] hashar: beta labs isn't responding, not for http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page , nor api.php, load.php . Nothing in deployment-bastion:/data/project/logs since 08:17 [09:22:50] spagewmf: yeah hhvm must be deadlocked [09:23:12] !log deployment-prep restarting hhvm on mediawiki 01/02 [09:23:14] Logged the message, Master [09:23:21] 3Wikimedia Labs / 3wikitech-interface: Sudoers interface should provide an option for ALL - 10https://bugzilla.wikimedia.org/68834#c2 (10Yuvi Panda) All system users? [09:23:26] right, I guess all of them go through it. Is there a non-HHVM URL to check? 
[09:23:34] nop [09:23:42] we are fully on hhvm [09:23:58] to gather as many crashes as possible while there is someone from Facebook working at the wmf office :D [09:24:01] it is annoying :-( [09:24:25] hashar: maybe there's a /images/websitelogo.png or something one can request from apache [09:24:35] thanks for restarting [09:24:59] !log deployment-prep rebooting deployment-mediawiki01 hhvm process went zombie [09:25:02] Logged the message, Master [09:27:33] !log deployment-prep manually started hhvm on mediawiki01 [09:27:35] Logged the message, Master [09:27:36] spagewmf: should be good now [09:29:15] !log deployment-prep Rebooting apache01/02 to see whether it fix the ssh connection issue [09:29:17] Logged the message, Master [09:40:48] !log deployment-prep restoring on puppetmaster modules/mediawiki/templates/apache/apache2.conf.erb which got deleted somehow [09:40:50] Logged the message, Master [09:42:27] !log deployment-prep bastion had broken puppet because deployment_server and zuul both declare the same python packages {{gerrit|150501}} [09:42:29] Logged the message, Master [09:59:06] 3Wikimedia Labs / 3deployment-prep (beta): beta labs not responding; API shows 503 from varnish - 10https://bugzilla.wikimedia.org/68574#c5 (10Andre Klapper) So is this an upstream issue resembling https://github.com/facebook/hhvm/issues/2531 ? Or do we (Wikimedia) plan to investigate a workaround/fix oursel... [10:13:51] 3Wikimedia Labs / 3deployment-prep (beta): beta labs not responding; API shows 503 from varnish - 10https://bugzilla.wikimedia.org/68574#c6 (10Antoine "hashar" Musso) We talked about the issue during the RelEng/QA weekly checkin. There is an engineer of Facebook in WMF office for a month and the HHVM folks... [12:30:50] andrewbogott_afk: Coren instances in the same project should be able to see all ports on other instances in same project, right?
that's not happening in quarry [12:31:51] andrewbogott_afk: Coren oh, hmm, weirder, it kinda seems to be happening [12:31:53] * YuviPanda checks harder [12:39:46] Coren: error reason 1: can't get password entry for user "tools.liangent-php". Either the user does not exist or NIS error! [12:41:30] and this blocks future runs, because a job in error state is left there [13:12:42] When attempting to login: /usr/bin/mosh: connect to host 208.80.155.132 port 22: Connection timed out [13:12:52] anyone else having this problems [13:13:03] *these [13:18:43] liangent: Hrm. Looks like there was an LDAP hiccup. [13:27:24] rillke: Works for me without issue. Can you try a traceroute and see where it ends? [13:31:51] 3Wikimedia Labs / 3wikitech-interface: Sudoers interface should provide an option for ALL - 10https://bugzilla.wikimedia.org/68834#c3 (10Marc A. Pelletier) Also, 'all project users' excludes root. :-) [13:37:39] oh, it was just my VM -- rebooted and everything is okay -- sorry for disrupting [13:39:12] rillke: No worries. Tech happens. :-) [13:41:24] liangent: bigbrother not overriding errored out jobs is by design (because an error state normally means the job /cannot/ run), but it seems like it should send an email to the maintainer(s) then. [14:14:43] YuviPanda: some time last night you said "roan broke wikitech, see -operations" can you clarify? [14:14:59] I thought I talked to roan about that but what we discussed doesn't fit with what bryan was seeing... [14:15:28] andrewbogott: he was investigating a VE error, and discovered that the submodule update for git was in a weird state, so did updates, which somehow broke all of it with a PHP error, and then he fixed it again [14:16:21] 'did updates' on wikitech? [14:16:23] !log deployment-prep rebooting hhvm [14:16:25] Logged the message, Master [14:16:36] …because roan has root? [14:16:39] andrewbogott: yeah, submodule update?
it was in -operations around the same time [14:16:52] ok, I'll look at backscroll [14:18:58] Coren: do you need those jobs for diagnosis, or I'll qdel them? [14:19:02] andrewbogott: also, another module for a labs project :) https://gerrit.wikimedia.org/r/150425 [14:19:10] omg, --recursive? [14:19:43] andrewbogott: only inside the VE module, I think? [14:19:52] anyway it happened several times to me when a job in error state blocks next runs [14:19:58] so any solution? [14:20:53] eh and this one of my jobs is not using bigbrother (actually I haven't read its documentations yet). this is using a crontab entry with -once [14:31:29] YuviPanda: system users are typically puppetized, right? [14:31:53] Or are there situations where they derive from ldap and don't have a puppet entry? [14:31:54] andrewbogott: to the extent that labs systems are puppetized :) [14:32:09] andrewbogott: vagrant is a user on ldap, for example, and is considered a system user [14:32:52] andrewbogott: and some are just... there. Installing exim sets up an Exim4-debian user, I think, by default [14:32:57] I don't think that's specified in puppet anywhere [14:33:21] 3Wikimedia Labs / 3wikitech-interface: Sudoers interface should provide an option for ALL - 10https://bugzilla.wikimedia.org/68834#c4 (10Andrew Bogott) Having sudo policies in ldap doesn't preclude setting up sudo policies directly on the box... To the extent that system users are puppetized, it seems like... [14:33:56] andrewbogott: 'root' is also a system user, so if we tick that box that means you can't actually sudo even to root [14:34:20] ah, we're talking about 'sudo as' rather than who has sudo rights... [14:34:22] This makes more sense now! [14:34:41] andrewbogott: ah, yeah, that's what it was about :) [14:35:06] 3Wikimedia Labs / 3wikitech-interface: Sudoers interface should provide an option for ALL - 10https://bugzilla.wikimedia.org/68834#c5 (10Andrew Bogott) Oh, my mistake, I misunderstood what we meant by 'target' here. 
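For the gridengine jobs stuck in an error state mentioned above (the "qdel them?" exchange), the usual triage sketch is below; the job ID is illustrative, and `qmod -cj` clears the error flag so the scheduler retries the job instead of deleting it:

```console
$ qstat -j 12345     # the "error reason" line shows why the job went into Eqw
$ qmod -cj 12345     # clear the job's error state so it can be rescheduled
$ qdel 12345         # or, as discussed above, delete the stuck job outright
```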
This makes sense after all :) [14:44:01] andrewbogott: also https://gerrit.wikimedia.org/r/#/c/150425/ when you've the time [14:49:32] !log deployment-prep Started apache2 service on deployment-mediawiki01 [14:49:35] Logged the message, Master [15:05:54] !log deployment-prep Two cherry-picks in puppet conflicting with merged production changes: I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 and Iac547efa83cf059a1276b6e279c3ebd4c7224b2c (ori, twentyafterfour) [15:05:56] Logged the message, Master [15:22:58] !log deployment-prep Removed cherry-pick for Iac547efa83cf059a1276b6e279c3ebd4c7224b2c and updated cherry-pick for I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 to latest patch set. [15:23:01] Logged the message, Master [15:31:25] hashar: _joe_ is working on a data collection script that we can use when hhvm is hung up in beta. I'll email you the details on it (or he will) when it's ready. [15:33:41] bd808: do we have any clue what is causing the issue? [15:33:46] I thought about crontabbing a restart :] [15:34:52] hashar: hhvm is getting stuck internally in a spinlock. Requests seem to be handled but in the most degenerate way with no output back to the requesting connection. [15:35:16] sounds fun [15:35:27] I don't really know more than that yet. We had some gdb output from it yesterday for Brett to look at. [15:36:13] bd808: also I got your mail about sentry [15:36:29] I attended a python meetup yesterday and someone showed me sentry [15:36:38] ok cool [15:36:52] I was going to write an email to you about it when I have seen the mail stating it got deployed overnight [15:36:54] +3 :-] [15:37:01] how does it connect with logstash?
[15:37:11] right now it doesn't [15:37:17] but there is hope [15:37:32] There is a plugin for the commercial version https://gist.github.com/clarkdave/edaab9be9eaa9bf1ee5f [15:37:52] We should be able to adapt that for the foss version [15:38:07] REally it should just take changing the target url [15:38:56] ori is working on something to collect hhvm crash stacktraces and push them into sentry [15:39:32] That was our primary use case, but getting it to work for prod php errors would be a sweet add-on [15:39:56] definitely [15:40:05] I'm trying to figure out how ssh for scap got broke in beta now [15:40:11] ah yeah [15:40:19] I forgot to poke labs ops about it sorry [15:40:20] noticed that [15:40:22] The puppet master was in a bad state so that may be the cause [15:40:26] first failure at 6:35am UTC apparently [15:40:33] I can't reproduce by connecting on the instance :-/ [15:40:41] It's in our custom crap so labs ops wouldn't know how to fix probably [15:40:51] s/our/my/ [15:40:55] I noticed a change on deployment-bastion in LDAP config /etc/nscld.conf or something [15:41:08] some other instances are impacted apparently [15:41:23] (hearsay, don't quote me) [15:41:31] ah yeah puppet-compiler02.eqiad.wmflabs [15:45:31] bd808: is the sync to apache from rsync01 ? [15:45:54] oh... it's not just scap. ssh in general on 4 hosts. And they are hosts which apply role::beta::appserver or a variant [15:46:24] and I can't ssh to deployment-apache01.eqiad.wmflabs [15:46:25] hashar: Yeah. rsync01 pulls from bastion and then the other hosts pull from rsync01 [15:46:29] I attempted to reboot it earlier [15:46:45] the mediawiki01 02 are fine though [15:47:04] apache0[12], videoscaler01 and jobrunner01 seem to be the broken hosts [15:47:45] AH [15:47:52] apache01 console yields: [15:47:52] nslcd: failed to bind to LDAP server ldap://127.0.0.1/: Can't contact LDAP server: Transport endpoint is not connected [15:48:16] That would match.
No ldap == no auth in labs [15:48:38] Our puppet master was messed up. Maybe they just need a forced run [15:48:42] I can try that via salt [15:49:34] libpam-ldap got upgraded yesterday [15:49:45] Notice: /Stage[main]/Ldap::Client::Pam/Package[libpam-ldapd]/ensure: ensure changed '0.8.4ubuntu0.2' to '0.8.4ubuntu0.3' [15:49:45] 2014-07-29 16:28:24 apparently [15:49:49] puppet is disabled on apache01 [15:50:04] the upgrade is done automatically [15:50:04] which ... maybe is related? [15:50:12] but then that provides a default /etc/nslcd.conf file [15:50:17] which is then overridden by puppet [15:50:33] and puppet does : http://paste.debian.net/112714/ [15:50:48] so since puppet was/is broken on some instances [15:50:50] the package got upgraded [15:51:01] and the nslcd.conf reverted back to default settings [15:51:05] I wish the upgrade kept our file [15:51:25] my traces come from deployment-bastion in /var/log/puppet.log.1.gz [15:52:19] curious indeed that the upgrade preferred the package's version as opposed to the local version [15:54:06] hashar: _joe_ made us a dump script for hhvm. It's in /root/bt-hhvm on mediawiki02.
[15:54:21] I'll copy it to mediawiki01 as well [15:54:46] godog: Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install libnss-ldapd [15:54:46] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install libpam-ldapd [15:54:58] maybe DPkg::Options::=--force-confold cause it to delete old confs [15:55:08] bd808: or even better puppetize it :-] [15:55:12] hashar: if you have to kick hhvm again for being non-responsive it would be good to run that first to get an idea of what hhvm is doing for Brett and joe [15:55:22] Yeah I was just thinking of that [15:55:32] and push a mail about it on some list [15:55:41] * bd808 adds it to the list of things to do [15:55:45] I am having a 1/1 with greg in 4 minutes then will head back home [15:56:36] hashar: heh that should keep the old file in place, not prefer the package's :( [15:57:56] godog: from dpkg doc that is apparently only used when there is no default action [15:58:14] maybe the packages have a default action of picking the new conf [15:59:03] or it is something entirely different [15:59:27] !log deployment-prep Can't ssh to apache0[12], videoscaler01 and jobrunner01. Puppet not running on any of them. libnss-ldapd unattended update has broken /etc/nslcd.conf [15:59:30] Logged the message, Master [16:00:11] !log deployment-prep Puppet seems manually disabled on apache0[12]. [16:00:11] hashar: I'd be surprised if the default was to install the new conffiles, but it is possible [16:00:13] Logged the message, Master [16:00:51] !log deployment-prep Puppet runs on videoscaler01 and jobrunner01 failing for "Could not find dependency Ferm::Rule[bastion-ssh] for Ferm::Rule[deployment-bastion-scap-ssh]" [16:00:53] Logged the message, Master [16:29:24] and off to home [16:29:38] will do emails tonight [16:43:37] Why does it take 5.5 seconds to query a user's most recent edit? 
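The `--force-confold` behaviour quoted above can also be set as an apt default via a configuration fragment (path illustrative). Note the caveat matching the dpkg documentation discussed here: confold only protects conffiles that dpkg itself tracks, so a file rewritten by puppet after installation may still be replaced on upgrade:

```
// /etc/apt/apt.conf.d/local-force-confold  (illustrative path)
DPkg::Options {
   "--force-confdef";   // take the package default when dpkg has no default action
   "--force-confold";   // otherwise keep the locally modified conffile
};
```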
[16:43:55] !log deployment-prep Fixed ssh to jobrunner01 and videoscaler01 by correcting unrelated puppet manifest problem and forcing run via salt. [16:43:57] Logged the message, Master [16:52:10] !log deployment-prep Fixed beta-scap-eqiad Jenkins job by correcting ssh problems in beta project [16:52:12] Logged the message, Master [17:22:22] 3Wikimedia Labs / 3deployment-prep (beta): Unable to upload new version of images in commons beta lab - 10https://bugzilla.wikimedia.org/68760#c6 (10Greg Grossmeier) cc'ing ori/bryan in case this was an effect of hhvm permission weirdness. [17:53:14] YuviPanda: FYI, updated ticket for labmon1001 [17:55:07] Dispenser: What view are you using? [17:55:18] select * from revision where rev_user_text="01kkk" limit 500; [17:55:35] Dispenser: As the docs say, use revision_userindex. :-) [17:55:36] that's taking forever [17:56:44] MariaDB [enwiki_p]> select * from revision_userindex where rev_user_text="01kkk" limit 1; [17:56:44] 1 row in set (0.02 sec) [17:57:27] yup, much faster [17:57:49] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Tables_for_revision_or_logging_queries_involving_user_names_and_IDs [17:57:59] Now if we could only get phpmyadmin running... [17:58:26] You're confusing "can" and "must not" again. [17:59:13] It's still running on the Toolserver [17:59:34] ... with proper authentication. [18:01:45] I have no particular issue with phpmyadmin (except that it sucks and you're better off with workbench in the first place) beyond the fact that any use of it must require a labs account being properly authenticated. [18:02:59] Shell or wiki account? [18:06:20] Dispenser: The actual database credentials (so that the actual queries can be linked to the right user). Also to attribute any created tables, etc. [18:07:07] Wouldn't there be security concerns with that?
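The timing difference in the revision vs. revision_userindex exchange above follows from how the replica views are defined: the plain `revision` view hides suppressed revisions in a way that defeats the index on `rev_user_text`, while `revision_userindex` filters those rows out and leaves the index usable. A sketch of the fast pattern for "a user's most recent edit", assuming the standard labsdb view names from the wikitech help page linked above:

```sql
-- Fast: revision_userindex keeps the rev_user_text index usable.
SELECT rev_id, rev_timestamp
FROM revision_userindex
WHERE rev_user_text = '01kkk'
ORDER BY rev_timestamp DESC
LIMIT 1;
```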
[18:09:04] 14.5 seconds for 474 users (0.03 sec/user); now I delete those 40 lines for fetching from the API (slowly at 0.41 sec/user) [18:09:10] A good way to do it cleanly is Non-Trivial™. It might be possible to do something properly with OAuth (wikitech is a provider) [18:10:30] I've no objection to working with you to work something out, though because Wikimania my availability is going to be reduced for a couple weeks. [18:10:56] (Unless you are *at* Wikimania in which case my availability switches to "in person") :-) [18:12:05] Coren: Dispenser ah, so about PHPMyadmin [18:12:15] Coren: Dispenser have you seen https://meta.wikimedia.org/wiki/Research:Ideas/Public_query_interface_for_Labs [18:12:26] Coren: Dispenser or the test run at quarry.wmflabs.org [18:12:43] Coren: Dispenser or the puppetization of the entire thing at https://gerrit.wikimedia.org/r/#/c/150425/ :) [18:13:08] Coren: Dispenser I've been working with analytics/research on this for the last two weeks, we're going to release right before research hackathon at wikimania [18:13:19] Coren: and springle is ok with my current safeguards :) [18:13:37] YuviPanda: Yeah, that project sounds good to me too. [18:14:07] Coren: it's pretty much live right now, except for some minor points I've to fix. current install is running from the puppet patch. A merge would be nice :D [18:14:37] Coren: it has a few things to work out (OAuth is in MW.org, although I explicitly link to Labs ToS), and I've to move a google font (OSS font) to our system [18:15:34] Coren: re: ticket for labmon, I updated too. [18:16:58] Coren: Dispenser note that the current link is just a test instance, will go down soon and be replaced with proper merged patches [18:17:15] !ping [18:17:15] !pong [18:17:17] ok [18:17:54] good bye clever oursql batching implementation [18:18:35] Dispenser: hmm?
[18:19:30] deleted workaround code for revision speed issues, mentioned above
[18:20:00] ah right
[19:31:18] !log deployment-prep Disabled puppet on deployment-mediawiki01; Ori will look into hhvm config changes that were being applied
[19:31:21] Logged the message, Master
[19:32:12] !log deployment-prep Disabled puppet on deployment-mediawiki02 for the same reason
[19:32:14] Logged the message, Master
[19:46:42] !log deployment-prep Restored prior /etc/hhvm/php.ini from puppet filebucket archive on deployment-mediawiki0[12]
[19:46:44] Logged the message, Master
[19:50:04] more or less around
[19:58:25] Can we get user_email_authenticated visible on Labs? There shouldn't be any privacy concerns since it's already exposed on the wiki.
[19:59:52] !log deployment-prep Created local commit 7d56b79 in puppet to work around bugs in Ia463120718dceab087ad3f8e3f35917fa879f387
[19:59:54] Logged the message, Master
[20:54:44] Dispenser: probably, file a bug for it?
[20:54:56] I don't file bugs
[20:56:49] Dispenser: where is it exposed on-wiki?
[20:57:12] https://en.wikipedia.org/wiki/Special:AbuseLog/2609524 as user_emailconfirm
[20:57:23] and on their user page of course
[20:57:44] hmm, I feel like I discussed this with Coren at some point.
[20:57:54] https://jira.toolserver.org/browse/TS-935
[20:58:24] There are two parts, user_email_confirmed and up_property="disablemail"
[20:58:27] https://bugzilla.wikimedia.org/show_bug.cgi?id=58196#c7
[21:01:08] Wikimedia Labs / (other): (Tracking) Database replication services - https://bugzilla.wikimedia.org/48930 (Kunal Mehta (Legoktm))
[21:01:10] Wikimedia Labs: Make user_email_authenticated status visible on labs - https://bugzilla.wikimedia.org/68876 (Kunal Mehta (Legoktm)) NEW p:Unprio s:normal a:None [12:58:25] Can we get user_email_authenticated visible on Labs? There shouldn't be any privacy concerns since its a...
[21:01:48] Dispenser: do you still get emails from your toolserver email?
[21:02:00] I don't think so
[21:02:44] hm, well you're still using it in bugzilla :P
[21:22:27] legoktm: What?
[21:22:50] Coren: https://bugzilla.wikimedia.org/68876 nothing urgent though
[21:23:06] I was just thinking that we had talked about this before, and found https://bugzilla.wikimedia.org/show_bug.cgi?id=58196#c7
[21:24:03] Coren: could you take a look at https://bugzilla.wikimedia.org/show_bug.cgi?id=68614 ?
[21:24:38] Coren: whenever you do have some time though, it would be nice if you could take a look at scfc_de's patch to fix some mail weirdness: https://gerrit.wikimedia.org/r/#/c/149316/
[21:24:48] I need to implement a custom timeout for state "handle-request"
[21:25:37] To be honest, I'm not going to have much time to do much until Wikimania. I have a number of ridiculously urgent things to do that keep hitting annoying delays.
[21:25:57] I'll take a look at patches, though, since that's relatively context-free.
[21:26:40] hedonil: Same with yours; I have no objection, but if you have a patch the chances of my being able to deploy it before Wikimania increase greatly.
[21:26:49] Coren: asked the other day but idk if you saw: can jsub/jstart get options to print out the corresponding qsub instead of running qsub?
[21:26:58] are jsub/jstart in git?
[21:27:46] jeremyb: It is; in labs/toollabs
[21:28:37] Coren: Can you point me to the place where those lighty things are defined?
[21:28:54] Nova_Resource namespace is not very useful if it is not searchable by default
[21:28:55] Coren: or is exec_environ ok? https://git.wikimedia.org/blob/operations%2Fpuppet/a47b78eb57bc8498723e47b509c5427e1a40310e/modules%2Ftoollabs%2Fmanifests%2Fexec_environ.pp
[21:29:44] hedonil: Yep, the exec_environ is what you want.
[21:29:53] Coren: 'kk
[21:33:27] Coren: k, thanks
[21:35:16] hedonil: If you strictly need that only on webnodes with lighttpd, you can limit it to operations/puppet:modules/toollabs/manifests/webnode.pp
[21:36:57] scfc_de: I'd rather not have web nodes diverge from the general exec environment.
[21:37:12] scfc_de: Makes it harder to test, or to deploy new kinds of nodes.
[21:39:09] Coren: scfc_de: just thought about webnode.pp. But does it work in exec_environ.pp if no dependencies like lighttpd are defined there?
[21:39:23] Or will it then install lighttpd on all exec nodes?
[21:39:28] Coren: As we don't have lighttpd on general exec nodes, mod_magnet.so would be kinda hard to test there :-)
[21:39:49] Oh duh, that's a mod to lighttpd - my brain was thinking python module.
[21:40:01] Yeah, webnode.pp's the right place for /that/
[21:40:56] Coren: scfc_de: 'k. webnode.pp it is.
[21:45:36] Wikimedia Labs / deployment-prep (beta): Unable to upload new version of images in commons beta lab - https://bugzilla.wikimedia.org/68760#c7 (Bryan Davis) Beta uses the NFS shared directory /data/project/upload7 to store images. I just did a permissions check on this directory for sub-directories th...
[21:55:05] Wikimedia Labs / tools: Mail from the command line appears to ignore the domain name if the local part exists as a local user - https://bugzilla.wikimedia.org/68545 (Kunal Mehta (Legoktm)) PATC>RESO/FIX
[21:55:36] Wikimedia Labs / tools: Mail from the command line appears to ignore the domain name if the local part exists as a local user - https://bugzilla.wikimedia.org/68545#c6 (Kunal Mehta (Legoktm)) RESO/?>VERI tools.legobot@tools-login:~$ echo Test | mail -s Test legoktm@wikimedia.org Email ended up i...
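The conclusion of the exchange above is that a lighttpd-only dependency belongs in webnode.pp, not in the shared exec_environ.pp, so general exec nodes never pull in lighttpd. A hypothetical sketch of what such an addition might look like (the package name and placement are assumptions for illustration, not hedonil's actual patch):

```puppet
# Sketch only: modules/toollabs/manifests/webnode.pp
# Install the lighttpd magnet module on web nodes alone, keeping
# exec_environ.pp (shared by all exec nodes) free of lighttpd deps.
package { 'lighttpd-mod-magnet':    # package name is an assumption
    ensure => present,
}
```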
[21:59:24] Wikimedia Labs / tools: Unable to explain queries on replicated databases - https://bugzilla.wikimedia.org/48875 (Dispenser)
[21:59:24] Tool Labs tools / [other]: Migrate http://toolserver.org/~dispenser/* to Tool Labs - https://bugzilla.wikimedia.org/66868 (Dispenser)
[21:59:24] Wikimedia Labs: Make user_email_authenticated status visible on labs - https://bugzilla.wikimedia.org/68876 (Dispenser)
[22:15:49] hedonil: Whitespace woes, but otherwise okay.
[22:16:11] Coren: yeah, f*ck editor ;)
[22:16:22] Tool Labs tools / [other]: merl tools (tracking) - https://bugzilla.wikimedia.org/67556 (merl)
[22:16:23] Wikimedia Labs / tools: Provide resource for db access in grid - https://bugzilla.wikimedia.org/68881 (merl) NEW p:Unprio s:normal a:Marc A. Pelletier Currently long maintenances are done at mariadb10. Also future updates and so on. Having a sql resource would prevent sge scripts to run while...
[22:17:28] hedonil: Jenkins won't like it until that's fixed. :-)
[22:18:15] * hedonil tries to convince editor to convert tabs on save
[22:22:05] Wikimedia Labs / tools: Unable to explain queries on replicated databases - https://bugzilla.wikimedia.org/48875#c15 (Tim Landscheidt) IIRC, with the move to the MariaDB 10 setup we can now EXPLAIN currently running queries (i.e. long-running queries). Could someone please document how to do that eit...
[22:32:36] hedonil: what editor?
[22:32:49] Coren: it works!! :)
[22:33:25] matanya: nano (with Q enabled)
[22:34:38] Just removed the tabs, but still in patchset :/
[22:35:11] in vim :set et|retab
[22:35:45] :wq
[22:35:51] and push
[22:37:43] hedonil: in your ~/.nanorc:
[22:37:47] set tabsize 4
[22:37:48] set tabstospaces
[22:38:16] * hedonil tries harder..
[22:38:47] right that. thanks legoktm, haven't used nano in years
[22:44:27] Thanks matanya, legoktm. Hmm. In my local file, all tabs are removed, but gerrit seems to see no difference...
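The nano and vim settings quoted above fix the editor going forward; for a file that already contains hard tabs, a one-shot conversion from the command line does the same job. A sketch using the POSIX `expand` utility (the file name and contents are illustrative):

```shell
# Create a sample puppet fragment containing a hard tab.
printf 'class demo {\n\tensure => present,\n}\n' > sample.pp
# Replace tabs with spaces at 4-column stops, matching nano's
# `set tabstospaces` / vim's `:set et | retab`.
expand -t 4 sample.pp > sample.pp.tmp && mv sample.pp.tmp sample.pp
cat sample.pp
```

After this, a whitespace lint (like the Jenkins check mentioned above) should no longer flag tab characters.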
[22:44:54] did you git add the modifications?
[22:46:49] legoktm: shame! no. now it seems to work
[22:47:04] :)
[23:03:02] !ping
[23:03:02] !pong
[23:22:47] !ping
[23:22:47] !pong
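The "shame! no" moment above is the classic Gerrit pitfall: local edits don't reach the next patchset until they are staged with `git add` and the commit is amended. A toy-repository sketch of the workflow (file names and messages are made up; the final `git review` upload is omitted since it needs a live Gerrit):

```shell
set -e
# Throwaway repo showing why `git add` must precede `git commit --amend`.
git init -q demo
echo 'tabs' > demo/file.pp
git -C demo add file.pp
git -C demo -c user.email=you@example.org -c user.name=You \
    commit -qm 'Fix whitespace'
echo 'spaces' > demo/file.pp          # local edit after review feedback
git -C demo add file.pp               # skip this and --amend picks up nothing
git -C demo -c user.email=you@example.org -c user.name=You \
    commit -q --amend --no-edit       # same commit (and Change-Id), new content
git -C demo log --oneline             # still a single commit
```

Amending keeps the Change-Id footer intact, so Gerrit files the re-push as a new patchset on the same change rather than a new change.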