[00:05:54] YuviPanda: if/when you are working, I’m interested in your thoughts about why https://gerrit.wikimedia.org/r/#/c/258071/ doesn’t work [00:06:17] andrewbogott: looking [00:06:49] andrewbogott: is realm set properly? [00:07:29] YuviPanda: puppet is currently failing with [00:07:30] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to determined $::labsproject at /etc/puppet/manifests/realm.pp:18 on node promethium.eqiad.wmnet [00:07:39] That’s inside an if realm == labs block [00:10:15] andrewbogott: wait [00:10:22] andrewbogott: $::labsproject is not from hiera.. [00:10:26] andrewbogott: $::labsproject is from facter [00:10:36] is that set by us anywhere? [00:11:06] YuviPanda: I encourage you to update your puppet repo :) [00:11:19] https://gerrit.wikimedia.org/r/#/c/258051/ [00:11:34] ah :) [00:12:57] It’s possible I’m misinterpreting what’s happening… I assume that that initial hiera lookup is failing, it’s falling back on the fact (which doesn’t work on metal) and erroring out [00:16:07] andrewbogott: right. I guess we can test that with a notice {} there maybe? [00:16:34] sure [00:16:37] * andrewbogott tries [00:18:29] 6Labs, 10Tool-Labs: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1867916 (10Merl) Two additional reasons: First one is caused by wikimedia plan to change to ssl only, please read my comment at T105794#1574563. So I need Java 1.8 to keep my bot working. The second one is that it causes m... 
[00:19:38] 6Labs, 10Tool-Labs: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1867920 (10Merl) [00:21:48] YuviPanda: I added notify messages and I see them on a labs box but not on promethium [00:21:59] so… no idea what puppet server promethium is hitting :( [00:22:06] but that would explain a lot [00:22:09] right [00:25:02] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [00:27:02] YuviPanda: it’s hitting the right puppetmaster [00:27:15] it’s just failing too soon to get to any of the notify lines [00:28:27] andrewbogott: hmm, I guess we can't really put an ordering thing there [00:31:39] did you change something? It just started working [00:36:17] andrewbogott: no [00:36:23] andrewbogott: also hiera caches for a min or something [00:36:31] hm [00:36:32] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [00:36:34] must be something cache-related [00:36:42] although it was way more than a minute [00:37:22] andrewbogott: wait puppet also needs to run on the puppetmaster to apply the labs hiera config change [00:37:25] did that happen already too? [00:37:34] I… think so? [00:37:39] But if not, that would explain the delay [00:37:54] right [00:56:51] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Amitie 10g was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=223625 edit summary: [00:57:26] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Frisko was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=223628 edit summary: [00:57:58] hmm... is there a reason I cannot create an instance? [00:58:12] YuviPanda, ? [00:58:24] did you use up all server space? [01:07:44] YuviPanda: sorry to ping but this is driving me nuts. I'm trying to build a new instance to replace bd808-vagrant. 
I created a new project "mediawiki-vagrant" to put it in this morning. I have since tried to build 2 instances there and neither one would allow me to ssh in after puppet finished running. [01:07:57] YuviPanda: mwv-image-builder.mediawiki-vagrant.eqiad.wmflabs is the instance I haven't deleted yet with the problem [01:11:40] RECOVERY - Puppet failure on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [01:15:00] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [01:40:12] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [02:10:00] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1868319 (10bd808) 3NEW [02:25:32] 10MediaWiki-extensions-OpenStackManager, 10Echo, 3Collaboration-Team-Current, 5Patch-For-Review, 5WMF-deploy-2015-12-08_(1.27.0-wmf.8): Write presentation models for notifications in OpenStackManager - https://phabricator.wikimedia.org/T116853#1868344 (10Catrope) 5Open>3Resolved [02:31:04] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1868356 (10bd808) [02:31:08] 6Labs, 10MediaWiki-Vagrant, 15User-bd808: Create "mediawiki-vagrant" project - https://phabricator.wikimedia.org/T120982#1868357 (10bd808) [02:56:21] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:08] bd808: ugh, sorry :| [02:57:11] also wat testing-shinken- [02:57:26] fuck fuck fuck [03:02:46] ok [03:02:46] it's fine [03:02:46] just slow [03:02:47] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:47] mm [03:02:47] maybe not [03:02:47] Hi! 
i created an instance - apache-multihost-01.analytics - turned role::puppet::self on, and am trying to ssh into it - but i get denied [03:02:48] YuviPanda: I filed a bug about it. Hopefully it's something simple [03:02:48] yeah [03:02:49] I'm at an event now with super low data [03:02:57] I've called andrew and he should be here soon [03:03:00] for the tools outage [03:03:42] YuviPanda: is this another one of those ‘everything is slow for a minute’ things? [03:03:43] Or different? [03:03:46] andrewbogott: not sure. [03:03:49] (maybe you were asleep for those) [03:03:53] bd808: are you having the same problem that i am? [03:04:01] everything is recovered now, of course [03:04:01] andrewbogott: icinga / shinken paged and then it timed out when I tried it [03:04:04] augh [03:04:10] So, same as before I think [03:04:37] I see [03:04:48] madhuvishy: apache-multihost-01.analytics looks just fine to me [03:05:13] andrewbogott: interesting - i can get in now [03:05:18] It’s a mess [03:05:20] but reachable [03:05:21] i tried 10 times before [03:05:29] why is it a mess [03:05:29] I got the page too, was it transient? [03:05:32] did you apply puppet::self before logging in [03:05:33] ? [03:05:39] chasemp: same thing as this morning I think [03:05:41] andrewbogott: oh yeah i think so [03:05:55] madhuvishy: usually best to make sure puppet is running cleanly before you apply new classes [03:06:01] probably you should just delete this one and start over [03:06:20] andrewbogott: ok cool - i thought that was for other classes - not puppet::self. cool, i'll do that thanks [03:06:53] madhuvishy: /especially/ for puppet::self; it’s like fixing a car while it’s driving down the freeway. [03:07:09] Good analogy heh [03:07:25] chasemp: So those pages are going to keep happening, and I don’t know how to debug them [03:08:17] andrewbogott: I see. Thanks for explaining! 
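The advice above (make sure puppet is running cleanly before you apply a new class like role::puppet::self) can be scripted. A minimal sketch in Python, assuming the standard `puppet agent --test --detailed-exitcodes` convention (0 = no changes, 2 = changes applied, 4 = failures, 6 = changes plus failures); the helper names are mine, not an existing tool:

```python
import subprocess

# Exit codes for `puppet agent --test --detailed-exitcodes`:
#   0 = no changes, 2 = changes applied, 4 = failures, 6 = changes + failures
def puppet_run_clean(exit_code):
    """True when a --detailed-exitcodes agent run finished without failures."""
    return exit_code in (0, 2)

def safe_to_apply_new_class():
    """Run the agent once and report whether the catalog applied cleanly."""
    proc = subprocess.run(["puppet", "agent", "--test", "--detailed-exitcodes"])
    return puppet_run_clean(proc.returncode)
```

Run something like this (or just eyeball a `puppet agent --test` run) before adding classes; applying puppet::self on top of a broken run is the fixing-the-car-on-the-freeway situation described above.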
[03:08:50] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 953411 bytes in 5.580 second response time [03:09:55] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 953352 bytes in 4.140 second response time [03:12:53] andrewbogott: yeah, so seems like toollabs homepage is showing up pretty consistently [03:13:01] I guess that's because it's one of the few things we consistently check [03:13:07] yeah [03:13:09] any ideas how many times that has happened today? [03:13:17] three or four I think [03:13:20] looking at logs to see if anything makes sense now [03:13:35] but we should send an email w/ the times and how far apart and generally what the deal is to @ops [03:13:43] yeah [03:13:48] in case it happens first time and someone can catch it in the act [03:15:25] holy crap I see w/ htop slapd fill the screen w/ procs [03:15:40] and it's only at 30-40% on the 4 cores but I could easily see this spiking [03:15:52] i don't know openldap much at all [03:16:08] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [03:16:22] I’m trying to write a little tool to record ldap query times [03:16:52] that would be a good idea [03:17:03] assuming that it’s queries and not writes... [03:18:40] queries are definitely affected at least [03:18:53] oh? [03:19:00] I feel like it could just be ‘labs was locked up’ [03:19:07] unless you have a slow query running on a non-labs host? [03:19:09] otherwise why would regular dns be affected? [03:19:15] queries I mean [03:19:33] hm true [03:19:41] hey can you spin up a vm quick [03:19:46] that was causing this am [03:19:53] ... recovery? [03:19:57] Same as the other times? [03:20:03] Coren: same [03:20:10] (Sorry for the delay, was on the road back from the gym) [03:20:38] chasemp: doing [03:21:06] Still DNS stalls pointing to ldap? [03:21:48] well that's the symptom theory [03:21:50] hm, it’s happening I think [03:21:53] root cause unknown [03:21:54] Now?
[03:21:59] or else my local network is just slow [03:22:05] No, it might be. [03:23:31] DNS is erratic - I have delays on labs-ns1 but labs-ns0 is responsive. [03:23:32] what do you think, is labs stalling out? [03:23:34] Hm. Though I expect pdns itself does some caching so won't always hit ldap. [03:24:23] Yeah, something is definitely wrong with labs in general, but I just ruled out the FS (active, no issues there) [03:24:31] But lookups of users stall. [03:24:41] (getent passwd just hangs) [03:24:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:25:00] Most bots are happily chugging along. [03:25:10] ok, there, the storm is over [03:25:15] so it lasts just long enough to page :) [03:25:28] But anything that hits passwd or group db goes blam. [03:25:49] I can login to a vm fine now etc [03:25:59] chasemp: same here [03:26:52] But I had a ssh open on a bastion so I could do tests without hitting auth. 'touch foo -> worked', 'sleep 1 -> worked', 'ps faux -> hangs', 'getent passwd -> hangs' [03:27:01] I'll eat my had if that isn't ldap. [03:27:04] hat* [03:27:30] I mean, surely seems like ldap but why [03:27:36] Coren: but why do our shells freeze at the same time? [03:27:44] Oh, yours didn’t? [03:27:58] Mine did but I was doing :wq in a VM [03:27:59] Not as long as I didn't do anything that looked usernames up. [03:28:01] which would have hit NFS [03:28:14] Hm. [03:28:36] NFS might be partially impacted too - it does ldap lookups for group membership. But it caches aggressively. [03:29:10] andrewbogott: how often does ldap "sync" [03:29:12] between them [03:29:31] chasemp: ooooh. I like the way you think. [03:29:50] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 953520 bytes in 3.765 second response time [03:31:14] where is this coming from [03:31:15] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [03:31:19] is it main icinga? [03:31:22] or shinken?
[03:31:42] The alert? Shinken afaik. Yuvi? [03:32:22] * Coren points out that it might be a good idea for shinken-wm to say 'l.team' somewhere in its alerts. [03:33:06] I've got instrumentation on labstore and on an instance looking at processes. If it happens again I'll see if there is a NFS impact as well. [03:33:11] pretty sure it’s shinken [03:33:28] Coren: great, that saves me the trouble of setting that up [03:33:37] history since the top of the hour for tools.wmflabs.org [03:33:38] [2015-12-10 03:25:38] SERVICE ALERT: tools.wmflabs.org;tools-home;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 953415 bytes in 4.299 second response time [03:33:39] Service Ok[2015-12-10 03:25:17] SERVICE ALERT: tools.wmflabs.org;NFS read/writeable on labs instances;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.017 second response time [03:33:39] Service Critical[2015-12-10 03:23:48] SERVICE ALERT: tools.wmflabs.org;tools-home;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds [03:33:41] Service Critical[2015-12-10 03:23:28] SERVICE ALERT: tools.wmflabs.org;NFS read/writeable on labs instances;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [03:33:42] the other thing I wanted to do is gather stats about ldap query response time, and gather those stats on a non-labs box [03:33:43] Service Critical[2015-12-10 03:21:48] SERVICE ALERT: tools.wmflabs.org;tools-home;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [03:33:45] Service Ok[2015-12-10 03:02:27] SERVICE ALERT: tools.wmflabs.org;tools-home;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 953455 bytes in 9.700 second response time [03:33:47] Service Ok[2015-12-10 03:01:57] SERVICE ALERT: tools.wmflabs.org;NFS read/writeable on labs instances;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.106 second response time [03:33:49] Service Critical[2015-12-10 03:00:18] SERVICE ALERT: tools.wmflabs.org;tools-home;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds [03:33:57] Coren: Is that something you can
arrange easily, or should I keep working on that? [03:34:22] chasemp: I just sent an introductory email to the Ops list; please respond with additional details as you see fit [03:34:39] ok thanks I don't have much to go on atm, I see tools.wmflabs.org flapping a lot [03:34:40] andrewbogott: I'm doing this the manual way - eyeball with tops and iptraf; do you want longer-term recording? [03:34:42] but idk why [03:34:48] would dns issues cause a check of that to die? [03:35:09] chasemp: No, but if ldap stalls the tools.wmflabs.org landing page does many lookups. [03:35:14] I want to reset the master account password for my labs-hosted wiki but the email for changing it isn't going through. Is there a special incantation I forgot to say? [03:35:18] chasemp: As it looks up group membership. [03:35:23] ah [03:35:32] well then [03:35:44] I can make a quick fix and remove the tool list from the landing page [03:35:58] But that will just hide the problem. [03:36:31] Coren, want me to cause another outage? :) [03:37:00] andrewbogott: If you have a repeatable way to trigger it, I'd love to see the effects live for once. [03:37:41] Punchline: After the crash, the first thing the software engineer says is “Let’s go back to the top of the hill and see if the brakes fail again!" [03:37:45] andrewbogott: [03:37:48] could it still be file descriptors [03:37:49] http://www.openldap.org/lists/openldap-software/200507/msg00063.html [03:37:55] Dec 10 03:37:05 seaborgium slapd[17834]: connection_input: conn=1463599 deferring operation: binding [03:37:59] and I see lots of [03:38:03] Dec 10 03:36:23 seaborgium slapd[17834]: connection_read(1855): no connection! [03:38:06] some ppl say to ignore but [03:38:19] Coren: strap in [03:38:19] some more savvy posts seem to indicate if you see openldap deferring [03:38:23] it's bouncing folks is it not? [03:38:32] If there's lots of them, that smells bad chasemp. [03:38:34] Jul 4 13:50:03 annuaire slapd[19523]: connection_read(18): no connection!
[03:38:35] That is why it is deferring the operation. A quick google on that [03:38:35] shows some tuning of idle timeouts and possibly ulimit might be [03:38:36] helpful. I would recommend that you start tracking connections and [03:38:39] file descriptors to get a feel of conditions surrounding your [03:38:40] "freeze". [03:38:47] we have a ulimit issue already didn't we? [03:39:16] chasemp: alex pushed that limit way up and we're nowhere close to it that he could see. (2000+ fd iirc) [03:39:36] Well, nevermind, it’s not going to happen this time apparently [03:39:52] Max open files 4096 4096 files [03:39:53] andrewbogott: Of course, it *had* to be an heisenbug too. [03:41:44] openldap can open a shit ton of threads and use a shitton of connections I imagine [03:41:44] If I could’ve made it happen while Moritz was watching it’d be fixed by now :) [03:41:45] especially since it's getting hit w/ web traffic to icinga and other business [03:41:45] grep deferring debug | wc -l [03:41:45] 1000 [03:42:01] ah, there we go [03:42:07] chasemp: What time period is this? [03:42:19] I’m seeing some freezing now [03:42:23] but I can still do ldap queries [03:42:51] andrewbogott: I'm seeing a reduction in NFS traffic consistent with stuff freezing, but I can now confirm that the server /itself/ remains responsive. [03:42:53] * andrewbogott is going to keep triggering that page until EVERYONE is out of bed [03:43:10] ldap looks fine to me though... [03:43:24] Most processes run without issue. [03:43:27] dafu [03:45:25] Coren: ldap working for you too? [03:45:27] andrewbogott: I'm not sure. I'm trying a few things. [03:45:27] chasemp: Are you seeing a burst of deferred in the logs? [03:45:30] hmmm [03:45:32] bursts of [03:45:34] Dec 10 03:44:39 seaborgium slapd[17834]: connection_read(1858): no connection! [03:45:34] Dec 10 03:45:07 seaborgium slapd[17834]: connection_read(1802): no connection!
[03:45:35] Dec 10 03:45:17 seaborgium slapd[17834]: connection_read(1810): no connection! [03:45:41] idk some ppl say ignore it but... [03:45:43] it's odd [03:45:50] and we’re back [03:45:52] The timing is very suspicious. [03:46:00] chasemp: did it stop doing that? [03:46:07] when we recovered? [03:46:21] (or shortly before) [03:46:31] what's recovery time? [03:46:32] idk [03:46:42] Near 03:46 [03:47:13] Like, if there are many less during 03:45 than 03:46 we might be on to something. [03:47:18] last one is Dec 10 03:45:58 seaborgium slapd[17834]: connection_read(1813): no connection! [03:47:20] now I see another one [03:47:22] Err, the other way around. [03:48:07] the tools page seems pretty happy now [03:48:20] (weird that that last go didn’t page. It seemed like a freeze to me.) [03:49:10] there is a specific message for file descriptor exhaustion and I don't see any past the 8th [03:49:14] so there is that [03:49:57] andrewbogott: The check has to happen during the stall, if it's less than 5m it may well be missed entirely. [03:50:00] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [03:54:38] I am sleepy and out of ideas. Are we waiting until tomorrow to pester moritz and alex, or are there more angles of attack yet tonight? [03:54:52] I'm going to increase log level on slapd [03:54:56] and see if it spits out more useful [03:55:00] and I'm looking at timeout tuning ideas [03:55:39] I honestly can't think of a new approach at this time. Unless the increased log verbosity suggests something. [03:56:42] Here’s a thing that doesn’t fit the timeline but has been bothering me for a couple of days: https://dpaste.de/wXNx [03:57:00] if internal WMF dns has been going down, that would cause all kinds of hurt, including what we’re seeing now [03:57:04] Seems far-fetched though [03:57:19] andrewbogott: Hmm. [03:57:29] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [03:58:58] * Coren sighs.
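The "little tool to record ldap query times" mentioned earlier could start as small as timing TCP connects to the LDAP port from a non-labs box. A stdlib-only sketch, with the caveat that timing a real bind/search would need an LDAP client library; the function name is mine:

```python
import socket
import time

def ldap_connect_time(host, port=389, timeout=5.0):
    """Time a bare TCP connect to the LDAP port.

    Returns elapsed seconds, or None if the connect fails or times out.
    Note: a stalled slapd that still accept()s connections will look fine
    here, so this only catches the grossest failures; a fuller probe
    would also perform a bind and a simple search.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        return None
    return time.monotonic() - start
```

Run from cron on a non-labs host, appending (timestamp, elapsed-or-None) to a log; spikes and failures can then be lined up against the page times.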
[03:58:58] And we need a way to explain to shinken when hosts no longer exists. :-) [03:58:59] shinken itself is fine [03:58:59] this is a testing instance someone not me is running [03:58:59] Krenair: i think [03:58:59] Ah. [03:59:00] andrewbogott: Hm. I admit that while the timing doesn't work, internal dns resolution errors are worrisome. [03:59:53] ah, ‘testing-shinken’ [04:00:01] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [04:00:41] if someone has op they can ban it from here [04:01:37] how long has it been now? :) [04:01:52] ok ok I restarted it there [04:01:55] so that may have been my bad [04:03:47] Hm. You just restarted it? [04:03:47] And we get exactly the same symptoms. [04:03:56] no around 21:57 [04:03:56] ... is puppet restarting the daemon at every run, perchance? [04:07:33] chasemp: Are the more verbose logs being helpful, at least? [04:08:20] not particularly but there is a crazy number of logging options [04:08:26] http://www.openldap.org/doc/admin24/slapdconfig.html [04:08:32] Table 6.1: Debugging Levels [04:08:53] Meh. Set all bits to 1! [04:14:47] see if a page comes I restarted it again [04:16:36] I dunno about pages, but things just stalled. [04:16:37] other than this stall any in the last 15 minutes? [04:16:38] Yeah, cool I suppose. It seems we now know of a surefire way to cause the issue: restart slapd. [04:16:40] are the stalls only on instances with NFS? [04:16:41] so I did a few things here guys [04:16:44] I haven't gotten any ores pages and they have no NFS [04:16:45] disabled puppet on seaborgium and set a few limits in slapd.conf [04:16:46] YuviPanda: No, the general bastion also stalls and iirc that one doesn't have NFS [04:16:58] ok [04:17:21] it's just anything ldap afaiu [04:17:32] YuviPanda: The stalls are always fairly short; 2-3 min, so depending on check interval we don't always get alerts.
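The slapd "deferring operation" / "no connection!" messages being grepped here can be bucketed per minute to line the bursts up against the stall windows. A sketch; the regex just mirrors the syslog lines quoted in the channel, and the names are mine rather than an existing tool:

```python
import re
from collections import Counter

# Matches syslog lines like:
#   Dec 10 03:45:07 seaborgium slapd[17834]: connection_read(1802): no connection!
#   Dec 10 03:37:05 seaborgium slapd[17834]: connection_input: conn=1463599 deferring operation: binding
SLAPD_NOISE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}):\d{2} \S+ slapd\[\d+\]: "
    r"(?:connection_read\(\d+\): no connection!|.*deferring operation.*)"
)

def bucket_per_minute(lines):
    """Count matching slapd messages per syslog minute ('Dec 10 03:45' -> n)."""
    counts = Counter()
    for line in lines:
        m = SLAPD_NOISE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Feeding `/var/log/syslog` through this and comparing minute counts before and after a recovery is exactly the "many less during 03:45 than 03:46" comparison discussed below.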
[04:17:40] slapd sometimes decides it can't answer or honor a client or client request idk [04:18:03] I did one more thing I commented out [04:18:04] #limits dn.exact="cn=repluser,dc=wikimedia,dc=org" time=unlimited size=unlimited [04:18:16] which I believe says the replication can use as much time or query size as it wants [04:18:19] Unstalled. [04:18:34] Hm. Or not quite yet. [04:18:43] so it's still hosed up now? [04:19:04] sure enough [04:19:11] chasemp: It's erratic. I managed to log into bastion, then not. [04:19:18] o_O [04:19:42] ok I'm going to reset here w/ puppet [04:19:58] YuviPanda: what do you know about openldap? :) [04:20:02] I know...not much [04:20:06] 0 [04:20:14] me too [04:20:26] chasemp: I've deployed openldap in the past, but not with that many clients. [04:20:29] I also have an intermittent connection for the next hour before I reach stability [04:20:45] can we increase cores / ram for the ganeti instance? [04:22:26] I've always been super uncomfortable about the fact that we're using ganeti for this [04:22:27] it's not really hammering away at ram or cpu [04:22:27] I'm more suspicious of an in-slapd limit [04:22:27] ok [04:22:27] * YuviPanda is on his phone and just making stuff up [04:23:21] chasemp: Wait, are you doing evil things to ldap? I'm getting auth errors now. [04:24:22] uh yeah so yes [04:24:24] idk why it failed to start but it did [04:24:25] give it like 10s [04:24:25] Ah, kk. [04:24:26] so this is pretty much nuts [04:24:30] and unfortunately the nature of our rollout makes rollback a nightmare as well [04:24:45] Interestingly enough, you just restarted it and this time: no stall. [04:26:52] stall now? can you let me know if it starts again? [04:26:52] I'd like to lay off for a few to see what happens [04:26:52] Nope. Not stalling that I can see. [04:26:56] Oh, wait, here it comes. [04:27:03] I was probably saved by the caches. [04:27:09] nscd for the save. :-) [04:27:18] doing it now? [04:27:22] Ayup.
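For context, the knobs being toggled here are slapd.conf directives. A hedged sketch of the kind of limit/timeout settings under discussion; the values are illustrative, not what was actually deployed on seaborgium, and only the repluser line is quoted from the channel:

```
# /etc/ldap/slapd.conf (fragment; values illustrative)
idletimeout 30        # drop client connections idle for more than 30 seconds
sizelimit   5000      # max entries returned per search
timelimit   60        # max seconds slapd will spend on one search
# per-DN override, the line commented out during debugging above:
limits dn.exact="cn=repluser,dc=wikimedia,dc=org" time=unlimited size=unlimited
```

The `limits` directive exempts a specific bind DN (here the replication user) from the global size/time limits, which is why commenting it out was a plausible experiment when replication was suspected.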
[04:28:17] * andrewbogott can still ldaplist [04:30:06] Hm. All but one instance have stopped talking to NFS [04:30:07] .... and it's back [04:30:17] Ah. stracing the mountd showed it was stuck in a getgrgid() [04:31:05] Interesting fact: the only instance that wasn't stuck was quarry-runner-02.quarry.eqiad.wmflabs. [04:31:31] s/wasn't stuck/was talking to NFS/ [04:31:46] (I presume others might have been not stuck but just not talking to nfs) [04:33:01] First time I see that effect on a stall; I wonder if there is a cache timeout issue to compound things. [04:33:32] chasemp: But I note that none of the stalls have lasted long enough for shinken to pick it up. Has anything changed on your end? [04:33:53] at "I'd like to lay off for a few to see what happens" [04:33:59] I put in some limit params and left it alone [04:34:15] I'm not optimistic tho [04:34:49] I mean if this is prod we wake up moritz at this point as the main guy on this [04:34:55] idk what the protocol is here [04:35:08] but learning ldap in 30m at this time of night isn't working out [04:35:18] I'm not sure either - is this an outage or just degraded? [04:35:37] I think we can live with 'degraded' until Moritz wakes up. [04:36:29] yeah, I think chasemp should add a brain dump to the email I just sent, and leave it to moritz and alex to look at in a few hours [04:38:03] how can I prevent a SGE error from blocking further submission of -once jobs in the future? [04:38:25] Ah, because the job errored out? [04:38:33] andrewbogott: best guess is something to do w/ replication...is [04:38:33] do_syncrep2: rid=001 LDAP_RES_SEARCH_RESULT [04:38:37] when replication kicks off? [04:38:44] it happened several times that another user notifies me about my bot not working, then I qdel the job to get it working again [04:39:00] Did the job status include an 'e'?
[04:39:00] Coren: it's always the "NIS error" or something [04:39:09] Coren: yeah a big "E" [04:40:08] liangent: You could do a qmod -cj before attempting to start the job - that will clear the error condition if there is one. [04:42:03] liangent: At the cost, of course, of hiding possible issues. [04:42:09] Coren: what will happen with this command [04:42:24] the job in error state will be restarted? [04:42:55] chasemp: your guess is as good as mine. It looks replicationy [04:43:40] ehm [04:43:48] andrewbogott: yeah it's just flapping all over the place [04:43:49] Coren: labs instance has stopped working (/data/project) [04:44:11] andrewbogott: Coren i saw NFS read/writeable on labs instances flap bad/good/bad/good there [04:44:13] but this second running again [04:44:31] so what's up? I just got paged for the 400th time :P [04:44:48] chasemp: Yeah, NFS also suffers from the same symptoms because it uses ldap. :-( [04:45:44] plz don't let it crash completely [04:45:47] bblack: seems the ldap servers put in place are periodically freezing or not answering clients or so it seems [04:45:59] basically anything that hits ldap has flapping issues we believe [04:46:07] and that is a lot of stuff [04:46:25] Where "a lot" means "pretty much all" [04:46:40] andrewbogott: ok so how do we just turn off repl then [04:46:45] to see if that helps [04:46:52] we need sanity more than robustness [04:46:53] atm [04:47:38] chasemp: I have not touched either of the openldap servers. I can start flipping through the docs though :) [04:47:50] ha I'm in the same boat [04:48:01] in light of the email, can we just mark that one service in downtime while it's being worked on? [04:48:13] I don't want to ignore my phone that could have real pages on it, and I'm tired of it going off :P [04:48:35] chasemp: what would it mean to break replication?
Services will write to one or the other server and they’ll get out of sync… seems like that could cause hilarious things to happen [04:48:56] andrewbogott: I was thinking stop slapd on the secondary and have only one server not trying to replicate [04:49:09] ah [04:49:14] bblack: I'm not positive which thing paged you [04:49:27] well… presumably if we stop one of them it will stop replicating, right? :) [04:49:27] but a downtime is probably in order if it's just tools homepage or something [04:49:37] andrewbogott: well yeah but does it spin its wheels harder or not [04:49:39] :) [04:49:43] tools-home on tools.wmflabs.org [04:49:57] bblack: looking [04:49:58] it's flapped like 8 times today [04:50:04] ok I downtimed that for 4 hours I think [04:50:05] that's the only one that pages us all [04:50:19] or not I did the wrong one [04:50:39] chasemp: looks right to me [04:50:42] 4 hours is optimistic [04:51:10] yeah that's the right one in icinga [04:51:12] what time is it in germany right now? [04:51:19] UTC+2? [04:51:23] 5:50 AM [04:51:33] well, in France, I presume DE is the same [04:51:45] oh ok yeah let's just call moritz that's not bad, if I'm him and it's my deal I want to be called [04:51:53] agree / disagree? [04:52:13] I agree and also I feel bad :) [04:52:16] bblack: how current are you on ldap :) [04:52:39] chasemp: it’s an hour later in Greece [04:53:00] could call alex I think he did a lot of the ldap-ing [04:58:09] ok no dice on answer [04:59:39] moritz too? [05:00:55] I only called moritz atm, we are sitting green everywhere I look [05:01:17] and I'm watching seaborgium atm it looks ok so I'm kinda waiting a moment [05:06:04] ok andrewbogott let's see how stopping the other slapd works out [05:06:05] :) [05:06:29] ok...
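Circling back to liangent's stuck -once jobs: Coren's `qmod -cj` suggestion can be wrapped so error states are cleared before resubmission. A sketch in Python; the `qstat` column layout assumed below is illustrative rather than verified against this grid, and as Coren noted, clearing the error hides whatever made the job fail:

```python
import subprocess

def jobs_in_error(qstat_output):
    """Return job IDs whose SGE state column contains 'E' (error).

    Assumes the default `qstat` layout (job-ID  prior  name  user  state ...);
    treat the exact columns as an assumption, not gospel.
    """
    jobs = []
    for line in qstat_output.splitlines()[2:]:  # skip the two header lines
        fields = line.split()
        if len(fields) >= 5 and "E" in fields[4]:
            jobs.append(fields[0])
    return jobs

def clear_error_states(job_ids):
    """`qmod -cj <id>` clears the error so the job can be scheduled again."""
    for jid in job_ids:
        subprocess.run(["qmod", "-cj", jid], check=False)
```

Running something like `clear_error_states(jobs_in_error(...))` before each submission avoids the qdel-and-resubmit dance, at the cost Coren mentions of masking real failures.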
[05:06:46] Since this is an intermittent issue… we’ll only know if it’s worse, not if it’s better [05:07:30] yeah [05:07:53] and I’m not prepared to stay up for the 6 hours needed to form an opinion about if it’s resolved [05:08:13] me neither I'm just looking to gauge stability [05:08:21] I think the last hour has been survivable honestly [05:08:25] it seems to have at least chilled out some [05:08:34] whether that's due to settings staged on seaborgium [05:08:35] idk [05:08:37] maybe not [05:08:37] so I might argue in favor of doing nothing [05:08:42] vs. shutting down one server [05:08:52] because killing one of them enters new territory and we won’t be here to babysit [05:09:18] and the status quo is bad but I can imagine worse [05:09:52] well I had already stopped the secondary when you said that :) it went green from red (not that it means anything) [05:09:58] and I'm going to just sit and wait for a few [05:10:19] and if it has same issues then ok restart and give it a bit of time but yes the last hour or so has been not a big deal really [05:10:37] but it has flaked out pretty hard 2h ago or so [05:10:42] so I'm waiting for a few [05:11:29] ok — I’m going to go brush my teeth &c and will check back at least to say goodnight :) [05:11:35] ok same shit [05:11:37] k [05:11:42] * andrewbogott is in EST right now [05:19:41] Same here; but I'll keep an eye out as long as I can. [05:25:28] it's been about 45 minutes since 'NFS read/writeable on labs instances' flapped [05:25:33] before that [05:25:48] it was 6 times within the hour [05:26:02] and that's when I decided to sit on my changes so I have no optimism for this being an end-all [05:26:08] but if it says ok for an hour [05:26:21] I'll try to send an update and call it triaged for moritz [05:26:50] works for me. I agree that it seems solid right now [05:27:14] I’m signing off — with luck all this will be magically fixed by morning [05:27:50] Yeah, it's been visibly more stable.
[05:28:01] oh, chasemp, I’m going to mark ldap and puppet downtimed on serpens [05:28:11] doing it now :) [05:29:13] Coren: related? Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [05:29:56] chasemp: ok, we both did it then [05:30:03] ha [05:30:07] tx andrewbogott [05:30:09] good night all! [05:30:20] later man I'll call if I see things get weird [05:30:42] chasemp: Not sure. That's ldap-bound for sure. [05:35:25] chasemp: So, replication has been out for that time? [05:36:23] no down for about 22m [05:36:31] 20m or so I guess [05:36:49] well I'm not sure what is the end result of one daemon stopped, how hard does the other try [05:37:00] from what I can tell it's doing a kind of mirroring [05:37:10] but anyways yeah [05:39:24] it's hard because if this were some existing long standing service I would look at changes and hopefully a pain point or two [05:39:28] but being 1 day or so old [05:41:30] chasemp: You feel comfortable with my going to sleep? [05:41:35] Coren: did you say http://tools.wmflabs.org/ hits ldap hard for that membership info? [05:41:45] sure man go for it I can call you if it gets bad again [05:42:01] I'm going to hang it up assuming this lasts another bit here [05:42:07] 30m since real flapping or so [05:42:18] chasemp: Lemme check how hard the current version does; I moved much of it to cache in the DB instead. [05:43:15] * Coren wonders. [05:43:26] Actually, the current version doesn't hit ldap at all... [05:43:29] D'oh! [05:43:30] ha [05:43:32] the proxy does! [05:43:46] Because it maps per-user [05:44:13] I'm tired and dumb how does that work out for tools.wmflabs.org being so touchy [05:44:28] ppl are hitting urls that do hit ldap hosing it but the homepage doesn't?
[05:45:05] No, the homepage php doesn't hit ldap - but the proxy itself does to look up the user to connect /to/ [05:45:24] (The homepage is just a magic rewrite for /admin/) [05:45:35] I.e.: the tool.admin user [05:46:04] I expect if things stall, /every/ web service does because proxy. The landing page is just the only one being checked. [05:47:16] Anyways, bed if you're okay holding down the fort for a bit more? [05:47:24] sure, later on [05:47:24] You PST right now? [05:47:27] CST [05:47:33] it's late yo [05:47:36] Still better than EST. :-) [05:47:37] o/ [05:47:39] true [07:34:38] PROBLEM - Host tools-worker-04 is DOWN: CRITICAL - Host Unreachable (10.68.16.122) [07:44:14] 6Labs, 10Tool-Labs, 6Design Research Backlog, 6Learning-and-Evaluation, and 2 others: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1868588 (10egalvezwmf) @leila, @yuvipanda, wondering how this is going? Wondering if you need support or if its possible to post a brief summ... [08:50:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:54:14] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1868704 (10valhallasw) 5Open>3declined a:3valhallasw Contraindications: - php7 is not available for trusty, precise or jessie from the official repositories, - php5 will be supported for the lifetime of trusty, which... [09:07:26] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1868742 (10valhallasw) Once merged, this will install ``` openjdk version "1.8.0_40-internal" OpenJDK Runtime Environment (build 1.8.0_40-internal-b09) OpenJDK 64-Bit Server VM (build 25.40-b13, mixed...
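Coren's point above is that the landing page is only special in that the proxy rewrites / to the tool.admin user; every tool URL goes through the same per-tool backend lookup, so a stalled lookup stalls every webservice at once. A toy sketch of that routing, with a plain dict standing in for whatever store the real proxy consults (the backend name and port here are invented):

```python
# Hypothetical registry; the real proxy resolves tool -> backend dynamically.
ROUTES = {
    "admin": "http://tools-webgrid-01:4242",   # invented backend address
}

def resolve_backend(path):
    """Map '/toolname/rest...' to that tool's backend.
    '/' is rewritten to the admin tool, like the landing page."""
    tool = path.strip("/").split("/", 1)[0] or "admin"
    # None means no such tool; this surfaces to the client as an error page
    return ROUTES.get(tool)
```

The key property the chat describes: a request for a deleted tool (like /weather/) and a request for the homepage both hit the same lookup path.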
[09:07:36] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1868743 (10valhallasw) p:5Triage>3Low [09:10:53] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 954646 bytes in 7.632 second response time [09:56:38] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1868826 (10IKhitron) I am happy, @Edgars2007, but: 1: Maybe I'm a troll. ;-) 2: Maybe somebody else was in your account, and he is a troll. [10:22:54] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:52] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 955296 bytes in 7.694 second response time [10:35:12] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1868874 (10jcrespo) This seems to have worked. Lag is in the 0-10 range, which is acceptable. There is stil... [10:35:36] 6Labs, 10Tool-Labs, 10DBA: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#1868877 (10jcrespo) [10:35:37] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1868875 (10jcrespo) 5Open>3Resolved a:3jcrespo [10:51:42] 6Labs, 10Tool-Labs, 10DBA: Throttling linkwatcher tool user as it is consuming 100% CPU - https://phabricator.wikimedia.org/T121094#1868898 (10jcrespo) 3NEW [10:52:12] ^Beetstra [10:54:11] Question concerning puppet: If I remove a puppet role from an instance, is puppet now disabled for this role, or will puppet delete the directory where the files are? [10:56:50] Luke081515: just disabled.
[10:57:27] ok, that's good, so I can stop puppet from overwriting custom files [10:57:53] Thanks [11:09:34] Can someone help me? After a reboot I can't access my instance. Permission denied (publickey,keyboard-interactive). [11:20:31] Luke081515: try adding your pubkey to your project hiera in this way: https://wikitech.wikimedia.org/wiki/Hiera:Tools [11:20:48] Luke081515: afterwards, after a puppet run (which can take another ~20 mins or so), you should be able to log in as root [11:27:18] valhallasw`cloud: Thanks [11:32:24] Is anyone around that can allocate me a single IP for a labs project? Or should I file a ticket? :) [11:33:34] 6Labs, 10Attribution-Generator, 6TCB-Team: Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1868930 (10Addshore) 3NEW [11:33:58] 6Labs, 10Attribution-Generator, 6TCB-Team, 15User-bd808: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1868939 (10Addshore) [11:34:04] 6Labs, 10Attribution-Generator, 6TCB-Team, 15User-bd808: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1864901 (10Addshore) Many thanks! [11:34:27] 6Labs, 10Attribution-Generator, 6TCB-Team: Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1868942 (10Addshore) [11:46:39] 6Labs, 10MediaWiki-extensions-Newsletter: Internal error when creating new user in newsletter-test.wmflabs.org - https://phabricator.wikimedia.org/T119945#1868966 (10Qgil) a:301tonythomas [11:56:14] valhallasw`cloud: I'm not a linux expert, can you show me how to ssh with root permissions? [11:56:59] if someone gets a message that I tried to sudo at bastion-01, that was my previous try; it failed :-/. Wrong approach [12:01:43] Luke081515: why are you sshing from bastion-01? [12:02:01] Luke081515: anyway, ssh root@your-host-name [12:02:20] addshore: why is novaproxy not good enough?
[12:04:09] valhallasw`cloud: luke081515@bastion-01:~$ ssh root@rcm-2 => Permission denied (publickey,keyboard-interactive). :-/ [12:04:36] Luke081515: and does that actually use the relevant private key? [12:04:40] use -vv [12:11:57] debug1: key_load_public: No such file or directory [12:14:33] Luke081515: sorry, I don't have the time at the moment to walk you through all the steps. In that debug output, there should be a list of keys tried. Compare that with the keys you expect to be tried (namely, those in your key agent). Is the forwarding working? Are the right keys loaded? Etc. [12:15:03] ok [12:28:46] hey ! can someone take a look at why I am getting this error : https://phabricator.wikimedia.org/T120516#1869021 [12:35:54] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:34] valhallasw`cloud: I can't point a custom domain at that can I? [12:37:27] addshore: ahhh. Not by default, I think, no. [12:40:52] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 955843 bytes in 8.357 second response time [12:46:54] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:47:00] !log tools broke tools-proxy-02 login (for valhallasw, root still works) by restarting nslcd. Restarting; current proxy is -01. [12:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [12:51:54] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 955787 bytes in 9.331 second response time [12:53:46] valhallasw`cloud: any idea why this one is happening up ? [12:53:50] https://phabricator.wikimedia.org/T120516#1869021 [12:54:00] the instance id changed back to newsletter-test though [12:54:22] tonythomas: no clue, but I remember some issues with LXC not being detected. Try a reboot? 
:-p [12:54:32] just rebooted now, though [12:54:58] /srv/lxc is empty here [13:29:34] tonythomas: try passing --provider=lxc, as the error suggests? :/ [13:33:48] trying that one now [13:34:43] The provider 'lxc' could not be found, but was requested to [13:34:43] back the machine 'default'. Please use a provider that exists. [13:34:44] mwvagrant@newsletter-test:/srv/mediawiki-vagrant$ [13:35:17] valhallasw`cloud: I tried to solve that issue, but the debug output says that only TCP is responding, and the other components are not :(. Can someone take a look at that? [13:35:54] Luke081515: I don't understand. [13:37:05] valhallasw`cloud: Here is the relevant output: https://phabricator.wikimedia.org/F3064458 A part of the instance looks like it is not working [13:37:08] tonythomas: iirc bd808 had seen the issue before, but I'm not sure what the solution was :/ [13:37:28] valhallasw`cloud: yeah - https://wikitech.wikimedia.org/wiki/Help_talk:MediaWiki-Vagrant_in_Labs [13:38:07] Luke081515: right, so agent forwarding is working, but your key is being denied. Are you sure you are using the right key and username? [13:38:42] Yeah, I'm using this key and this username for all my instances at that project, and I can still ssh to other instances at my project with this setting [13:47:48] Can a labs admin take a look at this instance? [13:47:53] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:53] ... odd; I'm not seeing a stall at all. [13:48:59] That short? [13:49:10] Luke081515: I'll take a peek, lemme catch up on scrollback. [13:50:58] Luke081515: What project is rcm-2 in? [13:51:01] Coren: I made a change at the srv directory and restarted it. Now I can't log in, the instance doesn't accept the key: https://phabricator.wikimedia.org/F3064458 [13:52:29] Luke081515: What project is this instance in?
[13:52:45] Coren: Project rcm [13:52:46] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 955810 bytes in 3.133 second response time [13:52:57] Alright, lemme take a look. [13:53:05] Thanks [13:53:38] Hrm. [13:53:51] Even my labswide root key isn't accepted. [13:54:22] Do you recall what the exact change you did was? [13:57:19] Coren: I moved srv/phab to srv/phab.old, and then I did this: https://phabricator.wikimedia.org/T117441#1842484 [13:57:53] I did the same yesterday at rcm-4 and couldn't log in, and wanted to do this today, but now it works at rcm-4 again, and I don't know why [13:58:08] I see absolutely no reason why that should affect auth or login at all. [13:58:17] * Coren boggles a bit. [13:58:27] Does this project use a self-hosted puppetmaster? [13:58:55] Coren: No, it uses just ::role::phabricator::labs. But I got this one time: AH00112: Warning: DocumentRoot [/srv/phab/phabricator/webroot] does not exist [13:59:07] But it seems like puppet recreated this file [13:59:21] Hm, and indeed I have no issue logging into rcm-4 [13:59:24] Now I get a fatal error at the web domain, but no 403 like before [13:59:45] That warning comes from Apache, also shouldn't affect auth. [14:00:12] Mind if I reboot rcm-2? [14:00:26] I've already done this, but I can do it again, wait a moment [14:01:19] it's rebooting now [14:04:11] Coren: Reboot finished [14:04:35] * Coren attempts to figure out what's up. [14:05:54] Hm. Still no luck. You're telling me that rcm-4 did the same thing yesterday? Do you remember what time(s)? I might be able to figure out what's going on by looking at /its/ logs. [14:09:06] Yesterday, uhm, that was between 21, or 22 UTC and 23 UTC. Then I tried it again at about 9 or 10 UTC, and then it worked [14:09:38] Ah, excellent - that gives me a good idea. [14:09:42] * Coren digs into logs.
[14:17:42] Luke081515: That's really odd - I see you connecting on Dec 9 17:53; doing the changes like you said via sudo, then no attempts at all until you logged in on Dec 10 10:59. Not even failed attempts. [14:18:48] hm, but I got Permission denied (publickey,keyboard-interactive). there too. And we webhost behave same like at rcm-2, first 403, than at lest a successful answer [14:18:55] *the [14:19:07] *at least [14:19:23] Wait, you're getting permissions denied on rcm-4 *now*? [14:21:29] coren: No, at the moment not [14:21:37] at the moment I can log in [14:22:07] but I got this yesterday [14:22:32] I'm looking at the auth log on rcm-4 and I see no connection attempts at all between your successful ones. [14:22:37] * Coren is confused. [14:31:48] Or wait, I don't have a log from yesterday of my console. One thing was the same: the 403 at the webspace. But I'm not sure if the login failed [14:31:56] maybe I mixed something up [14:32:05] But I cloned the same data to rcm-4 [14:32:26] That's almost certainly unrelated; apache and ssh share no common configuration. [14:32:31] Hm. [14:34:47] but otherwise that means that the auth at rcm-2 broke with the first reboot after the change at the srv directory [14:44:44] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: tools.weather gadget causing high load on tool labs - https://phabricator.wikimedia.org/T121104#1869216 (10valhallasw) 3NEW [14:49:14] Coren: Do you have access to the data at rcm-2 without logging in? Otherwise maybe we can solve the problem by moving the data to a new instance (but I need the data from there first, the last backup was a few weeks ago) [14:51:10] Luke081515: Possibly, but it's a very complicated operation at best. I'm working on a different labs-wide problem atm but I should be able to give your issue more attention shortly?
[14:51:43] ok, that's good [15:03:29] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: tools.weather gadget causing high load on tool labs - https://phabricator.wikimedia.org/T121104#1869262 (10valhallasw) Sinkholed with .lighttd.conf ``` url.rewrite-once = ( "^/[^?]*(\?(.*))?$" => "/index.php?$1&full_request=$0" ) ``` and ``` 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869275 (1001tonythomas) [15:50:52] 6Labs, 10Attribution-Generator, 6TCB-Team: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1869513 (10Tobi_WMDE_SW) p:5Triage>3High [15:51:24] 6Labs, 10Attribution-Generator, 6TCB-Team, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1868930 (10Tobi_WMDE_SW) [16:07:10] 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869594 (10bd808) > No usable default provider could be found for your system. This usually means that the shell alias of `vagrant` has not been setup for the current shell. `... [16:16:54] 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869625 (10bd808) Added a note about this issue on wikitech: https://wikitech.wikimedia.org/w/index.php?title=Help:MediaWiki-Vagrant_in_Labs&diff=224409&oldid=224404 [16:43:52] bd808: I got another confusing thing with mediawiki-vagrant [16:44:16] super happy fun time. 
Lay it on me [16:44:40] (16:42) root@localhost:[wiki]> SHOW COLUMNS; [16:44:40] ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1 [16:44:49] normally that command works [16:45:45] valhallasw`cloud: can you explain your last comment on https://phabricator.wikimedia.org/T121104#1869262 ? [16:45:57] I can’t tell if you’re describing a problem or a solution [16:45:57] Luke081515: really? what would it be showing you columns from? [16:46:01] andrewbogott: problem [16:46:31] ok :( [16:46:34] andrewbogott: https://tools.wmflabs.org/weather/ was deleted, so returned a 404, but that javascript then kept retrying with ~10 req/sec [16:46:40] Since that script is on a wiki can we just fix it? [16:46:51] Luke081515: SHOW COLUMNS FROM tablefoo; seems more likely to work [16:47:00] Luke081515, SHOW COLUMNS is not used much [16:47:06] but that is the syntax [16:47:12] DESC is shorter [16:47:20] andrewbogott: yes, that is also an option, but I don't have rights for that ;-) [16:47:21] bd808: Sorry, I had a kind of error in mind...^^ [16:47:21] DESC table; [16:47:44] I want to use SHOW Tables, but.... sorry [16:48:02] or if you want something more standard [16:48:05] Luke081515: no worries. [16:48:08] I'm a bit ill at the moment.. so [16:48:18] valhallasw`cloud: I was thinking that as a wiki page maybe ‘anyone can edit' [16:48:22] SELECT * FROM information_schema.columns WHERE table_name= ''; [16:48:30] andrewbogott: gadget :-) [16:48:30] but I can’t tell which farsi tab is ‘edit' [16:48:35] https://fa.wikinews.org/w/index.php?title=%D9%85%D8%AF%DB%8C%D8%A7%D9%88%DB%8C%DA%A9%DB%8C:Gadget-Weather.js&action=edit [16:48:39] jynus: Thanks too [16:48:42] aka mediawiki namespace, so sysop only [16:48:48] ah [16:48:58] Coren: do you have the needed privs? [16:49:07] andrewbogott: try &uselang=en?
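For reference, the working MySQL forms from the exchange above are `SHOW COLUMNS FROM tbl;`, `DESC tbl;`, and the standard `information_schema.columns` query; bare `SHOW COLUMNS;` is the 1064 error because the table name is mandatory. The same column inspection can be sketched with Python's stdlib sqlite3 (used here only because it needs no server; against MySQL you would run one of the statements above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")

# sqlite's rough equivalent of MySQL's SHOW COLUMNS FROM page;
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
cols = [row[1] for row in conn.execute("PRAGMA table_info(page)")]
print(cols)  # ['page_id', 'page_title']
```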
;) [16:49:22] I'm actually surprised this did not kill toollabs earlier -- this script has been running for half a year or so, and we never spotted the high load on the proxy [16:49:34] andrewbogott: I do, I think. +staff is global. [16:49:36] oh, but that's because it /worked/ earlier, so it didn't retry [16:49:42] * Coren tries to catch up on scrollback. [16:49:47] valhallasw`cloud: we have actually had instability there honestly [16:49:55] Coren: maybe perform a bit of surgery on that script? Make it sleep before retry, or something? [16:49:59] I recall responding to what looked like almost a ddos a month or so ago [16:50:00] andrewbogott: mouse over the links and look for the one that targets ?action=edit [16:50:08] or just kill the script altogether [16:50:16] andrewbogott: or just add ?action=edit to the url [16:50:16] it needs tools.wmflabs.org/weather but that tool doesn't exist anymore [16:50:18] bd808: yeah, I found it. Mostly I was just amused at the fancy farsi script [16:50:22] I think mjbmr left the WMF space? [16:51:11] is there an etiquette for just turning it off? [16:51:13] toys like this that try to turn MediaWiki into a desktop or emacs make me sad [16:51:52] andrewbogott: Retry? I'm pretty sure any failure is an issue. [16:52:48] I think https://meta.wikimedia.org/wiki/Proposals_for_closing_projects/Closure_of_Persian_Wikinews might have something to do with this [16:53:26] Coren: I haven’t even found the part of the script that hits labs, I’m just taking valhalla’s word for this [16:54:19] andrewbogott: line 183 [16:54:57] url: '//tools.wmflabs.org/weather/api/v1', [16:56:07] post? [16:57:15] well, anyway, since that tool doesn’t exist anymore… maybe removing the reference to the script is better [16:57:25] or deleting the script entirely; it can’t be doing anything useful can it? [16:57:29] nope [16:59:28] so... [16:59:41] I’m back to thinking that maybe Coren can take a drastic step?
[17:00:41] I might - I'll be back soon after some desperately needed food and look into it? [17:01:15] I’m sure I could do it too if I re-log in as staff account. But I suspect Coren is more versed in the politics of such a move. [17:01:17] Coren, sounds good, thank you [17:01:31] we could also deny the url in the proxy [17:01:34] uselang=en [17:01:42] valhallasw`cloud: now that that page is essentially a no-op, it’s not really imposing a load on us is it? [17:02:07] Yeah, I see the ajax call. I could remove it entirely, [17:02:08] Correct [17:02:29] ok, so maybe we don’t care. I must’ve misunderstood earlier and thought this was still causing load [17:02:37] you can also ask matmarex to edit [17:02:44] since he has the .js chops [17:02:44] Until the webservice dies, or something like that, then the issue will return [17:03:13] I think Krinkle has a global right to fix crappy js too [17:03:49] I think the minute the weather tool goes dark this is back to causing issues iiuc [17:04:35] ok — then we should definitely change and/or murder the script [17:04:42] tools.wmflabs.org/weather seems to still exist though [17:04:47] it doesnt' say 404 [17:04:56] Krinkle: yes, because 404 also causes the script to retry [17:05:01] I added that page as a stopgap measure [17:05:12] ^this and thanks valhallasw`cloud [17:05:17] You're saying the tool was removed but someone here added the non-404? [17:05:27] OK [17:06:39] !log tools revoked mjmbr membership of 'weather' tool; replaced with tools.admin [17:06:45] I've disabled the gadget [17:06:49] Krinkle: thanks! [17:07:03] I'll post a message on their village pump [17:17:36] andrewbogott: Did you just get paged? [17:18:23] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: tools.weather gadget causing high load on tool labs - https://phabricator.wikimedia.org/T121104#1869853 (10chasemp) 5Open>3Resolved a:3chasemp @krinkle disabled the gadget after discussions on IRC and @valhallasw is posting a message to the village pump htt... 
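The root cause chasemp and valhallasw describe above is a client that retries immediately and forever (roughly 10 req/s once the tool started returning 404). The usual fix, had the gadget been patched rather than disabled, is capped exponential backoff between retries; a sketch with illustrative numbers, not taken from the actual Gadget-Weather.js:

```python
def backoff_delays(attempts, base=1.0, factor=2.0, cap=300.0):
    """Seconds to sleep before each retry: 1, 2, 4, ... capped at 5 min,
    instead of hammering the proxy at a fixed 10 req/s forever."""
    return [min(cap, base * factor ** i) for i in range(attempts)]
```

After a dozen failures such a client settles at one request per five minutes instead of several hundred per minute, which is the difference between a dead tool being invisible and it looking like a DDoS on the proxy.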
[17:18:58] Coren: for what? [17:23:01] chasemp: It's really odd. webservice start and stop fail, but everything is happy atm afaict [17:23:14] valhallasw`cloud: did you have any time today to talk about that changeset for diamond deps? [17:23:19] Coren: I didn't get anything [17:23:47] chasemp: uuuh, I haven't looked at it yet. [17:24:28] valhallasw`cloud: ok no worries, idk how you feel about me changing it and merging, and I only ask as I'm bootstrapping hardware nodes and running into it [17:24:40] chasemp: please go ahead [17:24:44] ok thanks [17:26:45] chasemp: maybe the entire class should just have a require_package('python-diamond'), so that it's installed before anything else in that class runs [17:27:11] that may be simplest honestly [17:27:46] chasemp: aaaand, I got the improved. Odd that you didn't get anything - maybe those are old and for some reason got delayed for me? [17:28:08] if this is from shinken I think I don't get them [17:28:13] I only get the main icinga ones atm [17:28:13] 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869880 (1001tonythomas) It did work, but now I get this. ``` mwvagrant@newsletter-test:/srv/mediawiki-vagrant$ vagrant up Bringing machine 'default' up with 'lxc' provider.... [17:29:32] Coren: no page [17:34:42] Huh. Odd. [17:35:13] Coren: you’re going to write ldap perf monitors, right? [17:35:26] I got the matching catchpoint email, mind you, so it's not an old page. [17:35:46] andrewbogott: I got an alarm written up already at https://gerrit.wikimedia.org/r/#/c/258168/ that could do with a review [17:35:52] oh, if it’s catchpoint it might just be that I’m not subscribed [17:36:00] I think chasemp was working on a graphite collector? [17:36:28] Coren: https://gerrit.wikimedia.org/r/#/c/258152/ back at you [17:37:06] 'retry_check_interval => 0'? Does that work?
[17:40:23] 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869892 (10bd808) I've seen that one too. As I recall it means that there are files in /srv/mediawiki-vagrant and/or /srv/vagrant-data that are not readable by the `mwvagrant`... [17:41:20] Coren: I don’t know. In theory it should reschedule immediately [17:41:30] the docs are unclear [17:41:47] I think it will too, from my reading. I just note that I don't think we're using this anywhere else atm. [17:42:40] how do I set up a proxy for my instance now? Seems to have disappeared from the left menu - I seem to remember YuviPanda saying it was moving somewhere [17:43:13] It should be under the "Labs Projectadmins" section of the left nav as "Manage Web Proxies" [17:43:29] jdlrobson: or just go directly to https://wikitech.wikimedia.org/wiki/Special:NovaProxy [17:43:54] yeh not showing up there bd808 [17:43:55] weird [17:46:32] Coren: Still to do, or do you have time to look again at my instance? (Just a question, I don't want to pester) [17:48:53] Luke081515: Still to do - like I said, I doubt I'll be able to recover it given the state it is in and my inability to log in at all. [17:49:52] ok [17:50:06] Luke081515: But I'm pretty sure it's not what you did that was the issue. [17:50:35] Luke081515: We switched ldap servers earlier this week and - unless your puppet worked right - that might have been the issue. [17:51:13] ... although that shouldn't have prevented root logins too. [17:53:13] So I'm not sure what happened to it except that I don't seem to be able to go figure out what happened to it. [17:53:29] Coren: Should I create a task for it, or not? [17:53:39] Luke081515: Better if you do, regardless [17:54:14] andrewbogott may be able to tell you how plausible getting files off a dead instance is, but you'll have to get a well-deserved lecture on how it should have been puppetized.
:-) [17:54:55] It’s possible but very annoying [17:54:57] what instance? [17:55:49] rcm-3.rcm.eqiad.wmflabs [17:55:59] No logins work; including roots. [17:56:18] 6Labs, 10Labs-Infrastructure: Login into rcm-2 broken - https://phabricator.wikimedia.org/T121123#1869962 (10Luke081515) 3NEW [17:56:29] Err, -2. [17:56:30] Wait! [17:56:37] *NOW* I can log in with root. [17:56:42] oh [17:56:46] dafu? [17:57:01] * Coren makes sure puppet and ldap are up to date. [17:57:09] Coren: I can log in too [17:57:13] Great [17:57:19] Holey sheets. [17:57:26] That's how a task helps :D [17:57:29] That puppet manifest does /not/ apply cleanly. [17:58:06] Luke081515: 10:1 that's the source of your worries. Make sure puppet runs cleanly. :-) [17:58:46] Ah, ok, I guess it's the phabricator role, because I changed the directory where puppet operates? [17:58:58] So I have to disable this role in this case? [17:59:35] 6Labs, 10Labs-Infrastructure: Login into rcm-2 broken - https://phabricator.wikimedia.org/T121123#1869979 (10Luke081515) 5Open>3Resolved a:3Luke081515 After 8 hours of failed logins it works now :). [17:59:46] quickly solved task ;) [18:01:16] Luke081515: That may well be it. If you run a 'puppet agent -tv' as root, you'll see the errors. [18:01:27] ok, thanks [18:02:05] * Luke081515 disables this role now [18:03:57] Coren: If we get this another time, I think I know why this was fixed: I used the "hard reboot" function at horizon, maybe that's why I can log in again [18:07:30] That doesn't seem very likely to me, but I suppose it's possible. [18:07:56] hard reboot shouldn't really change anything within the instance. [18:21:07] Coren: my root login doesn’t work [18:35:37] !log rcm rcm-2: Update to current phabricator version successful [18:35:50] morebots? [18:36:44] 13:57 -!- labs-morebots [~labs-more@208.80.155.228] has quit [Ping timeout: 260 seconds] [18:36:49] can someone restart him? [18:39:54] andrewbogott: On what instance? rcm-2?
[18:40:10] rcm-3.rcm.eqiad.wmflabs [18:40:20] I mistyped. We were talking about rcm-2 [18:40:22] :-) [18:40:37] It seems to have self-healed anyways, and now puppet runs cleanly I think. [18:52:58] great [18:53:27] btw, Coren, do you know offhand how/where a labs instance decides what project it’s in when a user logs in? [18:55:28] Not sure what you mean by 'when a user logs in'? [18:56:29] There is /etc/wmflabs-project that contains the project name. [18:56:33] Put there by puppet. [19:02:37] ah, right, that [19:02:43] hm, that should work then [19:11:15] andrewbogott: So I got the getent speed check running now - it's set to be fairly easy to trigger, but I didn't make it critical yet (so no paging) [19:11:29] great! [19:11:45] * YuviPanda waves vaguely [19:11:49] lots of backscroll! [19:11:53] what server should I look at in icinga to see that? [19:12:22] andrewbogott: Either labstores. I use labstore1001 [19:12:41] ok [19:13:24] * andrewbogott sees it [19:16:09] ldap is being quite solicited by the nfs server; I'm glad it does some fairly aggressive caching as well. [19:18:35] I wonder if the load it puts on slapd is somehow bigger than that which it put on opendj? [19:19:01] * Coren isn't sure how expensive grabbing the list of uniqueMembers of a group might be. [19:19:28] More specifically, how expensive getting the list of groups a specific cn is a uniqueMember of. [19:21:55] YuviPanda: can you help me out with pam on promethium? iirc you worked on login auth recently [19:22:02] (or was that Coren?) [19:22:12] That was me. [19:22:28] yup, that was Coren [19:22:34] What's your issue? Nothing I did should have touched !labs? [19:22:36] I have managed to not touch PAM at all :) [19:23:03] You should be able to log into promethium as root with your labs root key [19:23:21] oh right, that's the metal host. [19:23:21] once there, have a look around and see if you can explain why it says 'invalid user andrew’?
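Per Coren above, an instance learns its project from /etc/wmflabs-project, written by puppet; the $::labsproject fact from the earlier promethium trouble is derived the same way, which is why a box puppet never fully configured can't answer the question. A defensive reader for that marker file, with the path parameterized so the sketch runs anywhere (hypothetical helper, not an existing module):

```python
def read_labs_project(path="/etc/wmflabs-project"):
    """Return the project name from the marker file, or None when the
    file is absent or empty (e.g. bare metal puppet never touched)."""
    try:
        with open(path) as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None
```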
[19:23:26] yeah, right [19:23:31] should be running the same auth as a labs vm [19:24:23] andrewbogott: Well, the first thing I see is that you don't have the ldap client settings in. [19:24:42] which file is missing? [19:24:49] * Coren looks. [19:25:00] and/or incorrect? [19:25:44] maybe something that comes from the base labs image isn’t puppetized but needs to be [19:26:14] Possibly. I'm looking at things now; the actual ldap config is there, trying to find why nss doesn't seem to be. [19:27:22] Most things seem in place. Hmmm. [19:27:25] * Coren digs deeper. [19:29:56] Aha. The issue is in nslcd [19:30:06] Dec 10 19:28:14 promethium nslcd[23024]: [7fa04c] ldap_start_tls_s() failed (uri=ldap://ldap-labs.codfw.wikimedia.org:389): Connect error: (unknown error code) [19:30:28] Wait, codfw. [19:30:40] Why does it then say "no available LDAP server found: Server is unavailable" [19:30:40] Coren: grabbing the list of uniqueMembers of a group should be really cheap, it's directly read as attributes from the DN [19:31:04] Dec 10 19:26:59 promethium nslcd[23024]: [b57ed4] ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) [19:32:00] andrewbogott: A restart of nslcd fixed it. [19:32:09] andrewbogott: I'm guessing it got started before its whole config was in place. [19:32:14] Coren: ok... [19:32:17] hm [19:32:38] I can log in as me now. [19:32:50] andrewbogott: Also, yow! You have sooo many groooups! [19:32:59] yeah [19:33:19] that fact breaks Horizon for me — more projects than any OpenStack dev has ever seen [19:33:20] You're certainly giving the supplemental group list a good workout. :-) [19:33:37] But, anyway--- [19:33:39] thanks for fixing! [19:33:50] Now that that box has labs puppet and labs logins, I’m going to declare victory and go to the beach [19:33:59] No worries. I wish I knew for sure why nslcd was ill. Try to reboot to make sure it comes back up right? 
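On Coren's cost question a bit earlier: listing the uniqueMembers of one group is a single entry read, but "all groups this cn is a uniqueMember of" is the reverse lookup, which is only cheap if the directory indexes the member attribute; naively it scans every group. The difference in miniature, with plain dicts standing in for LDAP entries:

```python
def groups_of(member, groups):
    """Naive reverse lookup: scan every group's member list, O(groups)."""
    return sorted(g for g, members in groups.items() if member in members)

def build_index(groups):
    """Invert the mapping once; afterwards each lookup is a dict hit,
    roughly what an index on uniqueMember buys the directory server."""
    index = {}
    for g, members in groups.items():
        for m in members:
            index.setdefault(m, set()).add(g)
    return index
```

This is also why andrewbogott's unusually long supplemental group list is a worst case for the scan-everything path.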
[19:34:05] bd808: probb on a labs instance running vagrant/mediawiki what's the easiest way to expose a node server running on another port that's web accessible [19:34:21] andrewbogott: U can haz victory! [19:34:49] Coren: yeah, I’ll reboot. I also need to rebuild this instance from 0 later [19:35:47] !log mailman deleting all instances — they’re broken and no one is around to care [19:36:58] !log mailman deleting all instances — they’re broken and no one is around to care [19:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mailman/SAL, dummy [19:37:21] Coren: can I get another link to that ‘broken by ldap’ etherpad? [19:38:02] https://etherpad.wikimedia.org/p/remaining-ldap [19:38:12] bottom red? [19:38:33] That's my list from tcpdump as of last evening. [19:39:05] right k [19:39:21] chasemp: that’s it, thanks. [19:39:24] Just deleted three more [19:39:55] andrewbogott: Coren btw I have something tomorrow afternoon [19:40:02] can we push the meeting into the later day [19:40:07] I know andrewbogott you said late day is ok for now? [19:40:16] yeah, that’s fine [19:40:43] maybe 3 central? [19:40:53] Coren: what tz are you? [19:43:13] chasemp: EST [19:43:35] 3 central is 4 here which is good enough. [19:43:39] 4 your time tomorrow then? [19:43:39] ok [19:43:42] done and done [19:53:45] 6Labs, 10Tool-Labs, 5Patch-For-Review: deploy package_builder on tool labs - https://phabricator.wikimedia.org/T111730#1870490 (10valhallasw) 5Open>3Resolved a:3valhallasw Now running on tools-packages (which is a jessie host, which means everything just works(TM)) [19:54:10] jdlrobson: The `vagrant forward-port $HOST_PORT $VM_PORT` command should let you do that [19:55:28] then you will have to setup NovaProxy to forward something in to that port on your host vm [20:04:09] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1870527 (10valhallasw) Ok, so it's slightly more complicated. 
There is a trusty package in apt.wm.o, but it's outdated and currently no-one is responsible for it. I discussed the situation with #moritzm... [20:06:53] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1870535 (10MoritzMuehlenhoff) Ah, that's another gotcha I forgot to mention about the openjdk packages: Not all tests of the test suite complete :-/ What I do for the updates in Debian is to compare th... [20:07:19] valhallasw`cloud: rule of thumb about openjdk: it's usually worse than anticipated :-) [20:07:33] :D [20:08:12] so... the best comparison would be the corresponding sid build, I suppose? [20:09:20] or maybe the jessie-backports log file (which would be even closer in terms of packages), let me see whether that's available [20:09:38] there's also a ppa which could be used as comparison [20:09:53] but then again, that one might just also be broken ;-) [20:09:55] it is :-) https://buildd.debian.org/status/package.php?p=openjdk-8&suite=jessie-backports [20:10:15] the jessie-backport package is widely in use, all the cassandra clusters use it [20:10:32] ah, good [20:10:38] we made the backport since the garbage collector in openjdk-7 tends to freeze/lockup under high load [20:10:48] and the one from openjdk-8 works fine [20:11:14] so, if the test result doesn't differ gravely from the logs for jessie-backports we should be fine [20:11:31] I love how https://buildd.debian.org/status/logs.php?pkg=openjdk-8&arch=amd64 is 'Maybe-Successful' [20:12:06] the jessie one in apt.wm.o is just the debian backport? [20:12:37] ah, wait, but that's sid [20:13:44] I never understood why some builds are Maybe-Successful... [20:23:11] ok, so it's worse than the i386 one [20:26:55] moritzm: I'm really confused why there is no amd64 build log [20:27:46] I can of course just build for jessie and then compare logs, though...
[20:29:08] ah, but there's a wily build log available [20:29:37] the amd64 build is usually done by the person who uploaded the package (and that log is only on that person's machine), buildd.debian.org only lists the automatically-built packages for other archs [20:29:48] and most Debian devs use amd64 on their machines [20:30:22] ah, that explains [20:30:35] ok, so compared to https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa/+build/8156693 it's significantly worse [20:31:05] 550 more tests failing... [20:32:00] hmm, that's quite a number. all across the board of the various test suite sections or are there hotspots where the difference is especially big? [20:33:46] compiler, gc, runtime, tools. Seems all over the place :/ [20:35:42] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1870635 (10valhallasw) I compared the log file with the result from https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa/+build/8156693, as well as the i386 build from https://buildd.debian.org/status/l... [20:35:45] hmm, doesn't sound very well. maybe there's a reason why there are no backported packages for trusty in Ubuntu yet :-/ [20:39:49] moritzm: Right. https://code.launchpad.net/~malte.swart/+recipe/openjdk-8-backport is basically as bad [20:41:54] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1870646 (10valhallasw) https://code.launchpad.net/~malte.swart/+recipe/openjdk-8-backport is basically as bad, so I'm actually not that hopeful that this is going to be possible anymore... [20:42:00] andrewbogott: Did you ever do that ldap check you talked about? Otherwise, ima do it now. [20:42:45] valhallasw`cloud: yeah, since the jessie version is working fine, maybe users with a need for java 8 can hop on the kubernetes train? 
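The comparison moritzm describes (the trusty build against the jessie-backports and PPA logs) boils down to diffing two sets of failing test names. A rough sketch, assuming jtreg-style logs where failures appear on lines beginning with `FAILED:` (the exact log format is an assumption here and may differ):

```python
def failed_tests(log_text):
    """Collect test names from lines like 'FAILED: compiler/foo/Bar.java'."""
    return {
        line.split("FAILED:", 1)[1].strip()
        for line in log_text.splitlines()
        if line.lstrip().startswith("FAILED:")
    }

def regression_report(our_log, reference_log):
    """Return (tests failing only in our build, tests failing only in the reference)."""
    ours, ref = failed_tests(our_log), failed_tests(reference_log)
    return sorted(ours - ref), sorted(ref - ours)
```

The first list is the interesting one: roughly 550 extra failures spread across compiler, gc, runtime, and tools is what made the trusty backport look hopeless here.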
[20:43:18] *nod*, although I do think having a less ancient java would make sense [20:44:00] I'm not sure how well suited k8s is for developing /inside/ a pod, for example [20:45:24] only one way to find out :-) [20:45:55] alternatively, copy your build somewhere and let people who're interested in java try it out on their own [20:46:10] then we can check whether it's a faulty test suite or generally broken on trusty [20:47:18] that's actually a great idea in general [20:47:35] having a tools-testing-bastion where we install newer packages so people can test them before they are deployed on the rest of the systems [20:50:42] 6Labs, 10Tool-Labs: Provide a tools-staging-bastion to allow users to test newly built packages - https://phabricator.wikimedia.org/T121146#1870664 (10valhallasw) 3NEW [20:53:12] I tired again today and new instances in the mediawiki-vagrant project are still having the ssh issue I described in https://phabricator.wikimedia.org/T121064 [20:53:25] *tried [20:55:58] andrewbogott: Did you ever do that ldap check you talked about? Otherwise, ima do it now. [21:02:53] hi bd808 [21:03:34] I can get in as root [21:03:36] looking [21:03:53] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1870726 (10yuvipanda) I can get in as root, am looking [21:04:36] bd808: hmm, try now? [21:05:18] YuviPanda: nope. still ending with "Connection closed by UNKNOWN" [21:05:26] which is a weird error [21:05:31] right [21:05:32] so it's PAM [21:05:43] oooh [21:05:45] not PAM [21:05:52] it's PAM [21:05:54] Dec 10 21:05:02 mwv-image-builder sshd[775]: fatal: Access denied for user bd808 by PAM account configuration [preauth] [21:06:06] ok. ldap?
[21:06:10] Dec 10 21:05:02 mwv-image-builder sshd[775]: fatal: Access denied for user bd808 by PAM account configuration [preauth] [21:06:13] err [21:06:19] -:ALL EXCEPT (project-mediawiki-vagrant) root:ALL [21:06:23] that's right [21:06:25] bd808: no, not LDAP [21:06:35] bd808: since [21:06:36] root@mwv-image-builder:~# /usr/sbin/ssh-key-ldap-lookup bd808 [21:06:38] works [21:07:18] bd808: interesting [21:07:27] bd808: you aren't on the mediawiki-vagrant project [21:07:37] bd808: in LDAP, that is [21:08:30] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1870730 (10yuvipanda) `root@mwv-image-builder:~# id bd808 | grep medi uid=3518(bd808) gid=500(wikidev) groups=1003(wmf),50062(proje... [21:08:50] YuviPanda: ah [21:09:51] bd808: is htis a new project? [21:09:53] *this [21:10:05] YuviPanda: yes. I just made it yesterday [21:10:20] andrewbogott: chasemp ^ new projects don't seem to get the associated LDAP group created [21:11:49] bd808: similar to turning it off and on, can you remove and re-add yourself? :D [21:12:00] sure [21:12:55] YuviPanda: no joy [21:15:04] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1870755 (10yuvipanda) It was a new project created yesterday. [21:15:11] now to see if all new projects have this problem [21:15:13] or just this [21:18:46] nope [21:18:47] I'm around now, YuviPanda verdict? [21:18:49] just created a new project [21:18:58] chasemp: new project creation doesn't create related LDAP group [21:19:11] this prevents people from being able to log into instances in new projects [21:19:13] I found a real bug! [21:20:08] want to make the task too?
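The pam_access line quoted above, `-:ALL EXCEPT (project-mediawiki-vagrant) root:ALL`, denies login to everyone, from any origin, except root and members of the project group, which is exactly why a user missing from the group is rejected at the PAM account phase even though the key lookup works. A deliberately simplified model of that single rule (real pam_access supports much richer syntax):

```python
def pam_access_allows(user, groups):
    """Model of '-:ALL EXCEPT (project-mediawiki-vagrant) root:ALL':
    deny all users from all origins unless the user is root or is a
    member of the parenthesised group. 'groups' is the user's group list."""
    return user == "root" or "project-mediawiki-vagrant" in groups
```

This matches what was observed: root could log in, bd808 could not until `project-mediawiki-vagrant` showed up in his group list.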
:) [21:20:21] I assume it will involve modifying some ldap acls but idk [21:20:25] https://phabricator.wikimedia.org/T121064 [21:21:10] I made one other project yesterday too... /me looks for project name [21:21:19] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870768 (10chasemp) p:5Triage>3High [21:21:20] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870771 (10yuvipanda) I created a new project (`dashiki`) and I didn't get added to a `project-dashiki` group [21:21:33] YuviPanda: what project did you just create to try? [21:21:41] chasemp: `dashiki` [21:22:32] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870775 (10bd808) The `lizenzhinweisgenerator` will probably have the same problem. I created it yesterday too (T120925). [21:23:24] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870778 (10chasemp) @yuvipanda made a `dashiki` to verify the same behavior. we see: > Dec 10 21:05:02 mwv-image-builder sshd[775]: fatal: Access denied... [21:24:23] YuviPanda: does /usr/sbin/ssh-key-ldap-lookup bd808 usually only work if the person is in the project in ldap that the VM is also in? [21:24:28] or should that always work? [21:24:32] (afaik it always works) [21:24:43] chasemp: no it always works [21:24:55] kk [21:25:11] 6Labs, 10Attribution-Generator, 6TCB-Team, 15User-bd808: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1870784 (10bd808) @addshore Your project will probably have the issue described at {T121064} so follow that bug for progress on the fix. 
[21:26:19] akosiaris: or moritzm: are you still around? [21:27:44] YuviPanda: bd808 how does that work on wikitech if you are not in the ldap project, should you still be able to manage VMs for the project etc [21:27:54] i.e. if not in the new project how does it work to create a VM for it? [21:28:05] chasemp: I have cloudadmin rights [21:28:11] ok [21:28:58] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870810 (10chasemp) @bd808 is a cloudadmin so I imagine that is how he was able to create a VM in the project in the first place? [21:31:19] I would prefer not to go mucking about w/ openldap acls in all their arcane-ness atm, YuviPanda how do you feel about saying no new projects until this is figured out and that's not right this moment? [21:31:32] yeah, I'm ok with that [21:31:49] bd808: are you ok with putting this instance in a different project for now? [21:31:52] I ccd alex and moritz and pm'd moritz as a heads up to take a look [21:32:06] we can also hand-hack the LDAP entry for bd808's entry [21:32:24] YuviPanda: I can totally wait. The whole point was to replace another host where ldap/puppet is hosed [21:32:31] I was thinking about that too, I have no objection really it should be as simple as a normal group add yes?
[21:33:04] need to create groups too [21:34:46] oh I'm confused does the group get created and users are not added to it [21:34:51] or does the group fail to get created entirely [21:34:54] I just checked [21:35:01] getent group | grep gave me nothing [21:35:03] so I suppose the latter [21:35:20] assuming getent group shows me empty groups [21:35:31] well empty groups are not allowed I believe [21:35:43] so I wonder if it fails to create the empty group and then because that's failed [21:35:47] the user cannot join [21:35:54] yeah maybe opendj allows that [21:35:58] and openldap does not [21:36:07] speculation but yes opendj and openldap do differ in their handling of empty groups [21:36:17] I would not be surprised if this is related [21:36:23] and maybe fundamentally broken atm? [21:36:39] yeah [21:36:49] that could break new tool creation too [21:36:56] since that's similar [21:37:16] checking [21:38:12] Coren: btw, NFS is pretty slow - sshing to any instance with NFS takes a good 5-6 s for first hit to return, blocking shell until it happens [21:38:34] chasemp: yes, new tool (service account creation) is also broken [21:39:41] YuviPanda: bastion-restricted-01 has the same behaviour afaict. [21:39:51] Bah. Now both are fast. [21:39:55] * Coren curses.
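The empty-group hypothesis above has a concrete basis: in the standard `groupOfNames` object class, `member` is a mandatory attribute, so a strict server rejects an add of a member-less group with an objectClassViolation, while a more lenient server may accept it. A toy validator illustrating the difference (illustrative only, not real server code):

```python
def validate_group_of_names(entry, strict=True):
    """Reject a groupOfNames entry with no members when strict (OpenLDAP-like
    behaviour); let it through when lenient (OpenDJ-like behaviour).
    'entry' is a dict mapping attribute names to value lists."""
    if strict and not entry.get("member"):
        raise ValueError("objectClassViolation: groupOfNames requires 'member'")
    return True
```

Under the OpenDJ-era behaviour the create-empty-then-add-members sequence worked; after the OpenLDAP migration the first step fails, so the members are never added and `getent group` finds nothing.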
[21:40:29] I was doing logins on tools-bastion in parallel with bastion-restricted-01 [21:40:41] YuviPanda: when you say it breaks similarly you mean the group is never created etc [21:41:19] 6Labs, 10Labs-Infrastructure: Groups for new projects are not created as groups in ldap (thus users cannot join them and ssh doesn't work, etc) - https://phabricator.wikimedia.org/T121064#1870847 (10chasemp) [21:42:06] chasemp: I read through the OSM code and am pretty sure it's creating an empty group first [21:42:15] chasemp: yes, the LDAP group is never created [21:42:23] could you maybe link to that code in diffusion on the task [21:42:25] since it's first creating an empty group and then adding people [21:42:27] yeah [21:42:30] thanks YuviPanda [21:42:46] > # TODO: If project group creation fails we need to be able to fail gracefully [21:42:48] lol [21:42:54] Gak. We expected that this would cause an issue with removing the last user from a group (which was not a problem), I don't think anyone remembered the reverse. [21:43:04] 6Labs, 10Labs-Infrastructure: Groups for new projects are not created as groups in ldap (thus users cannot join them and ssh doesn't work, etc) - https://phabricator.wikimedia.org/T121064#1868319 (10chasemp) root@puppet-testing>getent group | grep -i Mediawiki-vagrant; echo $? 1 [21:43:13] YuviPanda: :D [21:43:33] # TODO: If project group creation fails we need to be able to fail gracefully [21:43:43] it is checking for success though [21:43:53] where does that failure bubble up I wonder? [21:44:10] $wgAuth->printDebug( "Failed to add project group $projectGroupName: " . ldap_error( $wgAuth->ldapconn ), NONSENSITIVE ); [21:44:33] to the useless debug log! 
[21:45:05] * bd808 curses the logging statements in most of MW [21:45:14] so what I don't know is, is the decision to disallow empty groups an openldap imperative or did we think it was more sane [21:45:33] and that determines whether we have to rewrite OSM portions or just do an openldap change [21:45:44] and I have really no backstory on any of the to-this-point reasoning [21:46:15] chasemp: My understanding is that disallowing empty groups is a standard-mandated behaviour. [21:46:48] sure, but the why, and can it be/should it be fixed [21:46:49] google tells me that it is specified in rfc2256 [21:46:50] chasemp: My suggestion was to always include the service group "user" within the group so that we never had an empty group even if there were no added users - that also needs OSM code changes. [21:47:17] chasemp: Either way, the code needs to create the group with an initial user; whatever that user is. [21:47:41] well not if we say it's more sane to allow it, if we can [21:47:45] rfc's be damned :) [21:48:14] 6Labs, 10Labs-Infrastructure: Groups for new projects are not created as groups in ldap (thus users cannot join them and ssh doesn't work, etc) - https://phabricator.wikimedia.org/T121064#1870868 (10yuvipanda) I looked at the source, it turns out that OSM first creates an empty group and then later on tries to... [21:48:25] Earwig, question on copyvios [21:48:30] go ahead [21:48:45] in index.mako, where does `query` come from? [21:49:01] is it automatically available to the template? I'm not seeing it imported anywhere [21:49:50] it's passed in as an argument to render_template [21:49:50] https://github.com/earwig/copyvios/blob/master/app.py#L106 [21:50:11] chasemp: does this look like a possible answer?
-- https://ask.openstack.org/en/question/7097/ldap-class-groupofnames-requires-attribute-member/ [21:50:34] there is keystone config there for sticking a dummy member into groups [21:50:40] 6Labs, 10Labs-Infrastructure: Groups for new projects are not created as groups in ldap (thus users cannot join them and ssh doesn't work, etc) - https://phabricator.wikimedia.org/T121064#1870876 (10MoritzMuehlenhoff) >>! In T121064#1870868, @yuvipanda wrote: > I looked at the source, it turns out that OSM fir... [21:51:36] Earwig: thanks! [21:51:41] no problem [22:00:34] bd808: It's the deal but it sounds like the correct fix is an OSM change [22:00:41] well, the acceptable fix [22:01:48] do we have a test location for OSM these days? [22:02:33] not as far as I know so that makes it complicated [22:02:51] there's one 'working' installation of OSM in the whole wide world [22:03:21] I'm wondering if we couldn't just make the creator a member by passing the 'member' key in the $projectGroup hash [22:03:32] dn I mean [22:04:23] yeah I think that's what we should do [22:18:06] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, 7Mobile: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1870936 (10Dzahn) [22:24:59] chasemp: who gets to bell this cat? [22:25:46] well I have the beginnings of it but a) unsure if right and b) unsure best way to test here, give me a sec and I'll share...we can collaborate and then suffer as one? [22:26:24] chasemp: so I think the only way to test is to just cherry pick on the live install [22:26:34] chasemp: so gerrit push, cherry pick, check, revert. [22:27:01] chasemp: and Krenair has been the awesome go-to person for mw related things (fwiw) [22:27:35] it's not as simple as https://gerrit.wikimedia.org/r/#/c/258355/ is it?
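The fix being discussed, passing a `member` key (a DN) in the `$projectGroup` hash so the group is never created empty, can be outlined as follows. This is an illustrative Python sketch of the entry being built, not the actual OSM PHP; the DN layout is taken from the `ldaplist` output later in this log:

```python
def new_project_entry(project, creator_uid, base_dn="dc=wikimedia,dc=org"):
    """Build a project group entry with the creator as the initial member,
    so a strict groupOfNames schema check can never fail on an empty group."""
    return {
        "dn": "cn=%s,ou=projects,%s" % (project, base_dn),
        "objectClass": ["extensibleObject", "groupOfNames"],
        "cn": [project],
        "member": ["uid=%s,ou=people,%s" % (creator_uid, base_dn)],
    }
```

Passing the member list at create time (rather than adding members in a second modify) is the whole point: the add operation itself then satisfies the schema.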
[22:30:00] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, 7Mobile: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1870974 (10Krenair) Well, wikitech has MobileFrontend installed: https://wikitech.wikimedia.org/wiki/?useformat=mobile That domain w... [22:30:40] 6Labs, 10Reading-Web, 6operations, 10wikitech.wikimedia.org: [Regression] Unable to browse certain wikitech.wikimedia.org urls from mobile device (Apache error) - https://phabricator.wikimedia.org/T120528#1870977 (10Krenair) [22:30:49] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, and 2 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1870980 (10Krenair) [22:32:23] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, and 2 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1855709 (10Krenair) See T87633 [22:32:37] 6Labs, 10Wikimedia-Extension-setup, 10wikitech.wikimedia.org, 7Mobile: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#995645 (10Krenair) [22:32:39] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, and 2 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1870989 (10Krenair) [22:32:56] chasemp: could be [22:32:58] try [22:33:42] YuviPanda, chasemp: Context? [22:33:49] What are you trying to achieve? 
[22:34:01] Krenair: openldap no longer allows that ldap group to be created without an initial member [22:34:24] and opendj was in the wrong here so we are thinking add the creator as an initial member [22:36:30] it looks like it could work [22:36:47] you should fix your indentation btw chasemp [22:36:49] :P [22:37:28] oh yes I should [22:45:12] 10PAWS: user's user-fixes.py not taken into account - https://phabricator.wikimedia.org/T121160#1871024 (10Rama) 3NEW [22:46:36] so does it work chasemp? [22:47:42] not sure YuviPanda help me hot test this quick? [22:47:53] had to track down my phone (2fa) there [22:48:22] chasemp: is it deployed? [22:48:29] yeah [22:48:32] ok let me try [22:51:36] chasemp: no dice, I tried creating 'test-project-a' not showing up in 'id' [22:51:54] getent also tells me it ain't there [22:52:17] hm [22:56:46] so YuviPanda I made gtest1-temp [22:56:51] and if I go and find it and filter fo rit [22:56:52] for it [22:56:59] it says it exists and I'm a member [22:57:04] that membership info...is from ldap isn't it? [22:57:11] oh [22:57:11] is this possibly a perms issue? [22:57:13] yeah [22:57:15] it is [22:57:18] what do you mean by perms issue? [22:57:29] like it's being created with perms that getent can't see [22:58:01] oh... [22:58:25] can't you just use ldaplist [22:59:06] krenair@terbium:~$ ldaplist -l projects gtest1-temp [22:59:09] dn: cn=gtest1-temp,ou=projects,dc=wikimedia,dc=org [22:59:12] Krenair: getent is using PAM which is what is denying people access, so that seems like the appropriate thing to use [22:59:14] hmmm [22:59:18] so the permission issue might be a thing [22:59:21] member: uid=novaadmin,ou=people,dc=wikimedia,dc=org [22:59:21] member: uid=rush,ou=people,dc=wikimedia,dc=org [22:59:22] Krenair: try test-project-a? [22:59:38] chasemp: actually, yeah, since novaadmin is a member of all projects already [22:59:41] I forgot...
[23:00:04] well I think it's a member post creation [23:00:16] the initial creation may still (we should verify) have been failing [23:00:27] we could just make it a member at the start then [23:00:29] hmm right [23:00:45] Krenair: I'm cool w/ that [23:01:38] chasemp: yeah, so getent can't see them but ldaplist can [23:01:42] ok I reset the hotfix [23:01:47] let's double check that's needed here [23:01:50] ok [23:03:05] yeah I don't think it is, or I just reset it and created gtest2-temp [23:03:05] and [23:03:10] dn: cn=gtest2-temp,ou=projects,dc=wikimedia,dc=org [23:03:11] objectClass: extensibleObject [23:03:11] objectClass: groupOfNames [23:03:18] member: uid=novaadmin,ou=people,dc=wikimedia,dc=org [23:03:18] member: uid=rush,ou=people,dc=wikimedia,dc=org [23:03:19] info: use_volume=home [23:03:25] info: use_volume=project [23:03:26] cn: gtest2-temp [23:03:34] although, I'm not entirely sure why that's allowed atm :) [23:04:06] why what's allowed? [23:04:27] it seems like it's creating an empty group and then adding members which isn't supposed to be ok with openldap [23:04:31] but maybe I'm misunderstanding [23:04:55] maybe it adds novaadmin directly somehow? [23:05:08] sure enough YuviPanda [23:05:10] dn: cn=mediawiki-vagrant,ou=projects,dc=wikimedia,dc=org [23:05:12] objectClass: extensibleObject [23:05:12] objectClass: groupOfNames [23:05:13] info: use_volume=home [23:05:14] info: use_volume=project [23:05:17] info: servicegrouphomedirpattern=/home/%p%u/ [23:05:19] member: uid=novaadmin,ou=people,dc=wikimedia,dc=org [23:05:21] member: uid=bd808,ou=people,dc=wikimedia,dc=org [23:05:22] member: uid=dduvall,ou=people,dc=wikimedia,dc=org [23:05:24] cn: mediawiki-vagrant [23:05:28] bd808's ^ [23:05:31] it was there [23:05:40] right [23:05:43] so getent can't see them [23:06:05] Coren: ^ do you know if the PAM stuff that was done recently could be affecting this [23:06:07] ?
[23:06:24] thanks Krenair btw [23:06:37] np [23:08:11] 6Labs, 10Labs-Infrastructure: Groups for project are created in ldap but getent cannot see them on user VMs (disallowing ssh, etc) - https://phabricator.wikimedia.org/T121064#1871509 (10chasemp) [23:11:42] 6Labs, 10Labs-Infrastructure: Groups for project are created in ldap but getent cannot see them on user VMs (disallowing ssh, etc) - https://phabricator.wikimedia.org/T121064#1871522 (10chasemp) With the idea that OSM was doing bad things I put up https://gerrit.wikimedia.org/r/#/c/258355/ and tested with a ho... [23:11:53] chasemp: neat. So the break is somewhere other than the OSM code I take it [23:12:13] I have a suspicion yes https://phabricator.wikimedia.org/T121064#1871522 [23:12:29] good and bad, patching OSM was a fun prospect but I'm not sure on the ACL fix at all [23:14:59] YuviPanda: we are back to sitting on it for a bit as a rash acl change seems silly [23:15:08] chasemp: yup [23:15:17] chasemp: I think this is a 'throw over the wall to europe' thing [23:15:43] we better bootstrap some openldap knowledge and fast my friend :) [23:36:25] 6Labs, 6Research-and-Data, 10Wikimedia-Stream: Provide useful diffs to high-volume consumers of RCStream - https://phabricator.wikimedia.org/T100082#1871596 (10yuvipanda) [23:38:54] 10PAWS, 10pywikibot-core, 5Patch-For-Review: Install developer requirements into PAWS - https://phabricator.wikimedia.org/T120860#1871599 (10yuvipanda) https://github.com/yuvipanda/paws/pull/8 [23:48:36] 10PAWS: user's user-fixes.py not taken into account - https://phabricator.wikimedia.org/T121160#1871618 (10yuvipanda) https://github.com/yuvipanda/paws/pull/7