[00:12:05] yuvipanda: do you have a phab task for the "build an elasticsearch cluster" thing you want me to do? [00:17:12] bd808: no, let me make one [00:19:02] 6Labs, 10Tool-Labs: Setup an experimental, user accessible (read+write) ES cluster for Tool Labs - https://phabricator.wikimedia.org/T120040#1843341 (10yuvipanda) 3NEW [00:19:19] 6Labs, 10Tool-Labs: Setup an experimental, user accessible (read+write) ES cluster for Tool Labs - https://phabricator.wikimedia.org/T120040#1843348 (10yuvipanda) a 3 node cluster with an authenticating proxy in front! [00:19:24] bd808: ^ [00:30:45] !log deployment updated rsvg on appserver to 2.40.11 - https://phabricator.wikimedia.org/T112421 [00:30:46] deployment is not a valid project. [00:31:09] !log deployment-prep updated rsvg on appserver to 2.40.11 - https://phabricator.wikimedia.org/T112421 [00:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [00:31:12] 6Labs, 10Tool-Labs: Setup an experimental, user accessible (read+write) ES cluster for Tool Labs - https://phabricator.wikimedia.org/T120040#1843407 (10bd808) > read-for-all This should be easy but don't forget that all the horrible things one can do to DOS/crash an Elasticsearch node are query based. Will we... [00:34:21] 6Labs, 10Tool-Labs: Setup an experimental, user accessible (read+write) ES cluster for Tool Labs - https://phabricator.wikimedia.org/T120040#1843416 (10bd808) p:5Triage>3Normal [00:34:55] 6Labs, 10Tool-Labs, 15User-bd808: Setup an experimental, user accessible (read+write) ES cluster for Tool Labs - https://phabricator.wikimedia.org/T120040#1843422 (10bd808) a:3bd808 [00:36:17] 6Labs, 10Tool-Labs, 15User-bd808: Setup an experimental, user accessible (read+write) ES cluster for Tool Labs - https://phabricator.wikimedia.org/T120040#1843425 (10yuvipanda) Right, so by default let's not expose this outside the tools project, and make up DOS protections as we go along :D Ah, so the oth... 
[00:39:46] bd808: I think re: dos, we need to find out what ES does if the http request that requested a query gets aborted [00:39:56] and once we set up authentication we can do rate limiting too I guess [00:42:03] I haven't tested in a long time, but I think it doesn't realize that the client socket is closed until it tries to write things back to it [00:42:43] the web bits and the internal job pool are detached and java never was any good at noticing that clients went away [00:43:28] but really we probably don't have to be more concerned than we are with the mysql instances [00:43:47] if somebody brings the cluster down then we'll bring it back up [00:44:54] yuvipanda: you know this means I'll have to build a deploy server too right? :/ [01:12:05] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1843529 (10Dzahn) 3NEW [01:14:19] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1843545 (10Dzahn) [01:20:58] bd808: yeah :( [01:25:57] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga: icinga config broken due to duplicate labstestcontrol - https://phabricator.wikimedia.org/T120050#1843587 (10Dzahn) 3NEW [01:26:34] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga: icinga config broken due to duplicate labs-ns1 / labcontrol2001 - https://phabricator.wikimedia.org/T120050#1843595 (10Dzahn) [01:35:00] yuvipanda: the deploy server isn't a big deal. It will mean however that we either (a) switch to a local salt server for everything in the tool project or (b) have a small number of nodes in the project that are attached to a different salt master. Which seems less gross to you? 
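The "once we set up authentication we can do rate limiting too" idea above could be as simple as a per-tool token bucket at the authenticating proxy. A minimal illustrative sketch (not from any Labs repo; the class and parameter names are made up):

```python
import time

class TokenBucket:
    """Illustrative per-account rate limiter: allow short bursts up to
    `capacity` requests, refilling `rate` tokens per second."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A proxy would keep one bucket per authenticated tool and return HTTP 429 when `allow()` is False, which bounds how many expensive queries any one client can fire at the cluster.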
[01:41:18] bd808: hmm, I wonder if we can get away without a deployment server [01:41:31] bd808: salt has never worked for me and I swore off using it and haven't used it for at least 6 months now [01:41:38] bd808: do we really need the plugins? [01:42:00] bd808: I propose we either not use the plugins, or just put the jars in a regular git repo and do a git::clone [01:46:00] how do you do remote execution without salt? back to dsh? [01:48:41] mutante: I've pssh [01:48:43] on my local machine [01:48:55] mutante: also, opening 15 tabs and copy pasting > salt at that tiem :P [01:48:57] *time [01:49:59] hehe @ 15 tabs [01:50:01] bd808: in terms of grossness, (b) sounds less gross [01:50:08] looks up pssh [01:50:31] ah, the google code one [01:50:34] gotcha, yea [01:51:00] gotta go, brb [01:56:37] by now i assume this means "gotta switch to another channel":) [02:00:12] yuvipanda: did you see the nova conductor critical alert? [02:00:35] maybe even the SMS? [02:01:04] yes but it's labtestcontrol2001 I see now [02:01:13] puppet setup monitoring on me and I didn't realize [02:01:17] ok I'll silence and nvmd [02:01:24] i already made a ticket that hosts with "test" in their name should probably not send SMS [02:01:34] yes [02:01:54] do you think this may be related to the icinga issue with [02:02:04] "labcontrol2001" [02:02:15] or is labtestcontrol2001 completely separate [02:02:21] completely separate [02:02:25] alright [02:04:01] so i tried fixing the issue with labcontrol2001 but it keeps coming back [02:04:06] and because that breaks icinga config [02:04:26] i can't really see if the ores monitoring works now [02:04:48] i can temp. fix it by nuking labcontrol2001 repeatedly [02:05:23] yuvipanda: we can maybe figure out how to get elasticsearch installed without a deploy server.
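The pssh workflow mentioned above (remote execution without salt) amounts to fanning one command out over ssh in parallel. A rough, hypothetical sketch of the same idea; hostnames and ssh options are illustrative, not from any Labs tooling:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_hosts(hosts, command, ssh=("ssh", "-o", "BatchMode=yes")):
    """Run `command` on each host concurrently over ssh;
    return {host: (returncode, stdout)}."""
    def one(host):
        proc = subprocess.run(list(ssh) + [host, command],
                              capture_output=True, text=True)
        return host, (proc.returncode, proc.stdout.strip())
    with ThreadPoolExecutor(max_workers=16) as pool:
        return dict(pool.map(one, hosts))
```

This is essentially what "opening 15 tabs and copy pasting" automates, minus pssh's niceties like per-host output files and failure summaries.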
The full ELK stack uses Trebuchet on all parts right now however [02:05:26] made a ticket for that too, both tagged operations/icinga/labs-infra [02:05:43] bd808: it's 6, i think he is on the way to office [02:05:47] labcontrol2001 is pretty ineffectual mutante I wouldn't worry about nuking it or doing weird stuff to get icinga going [02:06:04] it's slated for decommission soon anyway afaik [02:06:07] chasemp: it only fixes it until the next puppet run [02:06:19] eh, it seems something got renamed [02:06:30] the error is about "labs-ns1" [02:06:32] what's the issue? [02:06:40] but you have to kill labcontrol2001 to fix it [02:06:52] mutante: heh. isn't that when normal people would be leaving the office? (says the guy that just finished dinner and hopped back on irc to answer some pings) [02:07:04] that is probably andrewbogott's changes I guess, I believe he was doing dns things [02:07:27] chasemp: this https://phabricator.wikimedia.org/T120050 , yes it is [02:07:39] yea, it is related to a host being renamed [02:08:02] bd808: yes :) it is!:) [02:10:16] ah, but i could fix the "ores" monitoring thing [02:10:21] that works now [02:10:32] yuvipanda: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ores.wmflabs.org&service=ORES+home+page [02:11:42] mutante: sorry, I don’t know why it came back [02:11:52] I guess I should’ve assigned the placeholder to labcontrol2001 :( [02:12:13] sorry for the ping andrewbogott I know it's late [02:12:37] hm... [02:13:16] it's not like icinga is down or anything [02:13:19] we just cant add new stuff [02:13:49] so i think it doesnt have to be solved right the second [02:13:55] though we should soon [02:13:59] I mean if labcontrol2001 doesn't matter we could just disable puppet there and silence it [02:14:14] it shouldn't generate the icinga config again if it's missing will it? 
or I guess it's in puppetdb already [02:14:16] so darn [02:14:21] if we can disable puppet, that would fix it [02:14:29] disable puppet agent, stop salt minion [02:14:40] revoke puppet cert? [02:14:47] andrewbogott: ^? [02:15:02] it re-adds itself when puppet agent runs [02:15:04] that's what happens [02:15:06] yes, disabling puppet would work [02:15:16] what i don't know is why it's duplicate [02:15:25] like, what's the other one from [02:15:35] I guess dupe from the other control node? [02:15:35] dammit everything is going wrong [02:15:39] with my laptop [02:15:48] mutante: the issue is that I moved dns from 2001 to 1002 [02:15:48] ok, let me just disable puppet on labcontrol2001 [02:15:51] some logic switch is defining it on both I guess [02:16:00] and kill it from stored configs one more time [02:16:00] and I’m leaving 2001 up on purpose in case there are stale dns caches pointing to it. [02:16:05] but yeah probably not a problem worth solving atm [02:16:10] 2001 is scheduled for decom in a few days anyway [02:16:29] so I’d say disable puppet, purge the icinga record... [02:16:38] and then you’ll have done some of my work when it comes time to decom on Thursday :) [02:16:57] (I was about to suggest an alternative convoluted solution but can’t think of why that would be better) [02:17:06] ok, fair enough [02:17:30] ok guys I'm out :) sorry for the page from labtestcontrol what happened was the broken icinga made the new services define way later than the change so I missed it [02:17:51] "moved dns" means the same IP address is now on a different server, right [02:18:01] new ip [02:18:02] but same hostname [02:18:11] so, labs-ns1 is a new IP on a new server [02:18:28] 2001 is serving on the old IP with (in theory) no hostname associated with it [02:18:40] but clearly my transitional puppet work was missing a piece there.
[02:19:02] mutante: in case you’re curious, here is the context: https://phabricator.wikimedia.org/T119762 [02:19:58] mutante: but, ok if I disappear for now? [02:21:14] i'll just disable puppet on that host [02:21:23] cool [02:21:27] ok, here I go [02:22:19] me too [02:58:40] bd808: yeah, we can deal with ELK when we get to ELK [02:59:09] well if you want me to move stashbot that's ELK [02:59:26] or EL at least [02:59:55] bd808: oh, hmm. hmm, my thought process was that E should be infrastructure and the bot itself should run as a tool (on k8s maybe?) [03:00:12] the bot i'm using is Logstash [03:00:13] I didn't fully think it through, clearly... [03:00:25] right, but we can definitely run logstash on kube [03:00:40] so if you can do the E part I can help do the L part [03:01:11] *nod* [03:04:05] yuvipanda: for what I'm doing a custom bot writing directly to Elasticsearch may end up being nicer in the long run [03:05:08] bd808: yeah I agree :D [03:05:58] while I've got you here... I can't log into stashbot-deploy.stashbot.eqiad.wmflabs [03:06:08] I tried rebooting and that didn't fix it [03:06:44] bd808: so oh looking [03:07:00] bd808: also do we really need the plugins [03:07:02] ?
[03:07:18] we can get by without them [03:07:32] or we could package them into debs too really [03:07:46] they are just jars that need to be dropped into a directory [03:08:03] bd808: looks like puppet's been failing for a while [03:08:12] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item mediawiki_memcached_servers in any Hiera data file and no default supplied at /etc/puppet/modules/mediawiki/manifests/nutcracker.pp:19 on node stashbot-deploy.stashbot.eqiad.wmflabs [03:08:14] Warning: Not using cache on failed catalog [03:08:16] Error: Could not retrieve catalog; skipping run [03:08:28] I figured that was the problem [03:08:39] so that made it not pick up some ldap stuff [03:08:41] let me manually fix it [03:11:47] bd808: try now [03:12:05] I'm in. thanks! [03:12:36] bd808: yw! [03:12:47] * yuvipanda fixes more paws stuff [03:13:35] I'm looking at the elasticsearch stuff. The trebuchet dependency is in the role, not the module so we can totally avoid it [03:14:25] probably the easiest thing to do will be to make a new role just for this that is similar to role::logstash::elasticsearch [03:14:44] we will want a role to do whatever proxy stuff anyway [03:15:11] bd808: yeah [03:15:22] bd808: I think that makes sense. also allows us to drop the monitoring [03:16:44] should it go in the ::role::labs module or ??? [03:17:00] bd808: hmmmmmmmmm [03:17:02] good question [03:17:18] bd808: maybe role/tools.pp? [03:17:44] bd808: that's where I'm putting all the 'new' tools code (ones that do not inherit from the common toollabs module, which brings in all the gridengine stuff) [03:17:55] I'll move it around at some point soon but maybe in the meantime it's a good spot for it [03:18:36] or I could start by putting it in the right place instead of the gawdaweful manifests pile [03:18:46] ;) [03:18:51] bd808: unfortunately no, since we've to move them all in 'one go' [03:19:05] really?
[03:19:10] bd808: having similar items in both places makes puppet freak out [03:19:17] ah [03:19:23] which is why modules/role/manifests has no top level items [03:19:44] so plan is to make sure all the things in manifests/role are autolayouted and then move them in one fell swoop to modules/role [03:20:01] but that has problems, especially with tools since there's so many groups that need to change on wikitech [03:20:04] so i've been putting that off a bit [03:20:26] also manifests/role/labstools.pp is a mess, because inheritance for not very good reasons [03:20:57] oh gross [03:21:03] yeah [03:21:05] why did you make me look at that [03:21:32] bd808: hence manifests/role/tools.pp :) [03:21:43] I'll slowly migrate things, and then labstools might be just tools::legacy [03:21:45] or something like that [03:21:49] and has only the gridengine stuff [03:21:54] *nod* [03:22:15] bd808: so I'd suggest making it role::tools::elasticsearch and putting it in tools.pp for now [03:22:22] and I can break tools.pp out into a folder later [03:22:36] bd808: hmm you can also make a 'tools' folder in modules/role/manifests [03:22:38] and put it there [03:22:40] either works for me [03:22:44] latter is cleaner I guess [03:22:56] that was my "start in the right place" idea [03:23:18] right [03:23:20] fair enough [03:23:26] I think that *should* work [03:23:28] and not freak out [03:23:30] about tools.pp [03:23:34] vs modules/tools [03:23:37] but I'm not fully sure [03:23:41] and it might freak out [03:23:51] and we can't really test with puppet compiler [03:24:06] do you have a ::role::tools class? 
[03:24:15] if not it *should* be fine [03:24:26] but we could test in another project [03:25:00] I'll just do it in the labs.pp [03:25:08] you can clean things up later [03:25:27] err tools.pp [03:25:33] yeah [03:25:42] that was my 'oh god this thing is a mess' reaction [03:26:31] bd808: also, the k8s hosts have their own puppetmaster (since we need working secrets / private repo) [03:26:39] bd808: not sure if there's any reason to use that for the es hosts though [03:30:10] we should be fine without that I think [03:30:22] yeah [03:30:36] * bd808 is still trying to think of how identd could work for the proxy [03:30:55] bd808: oh, so that's what we use for proxylistener [03:31:10] modules/toollabs/files/proxylistener.py [03:33:23] yuvipanda: shouldn't that be rewritten in nodejs to be web scale ;) [03:33:57] * yuvipanda rewrites bd808 in nodejs [03:34:20] bd808: if I were writing it today I'll write it in asyncio or somesuch :) [03:34:56] coroutines! [03:35:01] bd808: I wonder how hard proxying http would be with py3 [03:35:19] I was wondering about node or go actually [03:35:22] yeah [03:35:25] I'd be ok with Go too [03:35:38] does it have gc issues? [03:35:45] as a proxy? [03:35:46] it does but not at this scale [03:35:52] cool [03:35:57] and they're working on it [03:35:59] so [03:36:01] we should be ok [03:36:40] bd808: eventually I want us to not use identd [03:36:51] bd808: and instead use... something else. just not sure what this 'something else' is [03:37:00] in general I've been thinking that a go frontend proxy for elasticsearch might be the right way to try and build the whole per-project logging thing [03:37:34] * yuvipanda nods [03:37:41] a prefix-enforcing proxy! [03:37:44] !
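The "prefix-enforcing proxy" idea above boils down to one check: each tool may only touch indices under its own namespace. A hypothetical sketch of that check (the `tools-<tool>-` index naming and URL shape are assumptions, not anything the proxy actually shipped with):

```python
def allowed(tool, path):
    """True if an Elasticsearch request path like '/index/_search'
    stays inside the tool's own 'tools-<tool>-' index namespace.
    Naming convention is illustrative only."""
    # The first path segment of an ES REST request is the index name.
    index = path.lstrip("/").split("/", 1)[0]
    return index.startswith("tools-%s-" % tool)
```

The proxy would authenticate the caller, derive `tool` from the credentials, and reject any request where `allowed()` is False, so one tool can neither read nor clobber another tool's indices.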
[03:37:44] hello :) [03:38:07] also definitely go over node :) [03:39:17] bd808: I might play around with http://vulcand.io/ at some point [03:39:21] unrelated to this [03:39:29] but might help me get rid of our custom redis based dynamicproxy code [03:39:53] I saw you rambling about that yesterday [03:40:03] seems like a reasonable goal [03:40:12] custom software is a trap [03:41:41] yeah [03:43:02] first must focus on fixing paws [03:43:06] * yuvipanda files more paws bugs [03:54:22] 6Labs, 10Tool-Labs, 15User-bd808: Setup an experimental, user accessible (read+write) ES cluster for Tool Labs - https://phabricator.wikimedia.org/T120040#1843831 (10bd808) Current plan: * 3 XL jessie instances in the tools project * new `::role::toollabs::elasticsearch` Puppet role to provision: ** Elastics... [04:21:46] yuvipanda: I got hiera hacked to fix my puppetmaster -- https://wikitech.wikimedia.org/w/index.php?title=Hiera%3AStashbot&type=revision&diff=213671&oldid=191987 [04:26:33] bd808: why... are those required for stashbot [04:26:38] deployment server? [04:26:41] yeah [04:26:49] which pulls in a full MW setup [04:26:50] ugh [04:26:52] of course [04:27:03] and apparently has some wonky config needs [04:27:08] yeah, let's just...not use the plugins >_> [04:27:46] trebuchet and scap need to be split apart but that's for another day [04:27:56] s/day/$unitoftime/ [04:29:30] I did my KP duty for the week with l10nupdate stuff over the weekend [04:29:35] heh [04:29:37] KP? [04:30:17] military term for cleanup crew work. kitchen patrol I think [04:30:34] https://en.wikipedia.org/wiki/KP_duty [04:30:40] haha [04:30:42] ofc [05:12:40] yuvipanda: can you raise the security groups quota for tools?
We are at 10/10 and I'd like a group for the elasticsearch stuff [05:12:50] bd808: sure [05:13:03] doing [05:13:47] !log tools increased security groups quota to 50 because why not [05:13:50] bd808: ^ [05:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [05:14:06] thx [09:02:16] 6Labs, 10Labs-Infrastructure, 6operations, 7LDAP, 7discovery-system: Allow creation of SRV records in labs. - https://phabricator.wikimedia.org/T98009#1844087 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [09:02:57] 6Labs, 10Labs-Infrastructure, 6operations, 7LDAP, 7discovery-system: Allow creation of SRV records in labs. - https://phabricator.wikimedia.org/T98009#1257018 (10MoritzMuehlenhoff) The new openldap::labs servers based on OpenLDAP (seaborgium, serpens) provide that schema, a quick test was fine. [09:21:46] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Database-Queries: Get access to an old database on tools-db - https://phabricator.wikimedia.org/T101709#1844122 (10Pouyan) any success on this :). [09:32:50] 6Labs, 6operations, 7Puppet: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1844125 (10fgiunchedi) @maxsem or @milimetric I can't access either of those instances, could you add my wikitech user 'Filippo Giunchedi' to the project(admin) ? thanks! [09:40:26] 6Labs, 6operations, 7Puppet: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1844127 (10fgiunchedi) >>! In T119541#1844125, @fgiunchedi wrote: > @maxsem or @milimetric I can't access either of those instances, could you add my wikitech user 'Filippo Giunchedi' to the project(ad... [09:52:17] 6Labs, 6operations, 7Puppet: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1844142 (10fgiunchedi) debugged a bit further, e.g. on `puppet-test02` I can get past the error by explicitly `include base` on the node definition in `site.pp` (since `compile puppet.conf` is defined... 
[09:52:46] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Database-Queries: Get access to an old database on tools-db - https://phabricator.wikimedia.org/T101709#1844144 (10valhallasw) s51347 = tools.fatg. You should be able to access it from the 'fatg' tool account. [11:13:58] 6Labs, 10MediaWiki-extensions-Newsletter: Internal error when creating new user in newsletter-test.wmflabs.org - https://phabricator.wikimedia.org/T119945#1844349 (10Glaisher) Not a problem with the extension itself. Looks like there is no free space available for redis on newsletter-test instance. [11:19:08] 6Labs, 10MediaWiki-extensions-Newsletter: Internal error when creating new user in newsletter-test.wmflabs.org - https://phabricator.wikimedia.org/T119945#1844354 (10Qgil) I guess a bigger instance can be requested, then? [11:24:14] 6Labs, 10MediaWiki-extensions-Newsletter: Internal error when creating new user in newsletter-test.wmflabs.org - https://phabricator.wikimedia.org/T119945#1844359 (10Glaisher) @01tonythomas and @valhallasw were discussing about that the other day. I'm not sure about how it went though. As a short term fix, I t... [11:53:35] 6Labs, 10MediaWiki-extensions-Newsletter: Internal error when creating new user in newsletter-test.wmflabs.org - https://phabricator.wikimedia.org/T119945#1844417 (10valhallasw) The issue was not in the data on redis (which was a few MB) but rather in other stuff on /srv. /srv on a small instance is only 1.5GB... [12:19:34] (03CR) 10Aude: [C: 031] "would +2 except that i think deployment of these is handled differently now and don't think i have permission for that" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/247831 (owner: 10Addshore) [12:48:46] (03CR) 10Merlijn van Deen: "Yeah, this is running on k8s and Yuvi is the only one with the keys to the castle." 
[labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/247831 (owner: 10Addshore) [12:59:35] Hi folks, could the new maps-warper warper.wmflabs.org OAuth registration be approved please? It adds "edit existing pages" to rights. https://meta.wikimedia.org/wiki/Special:OAuthListConsumers?name=&publisher=Chippyy&stage=0 Thanks! [13:00:27] also #T73257 was re-opened by a user if we need a ticket [13:17:10] 10Tool-Labs-tools-Other, 7Epic: Convert all Labs tools to use cdnjs for static libraries and fonts - https://phabricator.wikimedia.org/T103934#1844572 (10Ricordisamoa) [13:23:07] actually delay that request, I need to specify it just for commons [14:33:48] yuvipanda: Do you need all your open PAWS OAuth requests? :P [14:34:00] You've got 8 currently... [14:34:09] Can all but the newest be declined? [14:36:28] Someone remind me: how do I kill a qsub/jstart/whatever it is process? [14:36:34] I have the process ID. [14:40:39] harej: qdel [14:40:49] Thank you Coren [14:41:11] Also 'jstop ' if it was started with jstart. Same concept, just simpler to remember. [14:41:28] (qdel always works if you have the job id) [14:41:29] Does it matter which I use? [14:41:51] No. jstop is only there for symmetry/ease of remembering. [14:42:07] I see. [14:42:14] Well, thank you! [14:43:14] My pleasure. [14:54:45] 6Labs, 6operations, 7Puppet: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1844724 (10akosiaris) I just realized the `site.pp` comment above is the reason `puppet-test02` is having problems. labs have an LDAP enc, reusing site.pp to override/overload...
I do dozens more of these being pinged by IRC and answering labs... [16:50:02] RECOVERY - Host tools-shadow-01 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [16:50:18] 6Labs, 10Labs-Team-Backlog: Labs: create a new scheme for /etc/security/access.conf customization - https://phabricator.wikimedia.org/T120106#1845099 (10coren) [17:05:33] That was painful. [17:29:15] PROBLEM - Host tools-shadow-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.87) [17:29:49] shinken-wm, you moron. That hoes doesn't even exist anymore. [17:30:04] host* [17:30:06] * Coren blushes. [17:31:22] lol [17:39:16] RECOVERY - Host tools-shadow-01 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [18:01:29] 6Labs, 6operations, 7Puppet: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1845269 (10MaxSem) [18:09:14] PROBLEM - Host tools-shadow-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.87) [18:19:15] RECOVERY - Host tools-shadow-01 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [18:29:17] yuvipanda the ORES system hit Hacker News ;-) \ [18:29:25] !log tools switching gridmaster activity to tools-grid-shadow [18:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:36:44] yuvipanda: up and working? I’d like to distract you with https://gerrit.wikimedia.org/r/#/c/256170/ [18:37:19] Coren: notifying us for days about deleted hosts seems to be a thing shinken does these days. [18:37:24] darkblue_b: and wired.com [18:37:31] andrewbogott: just woke up, I'll check in like 20 mins? [18:38:29] * darkblue_b picks up bad science fiction novel and wishes Wikipedia editing well ;-) [18:38:45] yuvipanda: sure [18:38:49] \o/ [18:39:19] * halfak tried to remember his hacker news login and eventually gave up [18:40:29] * halfak sees Kenrick95 in the comments [18:41:23] Darn it kenrick95 doesn't know about our stat reporting! [18:58:39] andrewbogott: Yeah, I'm not sure how to forcibly let it know it no longer exists. 
andrewbogott: I thought it'd take one or two puppet runs at most. [18:59:15] PROBLEM - Host tools-shadow-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.87) [19:02:27] andrewbogott: Coren usually ^ indicates puppet's failing [19:02:43] yuvipanda: On the shinken box, you mean? [19:02:53] yes [19:04:19] No alerts about that, but I'll go look. [19:04:26] thnks [19:04:28] *thanks [19:05:03] andrewbogott: I guess the commandline works? if so lgtm [19:05:24] yuvipanda: well, it sort of works [19:05:33] I’m confused as to why there are no instances listed at the bottom of the testlabs project page [19:07:37] ah [19:07:53] well, I'm sure you know what I think of SMW... [19:07:59] and I'm usually surprised when it does work... [19:08:08] it’s crappy but I’d like to at least understand what it’s supposed to do [19:08:18] hmm, ask in the SMW channels maybe? [19:08:23] for me the mystery is more why there /are/ instance lists on other pages since there doesn’t seem to be anything in the template [19:08:27] yeah, I’ll read more and then see [19:08:35] heh [19:09:12] * Coren is a bit confused. [19:09:36] yuvipanda: puppet wasn't disabled on shinken-01, yet a manual puppet run obviously just caught up with days of changes at least. [19:10:01] * Coren tries to see if there is anything wrong with the cron entry. [19:11:52] have I logged that I was going to restart a mysql? [19:12:27] (asking on the wrong channel, obviously) [19:58:17] andrewbogott: small conundrum. /srv/glance/ wasn't handled w/ puppet on glance servers. So I changed it around to be created, tested on labcontrol1002 and saw it was owned by root not glance, and adjusted. then rolled out to labcontrol1001 and it was owned by glance and not root :) [19:58:27] so not sure which is more correct but fyi [19:58:47] (tldr /srv/glance was not managed and permissioned differently on each labcontrol) [20:00:29] Coren: did you use salt to do pam cleanup? [20:00:38] yuvipanda: Aye.
[20:00:39] if so, did it actually hit any/all instances? [20:00:47] it didn't at all the last time I used it [20:00:50] so do verify :) [20:00:53] chasemp: they were different before you puppetized? [20:00:59] that's what I mean [20:01:03] yuvipanda: Afaict, it hit every instance - I made sure to run it more than once and compare. :-) [20:01:10] Coren: ok! [20:01:10] now they are the same, but unsure which was ...more correct [20:01:25] yuvipanda: The self-hosted puppets were hit too, but of course the script wasn't there to run. [20:01:48] It'd be nice if we had grains for "all the not-self-hosted" and "all the self-hosted" [20:02:15] andrewbogott: what would I test to ensure sanity? [20:02:40] chasemp: I think all that matters is that glance/images is owned by glance [20:02:52] kk just a mental then for you [20:02:56] in case of $something [20:02:57] to test you would need to upload a new image to glance, then verify that it (eventually) gets synced to labcontrol1002 [20:03:05] mental note that is [20:03:07] and also that an instance can be built using that [20:03:14] I imagine you're right [20:03:48] chasemp: I think that nodepool is constantly creating and deleting images [20:04:01] So another option is just wait and see if hashar complains :( [20:04:22] hashar as an alert system confirmed [20:05:38] chasemp: there’s a cron that rsyncs that dir from 1001 to 1002. That’s the only part that might break without us noticing [20:46:54] heya, there are some local changes on deployment-puppetmaster [20:46:59] looks like scap related stuff [20:47:03] anyone mind if I reset? [20:47:36] ottomata: you should ask in #wikimedia-releng [20:47:39] i did [20:47:51] no response til a sec ago...
:) [20:55:31] 6Labs, 6operations, 7Puppet: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1845820 (10akosiaris) `puppet-test02` can be considered fixed btw [20:57:05] 6Labs, 6operations, 7Puppet: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1845826 (10yuvipanda) So was the problem just including things in site.pp vs including it via LDAP? [21:01:55] 6Labs, 6operations, 7Puppet: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1845854 (10akosiaris) More precisely it was the fact that the //node// was defined in both site.pp and LDAP. [21:54:19] Hi, I need some help getting a WMF contractor set up with an Instance on Labs: https://phabricator.wikimedia.org/T118246 [21:54:54] We're building The Wikipedia Library's digital library card platform: https://phabricator.wikimedia.org/T102048 [21:55:35] @andrewbogott @Coren @yuvipanda just pinging in case you're around [21:56:38] can you tell me more about why you need an instance? [21:57:48] sure. we currently serve 3000 editors with access to 5000 accounts from major publishers. each editor signs up for an account individually, and often needs to give us their email address so the publisher can give them access. we can't collect that kind of information and share it with a third-party under the tool labs policies. [21:58:09] we already do this manually, with WMF-approved legal language and NDAs, we are just systematizing it in a web app [21:58:50] ok, so the next step would be to click ‘request project’ on https://wikitech.wikimedia.org/wiki/Main_Page [21:58:58] and enter a project request ticket [21:59:19] i think we have it already: https://phabricator.wikimedia.org/T102048 [21:59:45] or is it something different we need to request? [22:00:31] that ticket looks like it’s describing a gsoc grant, or something?
[22:01:17] A labs project is a thing that provides access to create and destroy instances, usage quotas, things like that [22:01:41] this is a project approved in last year's annual plan for the Wikipedia Library to build through a contractor. We're in Community Engagement, it's been approved by Luis [22:01:50] ok, so we need a labs project then? [22:02:27] yes [22:02:45] is this what we needed to do? https://wikitech.wikimedia.org/wiki/New_Project_Request/TWL [22:03:01] nikki is a WMF contractor [22:03:30] I would encourage you and ThatAndromeda to read https://wikitech.wikimedia.org/wiki/Help:FAQ and related links… there’s a bit too much context for me to rattle off on IRC :) [22:04:04] also please subscribe to labs-l so you get news of system changes and such [22:05:07] Is Nikkimaria == thatandromeda? [22:05:08] will do. andrew, i went through this once before for another TWL project: https://wikitech.wikimedia.org/wiki/Nova_Resource:Fulltext.full-text-reference-tool.eqiad.wmflabs [22:05:36] nikkimaria is our head of volunteer coordination; thatandromeda is a hired developer [22:05:47] ah, ok [22:05:49] it's just a confusing process for me despite the context [22:06:31] So, nikkimaria is now admin for the TWL project. They can add and remove other users [22:06:34] and create instances and such [22:06:48] wow, so that's it? [22:07:26] I guess :) [22:07:44] Most everything should be self-serve unless you need resources that aren’t part of the standard quotas [22:08:25] andrew, thanks!!! [22:08:47] sure [22:08:56] for now we're good. phase 1 is pretty simple. phase 2 come summer might involve more intricate integration with other library services and proxies [22:09:35] and our user/server load is going to be totally manageable. nothing data or query-heavy for now [22:54:51] @andrewbogott Something is working right, but also weird. 
We have this 1) https://wikitech.wikimedia.org/wiki/Nova_Resource:Twl and ALSO this 2) https://wikitech.wikimedia.org/wiki/Nova_Resource:The_Wikipedia_Library [22:55:11] they have the same project name but different page names and different configurations [22:55:37] also, nikki can't manage the 2) version even though she's an admin [22:56:34] we just redirected one to the other and now it's working, but i'm unsure if we're doing something wrong [22:56:36] ocaasi: it looks to me like you tried to rename the twl page [22:56:44] ah [22:56:55] all of those nova resource pages are auto-generated and maintained [22:57:09] right i did, but it gave me an error when i tried to rename it [22:57:12] you should not edit them, other than via the ‘add documentation’ link [22:57:15] so i didn't think it went through [22:57:17] ok, gotcha [22:57:41] I would advise you to delete the new page and pretend it never existed :) [22:58:11] so keep 1) and delete 2) ? [22:58:21] yes, leave twl alone [22:58:26] and delete the page you created [23:01:08] ok, i can't delete pages, but i redirected 2) to 1), is that enough? [23:02:26] yeah, looks ok [23:02:38] andrewbogott: also, oddly abartov is showing up as an admin in the project list but not on the novaresource page [23:02:41] great, thanks! [23:03:03] the novaresource pages are generated periodically, they won’t be instantly up-to-date [23:03:11] you should ignore them as much as possible [23:03:11] ok, that solves it [23:03:14] :) [23:03:18] thank you, kindly, good sir. [23:16:24] (03PS1) 10Dzahn: move gsbmonitoring to monitor/gsb.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/256601