[00:00:11] basically set up one proxy for MW and then another proxy with -restbase appended to the name, pointing to RB [00:00:31] ah [00:00:33] cool [00:00:35] external ports both 80, internal ports 8080/7321 [00:00:50] then fiddled with the mw-vagrant config to get it to actually use these instead of mediawiki-vagrant.wmflabs.org [00:15:24] or localhost or whatever [00:40:50] YuviPanda, did you review https://gerrit.wikimedia.org/r/#/c/245200/1 yet? [00:41:57] (03CR) 10Yuvipanda: [C: 031] "Needs merge + package rebuild + deploy. I'll do that tomorrow?" [labs/invisible-unicorn] - 10https://gerrit.wikimedia.org/r/245200 (https://phabricator.wikimedia.org/T69927) (owner: 10Alex Monk) [00:42:02] Krenair: there. [00:42:15] YuviPanda: Thanks. [00:42:15] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review, 3labs-sprint-118: Can't delete NovaProxy instance with malformed DNS hostname - https://phabricator.wikimedia.org/T69927#1756022 (10yuvipanda) [00:42:22] Krenair: added sprint tag so I won't forget [00:42:37] Shall I make a habit of doing that? Or should only you guys do that? [00:45:18] Krenair: no only the labs team should put stuff in the sprint I think [00:45:34] ok [00:45:35] there's also 'labs-team-backlog' which is 'stuff we should do but not this sprint' [00:45:46] I'll just keep nagging on irc then :P [00:45:52] yup [00:46:29] I've been trying to clean up my gerrit backlog [00:46:52] Scheduled about 10 things for different swats [00:47:35] !autoswat [00:47:37] * mutante hides [00:48:02] That's a terrible idea for so many reasons. [00:48:26] * YuviPanda signs up mutante for swat every day [00:50:40] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, and 2 others: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1756027 (10yuvipanda) I'll add more logging to this tomorrow and then turn it on. [00:56:17] YuviPanda: 451 code review actions in last 30days [00:56:26] Krenair: kidding ..kidding :) [00:56:32] nice! 
[00:56:55] most of mine are V+2 C+2 >_> [00:58:06] YuviPanda: not letting jenkins V:+2? :) [00:58:22] JohnFLewis: so a lot of times it is [00:58:24] 1. push [00:58:26] 2. jenkins +2 [00:58:30] 3. hit rebase button [00:58:35] 4. push C+2 V+2 [00:58:42] heh, yea, i notice the V+2 each time :p [00:58:50] when looking at the entire history, then [00:58:58] yeah [00:58:59] Chad is _by far_ the most active code review user [00:59:07] 46076 [00:59:12] rank 2 is only 17k [00:59:13] woah [00:59:16] YuviPanda: be less urgent! you're not endangered (yet!) [00:59:19] heh [00:59:20] link? [00:59:24] JohnFLewis: it breaks flow, etc :) [00:59:40] it's not like I can test anything locally. [00:59:55] YuviPanda: no, that's the developers do, you fix flow ;) [00:59:59] the things I can test locally don't go through this at all [01:00:01] Krenair: http://korma.wmflabs.org/browser/scr.html [01:00:19] that's all 'write/test/write/test/write/test/commit/+2' [01:01:10] reload/reload/reload/jenkins verified/merge :p [01:01:20] it also gives me nothing [01:01:24] if I just did a rebase [01:01:38] I also have syntastic setup on my vim to run puppet-lint [01:01:44] so it'll warn me every time I save a file [01:01:45] if it's failing [01:01:57] same for python (runs pyflakes via tox, etc) [01:02:12] so the the 'reload reload reload' gives me no advantage I can think of [01:02:22] http://korma.wmflabs.org/browser/irc.html haha, look at the top 2 :) [01:03:19] wait with complete history I'm the highest non-bot?! [01:03:25] jesus [01:03:50] with like 5k messages more than the next?! [01:04:34] hehe [01:04:45] * YuviPanda shuts mouth, goes afk [01:05:01] you mean 50k :) [01:05:45] mutante: shit [01:05:48] that makes it even worse [01:05:50] wtf. [01:05:52] ... [01:05:53] hehhe [01:06:00] JohnFLewis: check the "mailman" section :o [01:06:02] and this isn't covering any of the private channels [01:07:10] :) right [01:07:15] ... [01:07:21] what am I doing with my life... 
[01:08:40] YuviPanda: it could be worse, it's IRC, not mailing list. i just looked at that [01:08:47] message posters: [01:08:53] Andre Klapper 4592 [01:09:01] bugzilla-daemon 642 [01:09:16] #1 and #2 [01:09:36] what is he posting to :) [01:09:57] and good night [01:12:15] mutante: :) I'll try staying away from RIC today [01:16:17] YuviPanda: RIC? What's Ric done to you :( [01:23:24] ric rolled [01:23:28] cya.out [02:23:08] Labs, Labs-Infrastructure, Patch-For-Review: Automate generation of floating/private dns aliases in the labs recursor - https://phabricator.wikimedia.org/T100990#1756151 (Krenair) Despite being assigned to me, this is currently waiting for input from labs ops. [02:23:14] Labs, Labs-Infrastructure, Patch-For-Review: Automate generation of floating/private dns aliases in the labs recursor - https://phabricator.wikimedia.org/T100990#1756152 (Krenair) (see commit) [02:23:43] Labs, Monitoring, Patch-For-Review, Shinken, Upstream: shinken.wmflabs.org redirects on https-login to http - https://phabricator.wikimedia.org/T85326#1756153 (Krenair) [02:51:22] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [02:52:12] Labs, Phabricator, Puppet: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1756160 (Negative24) Prod uses some variables from the private puppet repo for setting passwords and such. But the prod class also sets up mail servers and relays that would be complex to manage in l... [03:24:08] Labs, Phabricator, Puppet: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1756194 (Dzahn) Ok, thanks for the explanation. I see.. I truly wish we could fix all this and really use the same role class in labs and prod. (and then apply changes in labs first for testing lik... 
[03:25:28] 6Labs, 6Phabricator, 7Puppet: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1756195 (10Dzahn) The password part we can fix with the labs/private repo. Happy to help. Not so sure about the mail server part though. [04:01:27] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [08:29:53] How do I use a virtual env with jsub >.> ? [08:30:10] I was planning on this but it doesnt seem to work [08:30:10] jsub -N geo2png -mem 8G source python/bin/activate && python ~/src/python/geo2png.py ~/data [08:30:25] I guess, I could stick that in a bash script... [08:33:15] okay, that doesnt work either... [08:52:24] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [08:58:52] <_joe_> addshore: just use ~/python/bin/python ~/src/python/geo2png.py ~/data [08:59:11] yup, I just had a path wrong >.> [08:59:13] <_joe_> where ~/python/bin/python is of course the path to your python executable in the virtualenv [08:59:22] <_joe_> instead of using source [08:59:29] <_joe_> which might mess up in some way [09:32:23] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [13:40:12] 6Labs, 10Labs-Infrastructure, 6operations, 10ops-eqiad, 3labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1757149 (10chasemp) p:5Triage>3Normal [14:31:23] 6Labs, 10Deployment-Systems, 10Labs-Infrastructure, 10Salt, and 2 others: Can not use git-deploy from tin.eqiad.wmnet to labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T111925#1757315 (10hashar) 5Open>3declined a:3hashar Solved by using a git clone directly where needed. 
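On the jsub virtualenv question from 08:29: in `jsub ... source python/bin/activate && python ...` the login shell splits the line at `&&`, so jsub is only handed `source python/bin/activate` and the `python` call runs locally afterwards. _joe_'s suggestion of calling the virtualenv's own interpreter avoids that. A minimal sketch in Python of assembling such a command line; `jsub_command` is a hypothetical helper, and only the `-N` and `-mem` flags that appear in the log are used:

```python
import posixpath


def jsub_command(job_name, venv_dir, script, *script_args, mem="8G"):
    """Build a jsub argv that runs `script` with the virtualenv's interpreter.

    Hypothetical helper: it relies only on the -N and -mem flags seen in the
    log and never sources the activate script.
    """
    # The interpreter inside a virtualenv lives at <venv>/bin/python.
    python_bin = posixpath.join(venv_dir, "bin", "python")
    return ["jsub", "-N", job_name, "-mem", mem, python_bin, script, *script_args]


if __name__ == "__main__":
    # Example paths are placeholders for a tool's home directory.
    print(" ".join(jsub_command(
        "geo2png",
        "/home/example/python",
        "/home/example/src/python/geo2png.py",
        "/home/example/data",
    )))
```

Passing the list to `subprocess.run` (or joining it for a shell) keeps the whole job as a single command, so nothing is left for the local shell to run after jsub returns.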
[14:47:55] Labs, Tool-Labs: Redirect Dispenser's tools - https://phabricator.wikimedia.org/T116757#1757394 (Dispenser) NEW [14:49:35] Tool-Labs-tools-Other: Migrate http://toolserver.org/~dispenser/* to Tool Labs - https://phabricator.wikimedia.org/T68868#1757405 (Dispenser) declined>Open [14:49:36] Tool-Labs-tools-Other, Tracking: Toolserver.org tools that have not been migrated (tracking) - https://phabricator.wikimedia.org/T60865#1757407 (Dispenser) [16:11:41] hi all! is there no php cache installed on tool labs? i don't see apc nor xcache. how would i enable that [16:15:55] Labs, Labs-Infrastructure, operations, ops-eqiad, labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1757874 (chasemp) So we have had labvirt1010 installing for awhile now, but labvirt1011 has been a source of difficulty. Same box type, same ilo, same... [16:16:03] Labs, Labs-Infrastructure, operations, ops-eqiad, labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1757875 (chasemp) [16:30:03] Labs, Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#1757928 (Lucie) [16:30:05] Labs, Tracking: New Labs Project ArticlePlaceholderWiki - https://phabricator.wikimedia.org/T116771#1757926 (Lucie) Open>Invalid [16:37:43] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Forbidden - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 181 bytes in 0.001 second response time [16:40:20] looks like webservices are down [16:42:07] Toto_Azero: known and being investigated by Coren and andrewbogott [16:42:14] ok [16:54:12] Toto_Azero: Should be okay again. [16:54:22] Coren: it is, thx :) [16:54:51] Coren: etherpad.wikimedia.org isn't happy. related? [16:55:03] ah, coming back up it seems... slowly... [16:56:36] DanielK_WMDE_: Really shouldn't be. But etherpad doesn't need any help to explode. 
[16:56:49] Well, who can fix it?... [16:57:01] i no longer get an http error, but it's not loading either [16:57:40] DanielK_WMDE_: I can take a look in a bit, otherwise you can poke -operations [16:58:16] looks like it's flapping [17:00:50] question: any idea why I am getting a "libgcc_s.so.1 must be installed for pthread_cancel to work error on a script that worked well? [17:01:01] the problem comes from this line: [17:01:24] db=_mysql.connect(host='tools-db', db='s51245__totoazero', read_default_file="/data/project/totoazero/replica.my.cnf") [17:01:56] getting this only with jsub, not when running directly [17:03:38] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 917225 bytes in 2.526 second response time [17:05:04] Coren: is there a way to use apc or xcache on tool-labs? [17:05:27] (okay… so not getting it any more, strange) [17:05:45] Toto_Azero: You're hitting the memory limit; possibly because you hover very close to it. [17:06:49] DanielK_WMDE_: Possibly, though it's not installed afair. [17:06:59] i take that as a "no". [17:07:16] Coren: is there a reason not to install it? [17:07:28] no-one asked for it ;-) [17:07:30] using PHP without a byte code cache is, like, not done. it's VERY slow. [17:07:49] DanielK_WMDE_: What valhallasw`cloud said - you're the first to mention it. :-) [17:07:52] I'm also not sure how well it works in a multi-tentant fcgi situation [17:08:04] valhallasw`cloud: really? for years nobody noticed that php is unusably slow? or they just accepted it? [17:08:22] clearly it's not unusably slow ;-) [17:08:28] valhallasw`cloud: could potentially be a problem, i don't know enough about apc to comment [17:08:41] http://xcache.lighttpd.net/wiki/PhpIni suggests xcache can work with fcgi [17:08:50] valhallasw`cloud: people are just too used to suffering [17:09:06] want me to file a ticket? 
[17:09:07] there's an issue with php modules under trusty, though; it needs a phpenmod that has to be puppetized [17:09:13] yes, please [17:10:14] Coren: sorry, what do you mean by "you hover very close to it »? [17:11:06] Toto_Azero: The default memory allocation is rather conservative. The error message you got is a symptom of running out; but if it only happens some of the time it might be beause you are near but not always over that limit. [17:11:18] Toto_Azero: You can simply request more memory for your job with '-mem' [17:12:00] ok :) actually it happened earlier this afternoon! thank you [17:14:02] 6Labs, 10Tool-Labs: Install a PHP byte code cache on tool-labs - https://phabricator.wikimedia.org/T116780#1758200 (10daniel) 3NEW [17:15:10] valhallasw`cloud: https://phabricator.wikimedia.org/T116780 [17:20:16] DanielK_WMDE_: thanks! [17:20:40] valhallasw`cloud: now go ahead and set this to "high" :) [17:20:52] DanielK_WMDE_: it's not, because it's not actively broken ;-) [17:23:08] 6Labs, 10Dumps-Generation, 6operations, 10wikitech.wikimedia.org: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#1758255 (10ArielGlenn) [17:23:15] valhallasw`cloud: well, it sure doesn't work ,) [17:23:35] i guess that makes it passively broken [17:26:47] DanielK_WMDE_: also, php 5.5 includes it's own opcode cache, which as far as I can see is enabled [17:27:34] oh, i thought that was just hhvm [17:27:47] if it's there, then I wonder why it's so damn slow :) [17:29:08] 6Labs, 10Tool-Labs: Install a PHP byte code cache on tool-labs - https://phabricator.wikimedia.org/T116780#1758281 (10valhallasw) Zend OPcache is enabled on Trusty hosts: ``` Zend OPcache Opcode Caching Up and Running Optimization Enabled Startup OK ``` which is the standard OS for webgrid tasks. APC/xcode... [17:29:17] DanielK_WMDE_: which tool is this? 
[17:29:42] valhallasw`cloud: mediawiki ;) [17:29:55] https://tools.wmflabs.org/articleplaceholderwiki/mediawiki/index.php?title=Special:Version [17:30:12] (i didn't put it there, and i wouldn't recommend it - but it's *amazingly* slow) [17:30:45] mmm. It's running on trusty, so it should have an opcode cache. So the problem is probably somewhere else... [17:31:17] valhallasw`cloud: no object cache would be one reason... [17:31:40] at least it's not doing sqlite on nfs ;-) [17:31:43] or can mediawiki use the zend cache for the object cache? [17:32:39] I'm not sure if it also provides an object cache [17:34:20] Coren: andrewbogott whelp [17:34:24] valhallasw`cloud: that would be a reason to use xcache [17:34:33] YuviPanda: whelp? [17:34:37] hey yuvi [17:34:45] Coren: just saw the outage [17:34:52] YuviPanda: can you advise about shinken vs. catchpoint vs. labs services vs. paging? [17:35:34] andrewbogott: in the sense of? [17:35:39] what are we using for what? [17:35:58] YuviPanda: mostly in the sense of: I didn’t get a page when tools.wmflabs.org went down, what should I do to get paged next time? [17:36:13] hmm [17:36:24] I take it you didn’t get paged either :) [17:36:25] it's a bit confusing now. [17:36:41] no but someone did send me an SMS [17:36:50] ok, so I think what we'll do is [17:36:50] that was me [17:37:09] 1. catchpoint for metrics [17:37:25] 2. shinken for monitoring things that don't need paging [17:37:33] 3. icinga for things that *do* need paging [17:37:38] like, our NFS pages were from icinga [17:37:42] and it seems to work well enough [17:38:06] Coren: ^ [17:38:11] I can add a tools check to icinga now [17:38:23] and make that paging [17:38:25] seems reasonable for the moment [17:38:31] Seem reasonable. [17:38:48] andrewbogott: so on 1010 vs 1011, both installed, both puppet signed, same kernel but only 1010 [17:38:50] If I'm around my desktop, catchpoint alerts are noisy enough. [17:38:58] gives a 'nova-compute not installed on buggy kernels.' 
error [17:39:01] I'll let you poke it [17:39:15] chasemp: ok, I’ll look [17:39:26] YuviPanda: agreed, having all paging via icinga seems better than mixed tools [17:39:43] ok [17:40:27] andrewbogott: I assume as 1011 is not applying openstack things [17:40:38] chasemp: yep, that’s the only difference. [17:40:50] YuviPanda: in theory if we have a scalable alert arch and have a "icinga" at each site/context [17:41:16] could open up some modularity there but as an all in thing what you are saying is what we've got I imagine [17:41:41] chasemp: yeah but rabbbit hoooleee [17:41:49] yes [17:42:16] in an ideal world puppet would've also warned about <% @sometghing %> [17:42:19] without the equals [17:42:38] YuviPanda: and the mis-spelling of osmething? :) [17:42:42] the compiler may have [17:42:43] ... irony [17:42:49] JohnFLewis: heh [17:43:00] that mabye, but the equals definitely [17:43:06] chasemp: ah. but compiler doesn't work for labs [17:43:16] was that the problem? [17:43:19] missing equals? [17:43:50] 6Labs: Find and clean up instances that are unreachable by ssh - https://phabricator.wikimedia.org/T103522#1758400 (10ArielGlenn) interested in what's going on with this task because it's one of the things I watch when checking salt behavior on labs. still in progress? [17:43:54] Krenair: yeah [17:44:22] Krenair: so puppet executes it, goes 'nothing going on here, heh' and outputs nothing and goes on [17:45:30] Krenair: also for applying changes to proxy's nginx config [17:45:32] it's: [17:45:37] 1. disable puppet on active node [17:45:41] 2. run puppet on other node [17:45:49] 3. see if everything's ok [17:45:53] what? [17:45:58] you have two nodes? [17:46:01] 4. 
run puppet elsewhere [17:46:03] yes [17:46:05] tools-proxy-01 and -02 [17:46:09] so we can failover between them [17:46:10] I'm not on those [17:46:24] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin [17:46:26] yeah [17:46:33] I can only test with my little test instance in the openstack project [17:46:39] who merged the patch? [17:46:45] * YuviPanda looks [17:46:55] I did, then reverted when it broke [17:46:57] ah [17:47:11] andrewbogott: Coren: ^ for future merges! [17:47:16] same for the global proxy too [17:47:17] Yep. Noted. [17:47:21] novaproxy-01 and -02 [17:47:33] that's allowed me to merge some dangerous changes without worrying about downtime [17:47:46] YuviPanda: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#WebProxy [17:48:03] valhallasw`cloud: yeah, let me add this there [17:48:07] <3 [17:48:09] YuviPanda: Wise. [17:48:19] Are you actually telling me to test this in tools, YuviPanda? :P [17:48:30] Krenair: no, this is for after merging [17:48:32] Krenair: we have toolsbeta as well, which I think has a proxy [17:48:43] which doesn't actually proxy, I think? :/ [17:49:13] I don't know what toolsbeta does and doesn't. scfc is probably the only one who knows. [17:49:16] I can set up the dynamicproxys, I've done it before [17:49:23] Can we make puppet compiler work with labs? [17:49:49] https://gerrit.wikimedia.org/r/#/c/249182/1..2/modules/dynamicproxy/templates/urlproxy.conf [17:50:30] added [17:50:39] Krenair: I've looked into it at some point, but it's not entirely trivial because labs grabs the config from wikitech [17:50:43] good evening, I'm looking to see if anyone's spun up a jessie instance recently (last couple of weeks) and can check the version of salt installed on the host for me [17:50:50] YuviPanda, added what? 
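On the missing-equals problem discussed from 17:42: an ERB tag written as `<% @var %>` evaluates the variable and emits nothing, while `<%= @var %>` writes its value into the rendered file, which is why puppet ran without complaint. A rough sketch of the kind of warning wished for there, assuming a simple regex heuristic is acceptable; `find_silent_tags` is a hypothetical helper, not an existing puppet-lint check:

```python
import re

# Matches ERB tags whose whole body is a bare instance variable, e.g.
# "<% @domain %>" or "<%- @domain -%>". Output tags ("<%= ... %>") and
# control flow ("<% if @ssl %>") do not match.
SILENT_VAR_TAG = re.compile(r"<%-?\s*@\w+\s*-?%>")


def find_silent_tags(template_text):
    """Return (line_number, tag) pairs for tags that evaluate a variable
    but never print it. Heuristic only; it does not parse Ruby."""
    hits = []
    for lineno, line in enumerate(template_text.splitlines(), start=1):
        for match in SILENT_VAR_TAG.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Illustrative template fragment, not taken from the real urlproxy.conf.
    sample = (
        "listen <%= @port %>;\n"
        "server_name <% @domain %>;\n"
        "<% if @ssl %>\n"
        "ssl on;\n"
        "<% end %>\n"
    )
    for lineno, tag in find_silent_tags(sample):
        print(f"line {lineno}: {tag} produces no output (missing '='?)")
```

Run against a template before merging, a check like this would flag the tag without needing the catalog compiler, which the log notes does not work for labs.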
[17:51:01] on the instance that is [17:51:09] Krenair: oh just a section on https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#WebProxy [17:51:12] I have, then realised I was stupid, deleted it and started a trusty instance instead apergos [17:51:18] Krenair: but it would definitely be worthwile to have it [17:51:26] heh, oh well, can't find out from you then! [17:52:11] apergos, https://wikitech.wikimedia.org/wiki/Special:NewPages?namespace=498 [17:52:52] CI is quite loud there :/ [17:53:04] yeah those are autodeleted aren't they? [17:55:18] you can look through "select rc_title, rc_timestamp from recentchanges where rc_new and rc_namespace = 498 and rc_title like '%\.eqiad\.wmflabs' and rc_title not like 'Ci-%' order by rc_timestamp desc limit 5;" [17:56:01] 6Labs, 6Phabricator, 5Patch-For-Review, 7Puppet: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#1758510 (10Negative24) [17:56:29] I mean I couuuld [17:56:48] just spin one up to throw it away. lazy bum that I am [17:57:22] found oe [17:57:23] one* [17:57:44] puppet-jmm-openldap-s1 in the puppet project [17:58:25] 6Labs, 6Phabricator, 5Patch-For-Review, 7Puppet: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#1758522 (10Negative24) 5Open>3Resolved [17:58:37] I think you should be able to log into that apergos [17:58:48] as root, anyway [17:58:59] ah checking it out [18:04:08] 6Labs: fix labs jessie instances to have correct salt version - https://phabricator.wikimedia.org/T104849#1758550 (10ArielGlenn) root@puppet-jmm-openldap-s1:/# uptime 18:01:54 up 3:03, 1 user, load average: 0.00, 0.01, 0.05 root@puppet-jmm-openldap-s1:/# dpkg -l | grep salt ii salt-common... [18:04:25] 6Labs, 10Salt: fix labs jessie instances to have correct salt version - https://phabricator.wikimedia.org/T104849#1758553 (10ArielGlenn) [18:04:37] thanks, that was helpful. 
very helpful indeed! [18:04:50] * apergos eyes The_Real_NSA [18:05:52] 6Labs, 10Beta-Cluster-Infrastructure, 10Salt: Setup multimaster salt for large projects using salt-syndic - https://phabricator.wikimedia.org/T78466#1758577 (10ArielGlenn) [18:06:13] Coren: can you merge Krenair's re-revert of that breaking change? [18:06:20] or want me to do it? [18:07:11] YuviPanda: I'd rather you did it; if something breaks with the proxies you're faster and more precise doing diagnostics. But if it's a flat rerevert, I can tell you the result outright. [18:07:32] Coren: no, original patch was missing an '=' [18:07:41] sure let me do it [18:38:26] valhallasw`cloud: btw, all the changes for allowing ssh as tool accounts are done [18:38:39] valhallasw`cloud: can test it trivially, going to test on tools-dev soon [18:44:26] YuviPanda: ooooh. [18:45:10] valhallasw`cloud: so I can just set ssh::server::authorized_keys_command = '/usr/sbin/ssh-ldap-key-lookup --enable-servicegroups %u' [18:45:13] on that [18:45:15] and it'll work [18:45:29] let me finish reviewing this task for a minute and then I can test [18:45:37] valhallasw`cloud: still needs logging before we can fully rolll it out [18:49:23] mmm [18:51:48] done reviewing task [18:53:35] https://wikitech.wikimedia.org/wiki/Hiera:Tools/host/tools-bastion-02 [18:56:27] valhallasw`cloud: need to figure out why puppet is soooo slooow [18:56:42] YuviPanda: I've tried to do some profiling [18:56:47] mostly it's just doing a lot [18:56:58] it's calling apt-cache status et al a lot [18:57:27] or whatever it is with which you check a single package's status [18:57:41] which is stupid, because it should just do a single command to get the whole list + status :') [19:00:28] valhallasw`cloud: aaaah the package stuff [19:00:30] sigh [19:00:52] valhallasw`cloud: hmm, so that killed sshd [19:00:57] let me see what's going on [19:01:16] *slow clap* [19:01:17] :> [19:01:18] it's the right version [19:01:29] valhallasw`cloud: I do have shell 
and have reverted it [19:01:37] ;-) [19:03:18] valhallasw`cloud: https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ATools%2Fhost%2Ftools-bastion-02&type=revision&diff=197603&oldid=197593 was the problem [19:03:20] >_> [19:03:33] <_< [19:03:37] valhallasw`cloud: now [19:03:39] Oct 27 19:02:55 tools-bastion-02 sshd[24910]: fatal: Access denied for user tools.admin by PAM account configuration [preauth] [19:03:42] and normal ssh works [19:03:46] but, progress [19:04:31] YuviPanda: so sudo puppet agent --evaltrace -td [19:04:49] https://etherpad.wikimedia.org/p/tools-kubernetes-scenarios [19:04:51] err [19:05:04] it's 2015, why does copy paste in linux suck so fucking much grr [19:05:13] I thought I did an example run at some point and saved the log >_> [19:05:29] -:ALL EXCEPT (project-tools) root:ALL [19:05:36] so we need to switch that away I guess... [19:05:38] yeah [19:05:44] I think I mentioned that at some point :P [19:05:46] into being in the ssh lookup logic [19:05:48] yeah [19:05:50] you did [19:06:14] I wonder if that'll cause any ancilliary security issues [19:06:28] so we can either add all tools to project-tools somehow [19:06:36] that would actually be the nicest option [19:06:53] yeah [19:07:06] if we switch this away into the ssh lookup script [19:07:09] then the PAM conf can just be "-ALL EXCEPT (project-tools) root:ALL" on normal hosts and "-ALL EXCEPT (tools.admin) root:ALL" on special hosts [19:07:20] yeah [19:07:34] the alternative is clearing the PAM config on normal hosts and just having the special host one [19:07:36] we can probably reconfigure wikitech to do that [19:07:51] in any case, it's currently a big mess of different configs that overwrite eachother in puppet [19:07:56] well clearing the PAM config and making sure the ssh lookup thing is solid [19:12:11] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [19:12:24] that might be my fault [19:13:40] oh [19:13:43] 
heh [19:13:52] manual puppet run and stuff [19:14:28] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3labs-sprint-116, and 2 others: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1758942 (10yuvipanda) ok, so testing the problem now seems to be that we have PAM configured to only allow users in the group 'project.tools' to ssh in and too... [19:14:45] andrewbogott: do you think we can add service accounts to the project group too? [19:14:54] andrewbogott: so tools.sometging would be part of project.tools [19:15:20] YuviPanda: I don’t know why not… what for? [19:15:40] andrewbogott: basically we're trying to allow ssh as tool accoutns directly [19:15:52] andrewbogott: and PAM config says 'if you are not a member of this group, piss off' [19:16:55] currently, the tool users are only in their own group [19:16:57] how will the keys be managed in this case? [19:17:16] any of the project group members can log in [19:17:20] wait if we do that, tools.admin is member of project-tools, then does having g+w set on anything owned by them make it world writeable? [19:17:22] or am I just confused [19:17:36] andrewbogott: yeah, it just looks up the service group members and allows login by any of their keys [19:17:43] YuviPanda: no, the g+w is for the group set on the file or dir [19:17:48] buuuut [19:17:53] YuviPanda: oh, that’s handy [19:18:01] it's probably important that the default group is tools.X and not project-tools [19:18:06] so, yeah, just try adding one of the service groups by hand to the project and see if that makes pam happy [19:18:08] valhallasw`cloud: yup [19:18:23] valhallasw`cloud: but that's required anyway [19:18:29] valhallasw`cloud: since all *users* are members of project-tools [19:18:56] eeeeh. Not sure if I get what you mean [19:18:57] the default group for users is svn, not project-tools [19:19:50] valhallasw`cloud: right I'm saying default group can't be project-tools for anyone... 
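The access rule quoted at 19:05, `-:ALL EXCEPT (project-tools) root:ALL`, is a deny rule in the `permission:users:origins` layout, with the parenthesised name standing for a group. The sketch below is a deliberately simplified model of how such a rule treats a login, written to show why `tools.admin` was refused until it was added to the project group. It is not pam_access itself: origins are ignored and only one `EXCEPT` clause is handled.

```python
def parse_rule(rule):
    """Split 'permission:users:origins' into its three fields."""
    permission, users, origins = (field.strip() for field in rule.split(":", 2))
    return permission, users.split(), origins.split()


def _matches_any(tokens, user, groups):
    """True if the user or one of their groups appears in the token list."""
    for token in tokens:
        if token == "ALL":
            return True
        if token.startswith("(") and token.endswith(")"):
            if token[1:-1] in groups:
                return True
        elif token == user:
            return True
    return False


def is_denied(rule, user, groups):
    """Simplified model: does this '-' rule deny the login?

    Assumes the origins field matches (it is ALL in the quoted rule).
    """
    permission, tokens, _origins = parse_rule(rule)
    if "EXCEPT" in tokens:
        split_at = tokens.index("EXCEPT")
        matched = (_matches_any(tokens[:split_at], user, groups)
                   and not _matches_any(tokens[split_at + 1:], user, groups))
    else:
        matched = _matches_any(tokens, user, groups)
    return permission == "-" and matched


if __name__ == "__main__":
    rule = "-:ALL EXCEPT (project-tools) root:ALL"
    print(is_denied(rule, "tools.admin", {"tools.admin"}))                   # service user, own group only
    print(is_denied(rule, "tools.admin", {"tools.admin", "project-tools"}))  # after joining the project group
```

In this model the rule stays untouched and only group membership changes, which matches the approach tested in the channel; scfc's alternative of a separate service-group group would instead mean adding a second exception to the rule.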
[19:20:13] andrewbogott: uh, I guess usermod won't work in this case... [19:20:18] YuviPanda: ldapvi [19:20:34] ah ok [19:20:38] terbium I guess [19:20:44] YuviPanda: I don't get it, but it's not that relevant either :-) [19:20:57] valhallasw`cloud: yeah I guess I'm maybe trying to say it's not relevant [19:21:06] if the default group isn't tool.sometging we're screwed anyway [19:21:46] *nod* [19:22:08] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:32:54] valhallasw`cloud: andrewbogott that works [19:32:59] \o/ [19:33:00] valhallasw`cloud: can you try sshing as tools.admin [19:33:02] ? [19:33:05] ya [19:33:13] legoktm: can you also try sshing as tools.admin@tools-dev.wmflabs.org? [19:33:19] (to meke sure *that* doesn't work) [19:33:29] oh, tools-dev [19:33:33] valhallasw`cloud: yeah [19:33:42] valhallasw`cloud: applied only to one host atm :) [19:33:44] tools.admin@tools-bastion-02:~ [19:33:45] \o/ [19:33:52] \o/ [19:33:55] hey YuviPanda, can we get an x-large instance in analytics-project? [19:34:00] right now it's not letting us make one [19:34:18] (I juggled over the weekend with valhallasw's amazing help [19:34:22] milimetric: probably. gimme a sec, let me look [19:34:25] thx much [19:34:32] milimetric: ah, did the symlinking work in the end? [19:34:42] valhallasw`cloud: dude, I now owe you twice [19:35:00] milimetric: you're probably running out of quota. 
let me see which one you're hitting [19:35:02] once for wikimetrics which you single-handedly rescued when it was a baby [19:35:12] and twice for this weekend - brilliant symlink thing worked [19:35:29] valhallasw`cloud: http://druid.wmflabs.org/pivot so that has 1 day of pageview data [19:35:34] :D [19:36:13] the UI is alpha (not ours) but you can query with pseudo-sql like: curl -L -H'Content-Type: application/json' -XPOST --data '{ "query":"SELECT SUM(view_count) FROM `pageviews-per-article` WHERE access=\"mobile app\" AND agent=\"user\" AND \"2015-10-13T00:00:00\" <= time AND time < \"2015-10-15T00:00:00\" GROUP BY TIME_BUCKET(time, PT1H, \"Etc/UTC\")" }' [19:36:13] http://druid.wmflabs.org/plyql [19:36:26] sorry, one line: [19:36:27] curl -L -H'Content-Type: application/json' -XPOST --data '{ "query":"SELECT SUM(view_count) FROM `pageviews-per-article` WHERE access=\"mobile app\" AND agent=\"user\" AND \"2015-10-13T00:00:00\" <= time AND time < \"2015-10-15T00:00:00\" GROUP BY TIME_BUCKET(time, PT1H, \"Etc/UTC\")" }' http://druid.wmflabs.org/plyql [19:36:27] ah just cores. let me raise [19:36:38] thx much YuviPanda [19:37:01] valhallasw`cloud: do you come to WMF events or community events? I don't remember if we've ever met in meatspace [19:37:08] !log analytics upgraded quota for cores from 40 to 56 per milimetric's request [19:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Analytics/SAL, Master [19:37:12] europe hackathons, mostly [19:37:17] and london 'mania [19:37:32] milimetric: you should also get folks to cleanup old unused instances in that project if any exist. [19:37:34] ah, must've missed each other. Are you going to the Italy wikimania? [19:37:43] YuviPanda: we clean aggressively at all times [19:37:45] milimetric: but you should be able to create I think 2 xlarges now [19:37:47] probably [19:37:48] milimetric: cool! [19:37:52] cool, thx [19:37:54] milimetric: try and tell me if it fails? 
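The curl call pasted at 19:36 has to backslash-escape every double quote because the query is embedded by hand in a JSON string. A minimal Python sketch of the same request, letting `json.dumps` do the quoting; the URL and query text are copied from the log, while `build_payload` and `post_query` are hypothetical helper names and the response format is not assumed:

```python
import json
import urllib.request

PLYQL_URL = "http://druid.wmflabs.org/plyql"

# Same query as in the log, written without manual JSON escaping.
QUERY = (
    'SELECT SUM(view_count) FROM `pageviews-per-article` '
    'WHERE access="mobile app" AND agent="user" '
    'AND "2015-10-13T00:00:00" <= time AND time < "2015-10-15T00:00:00" '
    'GROUP BY TIME_BUCKET(time, PT1H, "Etc/UTC")'
)


def build_payload(query):
    """Wrap the query in the {"query": ...} JSON body used by the curl call."""
    return json.dumps({"query": query}).encode("utf-8")


def post_query(query, url=PLYQL_URL, timeout=30):
    """POST the query and return the raw response text."""
    request = urllib.request.Request(
        url,
        data=build_payload(query),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Prints the request body only; call post_query(QUERY) to hit the service.
    print(build_payload(QUERY).decode("utf-8"))
```

Printing the payload first is a cheap way to confirm it matches the body of the curl command before sending anything.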
[19:38:32] milimetric: we all should just buy like a big keg of beer and send it to valhallasw`cloud [19:38:35] cool, well, in Italy we'll have to take you out man, really! [19:38:36] :) [19:38:44] Italy, so wine, probably ;-) [19:39:44] awesome, looking forward to it [19:39:56] YuviPanda: worked, thx much [19:40:32] 6Labs, 10Tool-Labs: Improve puppet run time - https://phabricator.wikimedia.org/T116813#1759073 (10valhallasw) 3NEW [19:40:54] YuviPanda: look, a pretty useful graph! [19:43:06] valhallasw`cloud: nice [19:43:53] valhallasw`cloud: I guess finding a way to get all the packages in exec and dev environs 'go' in one go is going to be the win if we want one [19:44:06] milimetric: cool [19:44:10] Probably. Needs more stats ;-) [19:45:38] 6Labs, 10Tool-Labs: Add service group users to the project's group by default - https://phabricator.wikimedia.org/T116815#1759101 (10yuvipanda) 3NEW [19:45:39] YuviPanda: heh. 77 seconds in Package[] [19:45:43] andrewbogott: https://phabricator.wikimedia.org/T116815?workflow=113979 [19:46:00] valhallasw`cloud: :D [19:46:12] 6Labs, 10Tool-Labs: Improve puppet run time - https://phabricator.wikimedia.org/T116813#1759107 (10valhallasw) 77 seconds spent in `Package[]`, so that's where the big gains are... [19:55:22] YuviPanda: [19:55:23] km@km-tp ~> ssh tools.admin@tools-dev.wmflabs.org [19:55:23] Permission denied (publickey,hostbased). [20:13:08] 6Labs, 10Tool-Labs: Improve puppet run time - https://phabricator.wikimedia.org/T116813#1759214 (10valhallasw) So, as far as the peaks are concerned: for the leftmost peak (<0.01 seconds): these are cases where apt-cache is not invoked at all. For example: ``` ^[[0;32mInfo: /Stage[main]/Packages::Python_redis... [20:13:23] YuviPanda: so ensure => latest suxxors [20:13:50] YuviPanda, akosiaris, do you know how i can access labsdb1004.eqiad.wmnet postgres? [20:18:26] legoktm: cool [20:18:31] valhallasw`cloud: yeah [20:19:04] YuviPanda: and that's mostly because puppet is being stupid. 
/usr/bin/apt-cache policy ".*" takes 6 seconds, and that gets you a list of /all/ packages that /can/ be installed [20:19:10] instead of 0.25s per package [20:19:24] Labs: "library-upgrader" labs project request - https://phabricator.wikimedia.org/T116697#1759252 (yuvipanda) Open>Resolved a:yuvipanda Done [20:19:25] Labs, Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#1759255 (yuvipanda) [20:20:23] Labs, Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#799068 (yuvipanda) [20:20:25] Labs, Discovery, WMDE-Analytics-Engineering, Wikidata, Wikidata-Query-Service: Wikidata Metrics Labs project - https://phabricator.wikimedia.org/T115120#1759257 (yuvipanda) Open>Resolved a:yuvipanda I've created the project but named 'wikidata-metrics' to be clearer. [20:23:33] Labs: Two small instances: for WikiToLearn development - https://phabricator.wikimedia.org/T115282#1759273 (yuvipanda) Ok, so two things: # You can create a large instance, I think there's enough quota for that by default # I unfortunately don't think that's enough reason to provide a public IP - those are i... [20:27:55] Krenair: going to do your patches now [20:28:06] about to go into a meeting, so good luck [20:29:27] Krenair: hmm then I'll just do the puppet one and leave the package one [20:32:31] Krenair: merged and looks good to me [20:33:02] Coren: hey! do we need to do any prep for the NFS outage tomorrow? [20:33:11] Coren: specifically, is labstore1001 alright? [20:33:22] Coren: in terms of puppet failures and things that might be residually there? [20:33:42] YuviPanda: It was, though there is little actual testing of storage that can be done without... well, the actual storage. :-) [20:33:51] Coren: sure but there's everything else [20:33:58] That said, the box is a blank slate over puppet. [20:33:59] Coren: also did the post actually make it to labs-l? [20:34:09] YuviPanda: It did to me.
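Editor's note on the [20:19:04] to [20:19:10] point: a single `apt-cache policy` call covering many packages (about 6 seconds in the log) can replace one call per package (about 0.25 seconds each). A minimal sketch of the parsing side follows. It is not Puppet's package provider. It assumes the usual `apt-cache policy` layout of an unindented `name:` line followed by indented `Installed:` and `Candidate:` lines. The function name `parse_apt_policy` and the sample versions are made up for illustration.

```python
def parse_apt_policy(text):
    """Map package name -> {"installed": ..., "candidate": ...}.

    Assumes the layout printed by `apt-cache policy <names>`: an unindented
    "name:" header per package, then indented detail lines.
    """
    packages = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace() and line.rstrip().endswith(":"):
            # Unindented header line starts a new package section.
            current = line.rstrip()[:-1]
            packages[current] = {"installed": None, "candidate": None}
        elif current is not None:
            stripped = line.strip()
            for field in ("Installed", "Candidate"):
                prefix = field + ":"
                if stripped.startswith(prefix):
                    packages[current][field.lower()] = stripped[len(prefix):].strip()
    return packages


# Illustrative sample only; package versions and the mirror URL are invented.
SAMPLE = """\
python-redis:
  Installed: (none)
  Candidate: 2.10.1-1
  Version table:
     2.10.1-1 0
        500 http://example.invalid/debian stable/main amd64 Packages
puppet-lint:
  Installed: 1.1.0-1
  Candidate: 1.1.0-1
  Version table:
 *** 1.1.0-1 0
        100 /var/lib/dpkg/status
"""

policy = parse_apt_policy(SAMPLE)
# Packages whose installed version differs from the candidate would need action.
outdated = sorted(name for name, v in policy.items() if v["installed"] != v["candidate"])
```

With one parsed result in hand, the per-package lookups become dictionary reads instead of separate process launches.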
[20:34:15] ah ok [20:34:20] hmm it's not in my inbox for some reason >_> [20:34:23] only from announce [20:34:56] Maybe your inbox doesn't show dup message-ids? [20:35:15] Coren: it currently has failing puppet [20:35:41] due to lack of the backports repo I believe [20:36:13] Coren: can you take a look at that? [20:36:25] Ah; yeah. It's minor but needs fixing first. [20:36:38] yup [20:42:27] Labs, Tool-Labs, operations: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1759311 (yuvipanda) NEW [20:59:09] Labs, Labs-Infrastructure, operations, ops-eqiad, labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1759380 (chasemp) andrew tracked down https://wikitech.wikimedia.org/wiki/HP_DL360Gen9#Embedded_user_partition Which seems like: http://www8.hp.com/... [20:59:50] Labs, Tool-Labs, Patch-For-Review, labs-sprint-116, and 2 others: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1759389 (scfc) I'd prefer if we don't combine human users and service groups in one group, but create another (`project-tools-servicegroups`?) and open `acce... [21:11:17] Labs, Tool-Labs, operations: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1759445 (yuvipanda) >>! In T113979#1759389, @scfc wrote: > I'd prefer if we don't combine human users and service groups in one group, but create another (`project-tools... [21:11:44] Labs, Tool-Labs: Add service group users to the project's group by default - https://phabricator.wikimedia.org/T116815#1759450 (yuvipanda) >>! In T113979#1759389, @scfc wrote: > I'd prefer if we don't combine human users and service groups in one group, but create another (`project-tools-servicegroups`?)...
[21:11:54] Labs, Tool-Labs: Add service group users to the project's group by default - https://phabricator.wikimedia.org/T116815#1759452 (yuvipanda) That seems appropriate. Changing title to match [21:13:09] Labs, Tool-Labs: Add all service group users to a project-servicegroups-$project by default - https://phabricator.wikimedia.org/T116815#1759453 (yuvipanda) [21:13:27] Labs, Tool-Labs, Patch-For-Review, labs-sprint-116, and 2 others: Allow direct ssh access to tools - https://phabricator.wikimedia.org/T113979#1759457 (yuvipanda) @scfc I agree with your comment, have moved it to T116815 [21:41:09] Coren: andrewbogott https://gerrit.wikimedia.org/r/#/c/249300/ makes a tools home page failure paging. Can I get a +1 before merging? [21:42:27] I'm going to go prep to go to the office now [22:08:49] YuviPanda, what package one? [22:08:59] invisible unicorn [22:09:20] oh [22:09:26] I'm around now [22:09:40] yeah but I gotta go for a bit :( [22:09:44] ok [22:10:15] Krenair: also take a look at https://phabricator.wikimedia.org/T116815 maybe? [22:10:45] * YuviPanda goes afk for a bit [22:12:15] YuviPanda, I suppose I could find a way to implement that [22:12:25] But I don't have time to commit to new things at the moment [22:12:28] Krenair: that would be pretty awesome. [22:12:33] Krenair: yeah, understood. [22:12:37] so no guarantees [22:12:44] ok :) [22:13:26] Labs, Labs-Infrastructure, operations, ops-eqiad, labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1759822 (chasemp) Open>Resolved Great all set. I did the puppet stuff and booted one vm manually on each to verify. Seems solid. great team eff...
[22:16:57] MediaWiki-extensions-OpenStackManager, Echo, Collaboration-Team-Current: Write presentation models for notifications in OpenStackManager - https://phabricator.wikimedia.org/T116853#1759848 (Catrope) NEW [22:24:24] MediaWiki-extensions-OpenStackManager, Echo, Collaboration-Team-Current: Write presentation models for notifications in OpenStackManager - https://phabricator.wikimedia.org/T116853#1759881 (Krenair) Honestly I don't think build and reboot notifications work at all any more. Used to be triggered by a m... [22:55:25] Labs, Tool-Labs, operations, Icinga, Monitoring: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1760039 (Dzahn) [22:59:40] Labs, Labs-Infrastructure, operations, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760065 (Dzahn) let me steal this for a moment, to check out why they became status UNKNOWN. i said earlier i would but didn't get to it yet. [22:59:45] Labs, Labs-Infrastructure, operations, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760068 (Dzahn) a:Andrew>Dzahn [23:06:30] Labs, Tool-Labs, operations, Icinga, Monitoring: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1760122 (Dzahn) Yuvipanda did it here: https://gerrit.wikimedia.org/r/#/c/249292/ exists now along with the SSL expiry check on the same virt...
[23:15:54] Labs, Tool-Labs, operations, Icinga, and 2 others: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1760169 (Dzahn) here's the part if it should send SMS to all of ops: https://gerrit.wikimedia.org/r/#/c/249300/ [23:16:09] Labs, Tool-Labs, operations, Icinga, and 2 others: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1760170 (Dzahn) a:yuvipanda [23:39:07] Labs, Labs-Infrastructure, operations, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760227 (Dzahn) We either need to make this an NRPE task to be executed on the monitored hosts where the certs are, or we need to copy the cert to...