[00:28:50] oki
[00:29:03] is it a chat room, or something
[01:11:04] YuviPanda: irccloud didn’t like me today so I’ve no backscroll… can you (again) link me to the incident timeline from earlier?
[01:11:26] I promised myself I would think about this at least once more before giving up
[01:12:26] andrewbogott: moment
[01:34:07] 10PAWS: PAWS with bot accounts - https://phabricator.wikimedia.org/T120558#1861157 (10Legoktm) Given that MW doesn't have any way to link master account with bot account, I don't think PAWS should try either. Users should just use OAuth to login with the bot account IMO. I did that a few days ago and it worked o...
[01:49:04] andrewbogott: um
[01:49:07] > PING tools-worker-05.tools.eqiad.wmflabs (10.68.16.174) 56(84) bytes of data.
[01:49:12] > 64 bytes from ci-jessie-wikimedia-7821.contintcloud.eqiad.wmflabs.contintcloud.eqiad.wmflabs (10.68.16.174): icmp_seq=2 ttl=64 time=0.549 ms
[01:50:23] andrewbogott: same for another node
[01:50:37] andrewbogott: these weren't created during the instance creation outage...
[01:51:00] the ci boxes might have been
[01:51:36] andrewbogott: will that take up ips already allocated?
[01:52:01] I wouldn’t think so
[01:52:08] yeah
[01:52:20] andrewbogott: so I know that -05 and -07 were working yesterday before the outage...
[01:54:32] that instance (7821) doesn’t exist anymore
[01:54:37] and plus, what’s with that run-on name?
[01:54:44] :(
[01:54:44] contintcloud.eqiad.wmflabs.contintcloud.eqiad.wmflabs
[01:54:48] yeah
[01:55:17] guess I’ll dive into the designate db
[02:07:49] 6Labs, 7Nodepool: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1861191 (10Andrew) 3NEW
[02:08:30] 6Labs, 7Nodepool: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1861204 (10Andrew) It's pretty clear that at some point in the process nodepool is using the fqdn (ci-jessie-wikimedia-11345.contintcloud.eqiad.wmflabs) when it should be using just the name.
[02:08:47] 6Labs, 7Nodepool: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1861205 (10Andrew)
[02:10:18] 10PAWS: PAWS with bot accounts - https://phabricator.wikimedia.org/T120558#1861217 (10jayvdb) The same problem exists for sysop accounts - i.e. the user-config.py `sysopnames` cannot be used in PAWS. The current approach allows people to create/modify `user-config.py` in their $HOME. It would be good to prev...
[02:11:23] 6Labs, 10Tool-Labs: Dead link for "Directory NG" on ad bar - https://phabricator.wikimedia.org/T120793#1861219 (10Catrope)
[02:12:14] 10PAWS: PAWS with bot accounts - https://phabricator.wikimedia.org/T120558#1861221 (10yuvipanda) If you look at the current user-config file (https://github.com/yuvipanda/paws/blob/master/singleuser/user-config.py) it already does this by setting usernames and other things *after* the $HOME/user-config.py file i...
[03:01:21] YuviPanda: better?
[03:01:51] checking
[03:02:14] ping gives me the right rdns but I still can't ssh
[03:03:19] andrewbogott: tools-worker-07 has wrong rdns, but I can ssh
[03:03:53] YuviPanda: I have a migraine and can’t really work on this anymore.
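The bug Andrew describes above (nodepool passing the FQDN where the bare instance name is expected, so the zone gets appended twice) can be sketched as a normalization step. Everything here is hypothetical illustration, not designate's or nodepool's actual code:

```python
# Hypothetical sketch: if a caller passes the FQDN where the bare name is
# expected, blindly appending the zone again produces the doubled domain
# seen above. Normalizing first avoids that.

ZONE = "contintcloud.eqiad.wmflabs."

def record_name(hostname, zone=ZONE):
    """Build the DNS record name, stripping the zone first if the caller
    already qualified the hostname (function and zone are assumptions)."""
    short = hostname.rstrip(".")
    suffix = "." + zone.rstrip(".")
    if short.endswith(suffix):
        short = short[:-len(suffix)]
    return "%s.%s" % (short, zone)

# Both the bare name and the (buggy) FQDN normalize to one record:
assert record_name("ci-jessie-wikimedia-11345") == \
    "ci-jessie-wikimedia-11345.contintcloud.eqiad.wmflabs."
assert record_name("ci-jessie-wikimedia-11345.contintcloud.eqiad.wmflabs") == \
    "ci-jessie-wikimedia-11345.contintcloud.eqiad.wmflabs."
```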
[03:04:25] Your mission, should you choose to accept it, is to take those two ‘latest’ files and separate out all the leaks
[03:04:39] (basically just selecting only the contintcloud entries for instances that don’t exist)
[03:05:00] andrewbogott: :( take care! I'll see if I can dig into it..
[03:05:06] and then produce a file that consists of just “ "
[03:05:12] for all leaked records
[03:05:20] then tomorrow I’ll try to figure out some automated way to delete them
[03:05:21] there are hundreds
[03:05:32] oh wow.
[03:05:34] ok
[03:05:40] Probably I’ll have to hack and/or fix the designate source to do this
[03:05:46] we aren’t leaking now though
[03:05:57] I have a vague recollection of knowing about/understanding the leak
[03:05:57] that's good!
[03:06:02] but right now too foggy to remember
[03:06:30] andrewbogott: go away!
[03:07:03] 6Labs, 10Labs-Infrastructure: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797#1861262 (10Andrew) 3NEW a:3yuvipanda
[03:07:15] For your records :)
[03:07:23] Thanks — catch you later
[07:34:39] PROBLEM - Host tools-worker-04 is DOWN: CRITICAL - Host Unreachable (10.68.16.122)
[07:34:53] yes that one is actually down
[07:55:47] 6Labs, 6Discovery, 10Maps: Enable OSM Postgres machine access in labs - https://phabricator.wikimedia.org/T98382#1861466 (10akosiaris) 5Open>3Resolved a:3akosiaris I think this is done. Documentation is in https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Connecting_to_OSM_via_the_official_CL...
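The cleanup task described above (keep only the contintcloud records whose instances no longer exist) could look roughly like the sketch below. The record format, the filter logic, and all names are assumptions, not the actual script that was written:

```python
# Hypothetical sketch of the leak-selection step: given a dump of DNS
# record names and the set of instance short names that still exist,
# return the leaked contintcloud entries.

def leaked_records(records, live_instances):
    """records: iterable of DNS record names (FQDNs).
    live_instances: set of short instance names that still exist."""
    leaks = []
    for name in records:
        if ".contintcloud." not in name:
            continue  # only the contintcloud zone is of interest here
        short = name.split(".")[0]
        if short not in live_instances:
            leaks.append(name)
    return leaks

records = [
    "ci-jessie-wikimedia-7821.contintcloud.eqiad.wmflabs.",
    "ci-jessie-wikimedia-11418.contintcloud.eqiad.wmflabs.",
    "tools-worker-05.tools.eqiad.wmflabs.",
]
live = {"ci-jessie-wikimedia-11418"}
assert leaked_records(records, live) == [
    "ci-jessie-wikimedia-7821.contintcloud.eqiad.wmflabs.",
]
```

The live-instance set would come from something like `openstack server list` for the contintcloud project; the deletion itself (the part Andrew says needed a designate hack) is not sketched here.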
[11:39:58] 6Labs, 10Labs-Infrastructure, 6operations, 7Puppet: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1861849 (10hashar) 3NEW
[12:59:55] 10Tool-Labs-tools-Other, 7Epic: Convert all Labs tools to use cdnjs for static libraries and fonts - https://phabricator.wikimedia.org/T103934#1861960 (10Ricordisamoa)
[13:26:06] 6Labs, 10MediaWiki-extensions-Newsletter: Internal error when creating new user in newsletter-test.wmflabs.org - https://phabricator.wikimedia.org/T119945#1862024 (1001tonythomas) root@newsletter-test:/# du -hs /srv/ 1.3G /srv/
[13:29:49] Can someone help me with mediawiki vagrant? I cannot access the mysql database
[13:37:44] 6Labs, 10Tool-Labs, 5Patch-For-Review: Redirect //stable.toolserver.org/geohack/geohack.php requests - https://phabricator.wikimedia.org/T120526#1862036 (10coren) The server already catches *.toolserver.org; so the change sufficed. Note that only HTTP is redirected (the certificate does not cover stable.too...
[13:37:53] 6Labs, 10Tool-Labs, 5Patch-For-Review: Redirect //stable.toolserver.org/geohack/geohack.php requests - https://phabricator.wikimedia.org/T120526#1862037 (10coren) 5Open>3Resolved
[13:46:51] bd808: Are you here?
[14:01:36] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Investigate decommissioning labcontrol2001 - https://phabricator.wikimedia.org/T118591#1862064 (10mark) a:5mark>3None labcontrol2001 is out of warranty, so we're not going to repurpose it for production use. If you want to continue using it for dev/testing...
[14:06:40] 6Labs, 10Tool-Labs, 5Patch-For-Review: Redirect //stable.toolserver.org/geohack/geohack.php requests - https://phabricator.wikimedia.org/T120526#1862067 (10Nemo_bis) Thanks!
[14:29:47] 6Labs, 7Nodepool: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1862090 (10hashar) Indeed the DNS has the domain name twice:
```
$ dig -x 10.68.21.236
ci-jessie-wikimedia-11400.contintcloud.eqiad.wmflabs.contintcloud.eqiad.wmflabs.
$
```
The page generated o...
[14:37:51] 6Labs, 10Labs-Infrastructure: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797#1862134 (10chasemp) p:5Triage>3High
[14:38:08] YuviPanda: what's the story on https://phabricator.wikimedia.org/T120797?
[14:39:19] 6Labs, 10Tool-Labs: Dead link for "Directory NG" on ad bar - https://phabricator.wikimedia.org/T120793#1862139 (10chasemp) p:5Triage>3Normal
[14:40:59] 6Labs, 7Nodepool, 5Patch-For-Review: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1862146 (10chasemp) p:5Triage>3Normal a:3hashar
[14:42:03] 6Labs, 7Nodepool, 5Patch-For-Review: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1862147 (10hashar) From a random instance: ci-jessie-wikimedia-11406 | /etc/hostname | ci-jessie-wikimedia-11406 | hostname --fqdn | ci-jessie-wikimedia-11406.contintcloud.eq...
[14:43:25] 6Labs, 10Labs-Infrastructure: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797#1862150 (10Andrew) It required a hack in the designate-api source, but I cleaned up all those leaked contintcloud records, and we don't seem to be leaking any new ones. There are probably a few other le...
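The doubled-zone PTR names hashar shows above have a mechanical signature: the last three labels repeat. A small check like the following could flag them in a dump; the three-label zone assumption (project.site.wmflabs) and the function name are mine, not part of any real tooling:

```python
# Hypothetical detector for the doubled-zone reverse-DNS names seen above,
# assuming zones are always three labels deep (project.site.wmflabs).

def has_doubled_zone(ptr_name):
    """True if the name repeats its zone, e.g.
    'host.contintcloud.eqiad.wmflabs.contintcloud.eqiad.wmflabs.'"""
    labels = ptr_name.rstrip(".").split(".")
    if len(labels) < 6:
        return False  # too short to contain the zone twice
    return labels[-6:-3] == labels[-3:]

assert has_doubled_zone(
    "ci-jessie-wikimedia-11400.contintcloud.eqiad.wmflabs."
    "contintcloud.eqiad.wmflabs.")
assert not has_doubled_zone("tools-worker-05.tools.eqiad.wmflabs.")
```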
[14:43:41] 6Labs, 10Labs-Infrastructure: Clean up leaked designate entries - https://phabricator.wikimedia.org/T120797#1862151 (10Andrew) a:5yuvipanda>3None
[14:46:36] 6Labs, 7Nodepool, 5Patch-For-Review: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1862156 (10hashar) Change got deployed `openstack server list`:
| ci-jessie-wikimedia-11418 | ACTIVE | public=10.68.21.250 |
| ci-jessie-wikimedia...
[14:52:54] 6Labs, 10Labs-Infrastructure: Update /etc/hosts during labs instance first boot - https://phabricator.wikimedia.org/T120830#1862183 (10Andrew) 3NEW
[14:58:09] 6Labs, 7Nodepool, 5Patch-For-Review: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1862192 (10hashar) The Nodepool instances rely on cloud-init and the Ec2 metadata service. And the hostname is exposed without the tenant name: ``` $ curl http://169.254.169....
[15:00:21] I can't access any instance via bastion-01. Can someone help?
[15:00:41] oh, sorry my fault
[15:02:58] 6Labs, 10Labs-Infrastructure, 6operations, 7Puppet: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1862208 (10Andrew) We moved the default classes out of the ldap node def and into hiera; this is probably a side-effect of that....
[15:03:21] 6Labs, 7Nodepool, 5Patch-For-Review: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1862209 (10hashar) 5Open>3Resolved Wikitech shows https://wikitech.wikimedia.org/wiki/Nova_Resource:Ci-jessie-wikimedia-11420.contintcloud.eqiad.wmflabs | Instance Name |...
[15:03:33] anyone have any experience optimizing a mediawiki instance on tools to not be dog slow?
[15:04:05] 6Labs, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 7Nodepool, and 2 others: weird double-domained DNS entries for nodepool nodes - https://phabricator.wikimedia.org/T120792#1862211 (10hashar)
[15:07:59] thedj: cache everything? :-}
[15:08:18] https://www.mediawiki.org/wiki/Manual:Performance_tuning && https://www.mediawiki.org/wiki/Manual:Cache
[15:11:58] http://tools.wmflabs.org/hartman/mediawiki-dev/index.php?title=Main_Page
[15:12:21] currently uses: $wgMainCacheType = CACHE_ACCEL;
[15:12:22] $wgParserCacheType = CACHE_DB;
[15:14:05] Is the Manage Web Proxies thing broken right now? I can't see any proxies for shiny-r project.
[15:20:28] bearloga: I can see the proxies for my project
[15:23:39] Looks like I can't manage instances either :\
[15:25:53] bearloga: Maybe someone removed you from that project?
[15:27:29] Luke081515: maybe? I'm still listed as an admin on https://wikitech.wikimedia.org/wiki/Nova_Resource:Shiny-r but idk if that list is dynamic or manual
[15:28:40] this list is automatic. Maybe we can ask other members of your project, I guess this is a project-specific problem
[15:28:47] YuviPanda: Can you take a look?
[15:34:47] usually if I have the same issue a logout/login works
[15:34:56] the authorization handling on wikitech has issues
[15:38:00] chasemp: will try that, thanks
[15:42:25] 6Labs, 10DBA, 5Patch-For-Review: watchlist table not available on labs - https://phabricator.wikimedia.org/T59617#1862328 (10coren) a:3jcrespo Code for the view is added, all that is missing now is the underlying data replication. @jcrespo: at your convenience, please add the `watchlist` table to replicat...
[15:45:41] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review, 7Puppet: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1862336 (10chasemp) p:5Triage>3High
[15:48:54] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review, 7Puppet: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1862351 (10hashar) a:3chasemp
[15:50:49] chasemp: you were right! logout/login worked! thanks!
[15:51:54] andrewbogott: Coren: for the LDAP stuff, neither integration nor deployment-prep had the pam fix run (afaik)
[15:51:58] not sure if it is an issue
[15:52:21] hashar: in theory we’re going to alias the old ldap servers to the new ones once we trust them
[15:52:28] so it shouldn’t matter
[15:53:01] hashar: The PAM fix should not affect anything, though self-hosted puppetmasters probably want to have it run sooner rather than later as config will diverge increasingly until then.
[15:53:26] hashar: But I thought deployment-prep used the normal puppetmaster anyways?
[15:53:30] will have to remember to poke #releng folks about it
[15:53:42] role::self::puppetmaster or something yeah
[15:53:52] * Coren checks.
[15:53:54] with puppetmaster being deployment-puppetmaster
[15:54:04] integration is similar, pm being integration-puppetmaster
[15:54:23] Aha! But their puppetmaster wisely pulls at regular intervals so they got the changes. :-)
[15:55:02] well only sort of, as instances lost the labs:instance class definition by default at some point I guess....
[15:55:04] yeah they are both set to auto-rebase
[15:55:12] though the script has not been run on instances
[15:56:06] hashar: Hm. If it rebases often enough, it will have. I did a salt run that checked for the update having been done and that ran the script. Lemme see...
[15:57:05] hashar: Ah, no they haven't [run the script].
[15:57:30] That said, I can simply do another salt run and it'll catch them.
[15:57:48] Since they got the required things via puppet.
[15:59:32] Coren: both projects also have their own salt masters :D
[15:59:44] Heh. lulz.
[15:59:47] anyway I rebased deployment-prep this morning
[15:59:53] checking integration right now
[16:03:58] both up-to-date \O/
[16:04:12] though puppet is broken/disabled on some instances
[16:06:54] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review, 7Puppet: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1862387 (10Andrew) I think the ENC is broken. It should be merging custom hiera settings with the default r...
[16:14:30] hashar: Alright; it's not a disaster since the changes have no semantic effect now - but it's important that it be done eventually.
[16:15:01] hashar: The biggest flaw of the old way of doing things is that it misses out on any security changes coming from upstream.
[16:15:43] (It also, less importantly but very visibly, breaks auxiliary things - that is why motd broke on Trusty)
[16:23:53] Coren: so not too much to worry about and that doesn't seem super urgent. Will look at running it this week
[16:24:08] hashar: Definitely not urgent.
[16:30:26] YuviPanda: Once you are awake and lucid, ping me for a brief conversation re puppet cleanup of lab roles.
[16:31:49] Coren: will you join us for our ldap window in 30 minutes? The ‘LDAP user tools’ section could use a volunteer. https://etherpad.wikimedia.org/p/opendj-migration
[16:32:34] andrewbogott: Sounds like a plan to me.
[16:33:56] thanks. I just now duplicated that section since I think we’re going to have separate prod/labs phases
[16:37:04] andrewbogott: Got a question in the labs section though; what do you mean by "Change /etc/ldap/ldap.conf on all labs instances"?
[16:39:21] Coren: that’s redundant with the puppet patch, you can remove that line
[17:16:53] is there someone here who can help me with mediawiki vagrant?
[17:21:05] I'm trying to create a new web proxy for a labs instance, and I keep getting a 'failed' message.
[17:21:59] ragesoss: there's an ongoing migration of the LDAP servers, they are currently set to read-only
[17:22:16] okay, cool.
[17:22:29] so I should just wait until that is over and try again.
[17:22:52] ragesoss: yep — it’ll be a bit, maybe try in 90 mins or so
[17:23:07] thanks much!
[17:23:26] see https://lists.wikimedia.org/pipermail/wikitech-l/2015-December/084207.html for further details
[17:25:37] !log rcm wiki-rcm: synchronized config at 16:15 UTC
[17:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL, Master
[17:34:39] !log ores Deployed 1ad37c5 with ores:b745570, revscoring:, editquality:b41b7c1, wikiclass:bbfa9ce, and wb-vandalism:1075596
[17:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master
[18:03:46] !log help
[18:03:46] I am a logbot running on tools-exec-1219.
[18:03:46] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[18:03:46] To log a message, type !log .
[18:03:47] just looking
[18:16:51] It doesn't listen to everyone, does it?
[18:16:54] !log help
[18:16:55] I am a logbot running on tools-exec-1219.
[18:16:55] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[18:16:55] To log a message, type !log .
[18:17:02] Oh, it does. :o Okay.
[18:31:26] 6Labs, 6operations, 10ops-eqiad: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1862850 (10Cmjohnson) moved to b3 connected asw-b-eqiad ge-3/0/19 updated vlan to labs-b updated dns https://gerrit.wikimedia.org/r/#/c/257654/1 currently it's set to install lvm.cfg...
[18:51:56] randomly just noticed, instances can be assigned to labvirt1010 but it's not listed in ganglia
[18:59:00] 10Tool-Labs-tools-Other, 6Community-Tech, 6Community-Tech-fixes, 7Tracking: Improving Magnus' tools (tracking) - https://phabricator.wikimedia.org/T115537#1863013 (10DannyH)
[19:01:38] YuviPanda: can you have a look at https://phabricator.wikimedia.org/T120817 ?
[19:02:22] oo yes
[19:04:02] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review, 7Puppet: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1863031 (10chasemp) >>! In T120817#1862387, @Andrew wrote: > I think the ENC is broken. It should be mergin...
[19:06:02] chasemp: hey! were they getting no role::labs::instance applied at all?
[19:06:05] chasemp: that's defined in site.pp
[19:07:22] deployment-tin.deployment-prep.eqiad.wmflabs was not afaik but then again puppet was broken there with a bad ENC anyways so it could be a consequence of that BUUUT...we had a discussion about puppetmaster self in general and the fact that you could easily screw up that site.pp fallthrough, possibly even legit
[19:07:36] and it may not be the best candidate for the wild west that is self hosted puppet master
[19:07:37] ikd
[19:07:39] idk even
[19:07:50] I... don't understand what you just said.
[19:07:55] :D
[19:08:10] :)
[19:08:16] chasemp: from that, the problem was only if you added *more* things to site.pp
[19:08:28] actually
[19:08:30] wait
[19:08:33] ok slowing down
[19:08:35] I know how to fix this!
[19:08:41] deployment-prep doesn't need this ENC
[19:08:44] so the ENC lookup for the labs host done on the labs project puppet master was failing
[19:08:45] since it isn't using it at all
[19:08:50] so it blocked puppet entirely
[19:09:08] but it was failing with no roles applied maybe from the migration from ldap to maniphest idk
[19:09:23] no, I killed that a few weeks ago
[19:09:23] and it was getting nothing entirely I assume because of ENC breakage
[19:09:35] so role::labs::instance is applied from site.pp
[19:09:38] rather than from LDAP
[19:09:40] right
[19:09:49] and I tested to make sure that puppet runs fine
[19:09:51] but that file itself is subject to change right on a self hosted puppet master?
[19:09:54] we can't rely on it can we?
[19:10:01] what do you mean by 'that file'
[19:10:06] site.pp
[19:10:20] 'change' as in 'people can take out role::labs::instance?'
[19:10:27] or do anything really
[19:10:28] if they do their instance is fucked and not much we can do :D
[19:10:30] yeah
[19:10:39] but if you use puppetmaster self you get a lot more power with your responsibility
[19:10:44] there isn't much we can do about that
[19:10:46] well yes but my point is the idea is to have a master you can manipulate and changing that
[19:10:48] would be a stretch
[19:10:56] manipulate what?
[19:11:01] so it may be more persistent to not have it be the source of truth in this case
[19:11:05] manipulate the puppet master configuration
[19:11:26] well, with role::labs::instance in wikitech/ldap you can accidentally uncheck that box too from the configure page...
[19:11:32] which applied to all hosts
[19:11:51] and I also am not sure how that relates to this bug at all? this is a crappy ENC I wrote that only 2 projects are using (integration and staging)
[19:11:54] oh — was the issue that people removed that setting from site.pp?
[19:12:19] no let's back up as I'm off in left field you guys I think
[19:12:33] the ENC was broken; how you want to fix it is the first course of action I imagine
[19:12:55] am pretty sure andrewbogott's patch will fix the ENC
[19:13:01] also why can't I ssh to deployment-puppetmaster?!
[19:13:19] YuviPanda: ldap is overloaded, probably breaking auth
[19:13:19] could be ldap
[19:13:24] it's having issues
[19:13:24] moritz is fixing it I think
[19:13:27] ah
[19:13:29] ok
[19:13:32] I'll just wait it out then
[19:13:47] alright we can totally ignore the no-role-applied-via-ldap case
[19:13:51] I'm not opposed to that
[19:14:02] but we can't guarantee even slightly the role applied via site.pp on a self hosted master
[19:14:09] by nature of it being a self hosted master
[19:14:20] it may make sense to just hard-code that in the ENC as such
[19:14:28] and then there is never a no-role-applied case at all anyways
[19:14:30] but the ENC isn't used by all self hosted puppetmasters...
[19:14:57] I hate the self hosted puppet master use case
[19:15:10] I hate the code :D I'm killing it bit by bit!
[19:15:24] > but we can't guarantee even slightly the role applied via site.pp on self hosted master
[19:15:25] the whole use case is just a can of worms with so little value
[19:15:29] chasemp: ^ I don't understand that
[19:15:49] chasemp: I disagree about 'little value'
[19:15:59] you can test everything you need via puppet apply
[19:16:03] except exported resources
[19:16:05] chasemp: maybe for you but you have +2 on operations/puppet
[19:16:16] so all this silly dance is just for that which isn't being used in labs
[19:16:28] chasemp: no
[19:16:39] we regularly have cherry-picked patches
[19:16:57] it's tangential here to the extreme, but sure we could do that in another way but we aren't so it doesn't matter
[19:17:05] if I change something and test it with puppet apply then the puppet agent comes along and puts it back the way it was
[19:17:32] assuming that's set up to happen then well yes
[19:17:54] so without a self hosted puppetmaster I can't override anything that's in operations/puppet
[19:18:26] you are not understanding what I'm lamenting at all
[19:18:40] the disconnect with using puppet masters to manage instances and puppet masters to disconnect instances for testing
[19:18:44] simultaneously
[19:19:05] I don't have an opinion at all on the cases you are communicating
[19:19:17] https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&type=revision&diff=221373&oldid=211419 should fix deployment-prep
[19:19:21] at least for this discussion
[19:19:25] once puppet runs on the puppetmaster
[19:19:37] chasemp: ok I guess I don't understand
[19:20:08] > the disconnect with using puppet masters to manage instances and puppet masters to disconnect instances for testing
[19:20:15] I don't understand what that means either, chasemp
[19:20:17] YuviPanda: was that to remove the enc entirely and be done w/ it?
[19:20:18] Do you mean managing the puppetmaster with another puppetmaster?
[19:20:32] well yes and no
[19:20:35] so we have an instance in labs
[19:20:41] and it's on the primary puppet master
[19:20:44] and all is well
[19:20:59] and then someone wants to test some ops/puppet crap and the best way to do that is to set up a limited-scope puppet master
[19:21:03] and to move the client over to it
[19:21:03] chasemp: this particular ENC is to read files from nodes/labs/ in ops/puppet and 'merge' that with LDAP. since there's nothing for deployment-prep in nodes/labs, it was just using LDAP. might as well directly point it at LDAP.
[19:21:26] YuviPanda: understood, is that used anywhere else out of curiosity?
[19:21:41] chasemp: in the staging project and in the integration project, but not sure if integration actually is using it
[19:21:44] chasemp: staging is
[19:22:09] maybe it has the same breakage then and requires the allowance for no ldap roles?
[19:22:15] even though it hasn't hit I guess
[19:22:36] chasemp: yes, and i think andrewbogott's patch is the right thing...
[19:22:51] I think hashar had a patch
[19:22:53] I think hashar wrote that, actually?
[19:22:55] is that the one you mean?
[19:23:08] what?
[19:23:10] bah
[19:23:12] I clearly cannot read
[19:23:17] yes hashar's patch
[19:23:39] ah it was abandoned actually, can you un-abandon things?
[19:23:59] chasemp: since for all labs instances now, role::labs::instance comes from site.pp and anything additional comes from the LDAP 'terminus' thingy (aka if there are no additional roles the LDAP terminus returns [])
[19:24:03] we should mimic that here
[19:24:08] well I just went around the KeyError with a default
[19:24:14] yes
[19:24:19] i.e. dict.get('somekey', [])
[19:24:21] YuviPanda: but not on all instances is it the same site.pp correct?
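The fix being discussed (default to an empty role list when a host has no 'puppetClass' attribute, rather than raising KeyError) can be sketched as below. The function name and return shape are illustrative only, not the actual ldap-yaml-enc.py code:

```python
# Hypothetical sketch of the ENC fix discussed above: tolerate hosts whose
# LDAP entry has no 'puppetClass' attribute by defaulting to an empty
# role list; role::labs::instance itself comes from site.pp.

def node_classes(host_info):
    """host_info: dict of LDAP attributes for one host; may lack
    'puppetClass' entirely when the host has no extra roles."""
    # Before the fix: host_info['puppetClass'] -> KeyError: 'puppetClass'
    return {"classes": host_info.get("puppetClass", [])}

assert node_classes({"puppetClass": ["role::ci::slave"]}) == \
    {"classes": ["role::ci::slave"]}
assert node_classes({}) == {"classes": []}  # no roles: empty, not a crash
```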
[19:24:37] on some it's a site.pp on 'bobs puppet master' and on others it's from the main labcontrol site.pp
[19:24:38] but chase mentioned we should also always add role::labs::instance and went with a patch doing that
[19:24:46] but handling the KeyError with try/except
[19:24:53] > YuviPanda: but not on all instances is it the same site.pp correct?
[19:24:57] I don't understand that either
[19:25:15] are you trying to cover the use case where someone hacks up site.pp to remove role::labs::instance?
[19:25:35] 10PAWS, 10pywikibot-core: Install developer requirements into PAWS - https://phabricator.wikimedia.org/T120860#1863099 (10jayvdb) 3NEW
[19:25:38] YuviPanda: or maybe out-of-date puppetmasters?
[19:25:43] * andrewbogott thinks that case should be permitted
[19:25:51] twentyafterfour: right, but all those people are just shooting themselves in the foot
[19:26:02] right but it's still our problem?
[19:26:08] twentyafterfour: role::puppet::self auto-updates itself every 10 minutes so you need to go out of your way to turn that off
[19:26:13] there should be some level of global always-include
[19:26:16] so if you go out of your way to turn that off and then fuck stuff up...
[19:26:17] for dns changes and ldap things
[19:26:24] and right now we do not have that covered well
[19:26:28] chasemp: yes, but this is just a small ENC used in two projects only...
[19:26:41] I totally get that isn't the right place
[19:26:48] I'm not advocating that patch at all
[19:26:57] yes, so I think there's like 4-5 different conversations mixed up here...
[19:27:01] the convo shifted to self hosted puppet master theory with twentyafterfour's questions
[19:27:03] yes
[19:27:05] ah
[19:27:08] ok
[19:27:17] https://phabricator.wikimedia.org/T120159 is my only feelings on that
[19:27:55] afa that ENC or that patch or whatever, if it was used everywhere I might want to force labs::instance on ppl that way
[19:27:56] sorry I didn't mean to derail this conversation
[19:28:44] chasemp: people have root on things, so you can't really force anything.
[19:29:19] it's true but you can make it so ingrained as to be ridiculous to remove and not have the removal be in the path of any normal (maybe legit) use case
[19:29:22] if someone explicitly removes an include from site.pp, IMO they got what's coming to them. if you go out of your way to disable auto-updates in your self hosted puppetmaster and do not update it yourself, you have what is coming to you
[19:29:40] chasemp: I don't think as a team of 4ish we can support that use case :)
[19:29:53] so we should allow people to shoot themselves in the foot if they choose to, as long as the default isn't foot-shooty
[19:30:04] wait
[19:30:11] I misunderstood what you're saying
[19:30:21] I'm just saying it's all pointless. if people want to shoot themselves in the foot they will.
[19:30:28] we can't really put safeguards around it
[19:30:48] we can't prevent bad-faith use for sure
[19:30:58] and even if we do, this patch is completely unrelated :D the enc requires you to get stuff merged into ops/puppet to actually use it...
[19:31:01] but atm I would say site.pp is a valid file to modify on a self hosted master
[19:31:12] and doing so can have bizarre, more globally labsy consequences
[19:31:13] and that's not ideal
[19:31:19] 'global consequences'?
[19:31:37] you no longer get things all instances are meant to have regardless of what project they are in if you remove the wrong line
[19:32:03] yeah, but that's if you point gun at foot and pull trigger....
[19:32:04] chasemp: so you're suggesting that we separate the base labs roles from stuff people might want to customize?
[19:32:24] twentyafterfour: you can already customize it. just not remove the include role::labs::instance
[19:32:32] YuviPanda: yes agreed people can screw themselves over but we make it very easy to do is all I'm suggesting
[19:32:35] which you could do earlier by unchecking a tickbox in 'configure'
[19:32:38] probably too easy
[19:32:43] chasemp: no, I think site.pp is *far* harder than the previous setup
[19:32:51] earlier you just had to accidentally uncheck one box in the configure page
[19:32:53] now
[19:32:58] well the relativeness of foot-shooting isn't a good metric
[19:33:02] as it could go from terrible to bad
[19:33:04] and still be bad
[19:33:05] lol
[19:33:25] I don't think this is a very useful use case, but anyway, it's unrelated to the patch and I'm not sure what actionables we can get out of it
[19:33:43] totally unrelated and none atm
[19:33:48] right
[19:33:50] ok
[19:33:55] chasemp: I do get what you're saying now, at least ;)
[19:34:02] and I agree
[19:34:08] can you remove your -1 from https://gerrit.wikimedia.org/r/#/c/257606/
[19:34:20] I should kill that ENC too, staging is dead :(
[19:36:14] Minion did not return. [No response]
[19:36:14] `
[19:36:31] so how do you ops guys keep your sanity with salt being barely reliable?
[19:36:53] barely reliable? you mean totally unreliable?
[19:36:57] by never using salt :D
[19:37:03] by hating salt and that^
[19:37:08] yeah
[19:37:23] time to have scap support running arbitrary commands
[19:37:30] oh dear god no
[19:37:46] :)
[19:37:48] scap run 'whatever'
[19:37:53] good news: python's ssh client lib has been updated to support hmac256 so something like fabric should be usable again?
[19:38:03] cause 99% of my commands are just salt --timeout 10 --show-timeout '*' cmd.run
[19:38:08] do you mean paramiko?
twentyafterfour [19:38:09] hashar: yes scap run 'whatever' is something I already built actually
[19:38:14] chasemp: yeah
[19:38:19] ah interesting
[19:38:22] I thought it was mostly dead
[19:38:23] 6Labs, 6operations, 10ops-eqiad: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#1863190 (10Cmjohnson) a:5Cmjohnson>3Andrew OS is installed...no puppet certs. Assigning to @andrew
[19:38:25] * YuviPanda uses pssh with a small generator script
[19:38:26] chasemp: not dead
[19:38:39] YuviPanda: did you ever check out http://mig.mozilla.org/
[19:38:48] not suggesting it but the model is super like mcollective
[19:38:54] and it's pretty neat
[19:39:19] nah, I just rage at salt and make snarky comments
[19:39:50] it's become this messy problem that nobody (esp. me!) wants to touch
[19:40:06] I was waiting for apergos to declare salt mental bankruptcy
[19:40:15] and then I was going to mcollective up on labs and ignore the rest
[19:40:19] I guess
[19:40:29] it's all terrible
[19:40:35] all software sucks
[19:40:54] I am going to move to fabric and ansible
[19:40:58] well yes :) "All models are bad, some are useful"
[19:40:59] everything is DOOOOMED
[19:41:33] 26710 ? Sl 0:17 /usr/bin/python /usr/bin/salt-minion -d
[19:41:34] 28473 ?
Ssl 0:17 /usr/bin/python /usr/bin/salt-minion [19:41:37] ^^^top cause of issues [19:41:44] we end up having some dupe process [19:41:46] chasemp: mig looks neat [19:42:25] hashar: twentyafterfour https://github.com/yuvipanda/personal-wiki/blob/master/project-dsh-generator.py [19:42:28] I use that with pssh [19:42:41] and https://github.com/yuvipanda/personal-wiki/blob/master/tools-dsh-generator.py when I want to do something tools specific [19:42:46] super low overhead and just works [19:42:56] YuviPanda: interesting [19:42:59] I may steal it [19:43:17] chasemp: :D it has an option to generate a list of *all* labs hosts too [19:43:22] has saved me in the past [19:44:31] well [19:46:24] YuviPanda: that's awesome, that's a feature I should add to iscap [19:46:33] I should just fork iscap into its own tool by now [19:46:46] that is... make it a separate tool instead of part of scap [20:06:24] just checking...a self hosted puppetmaster in labs should have the ldap-yaml-enc.py script? [20:06:57] ebernhardson: nope, not unless you’re doing something fancy [20:07:06] YuviPanda: may correct me [20:07:22] andrewbogott: ok, it doesn't so that means everything is right :) thanks [20:07:27] ebernhardson: nope [20:07:41] ebernhardson: only if you are doing fancy things :D [20:07:48] ah same as what andrewbogott said [20:26:03] 6Labs, 10Labs-Infrastructure: Update /etc/hosts during labs instance first boot - https://phabricator.wikimedia.org/T120830#1863533 (10scfc) This would also "fix" T119672. Note though that for instances in #Tool-Labs, `/etc/hosts` is overwritten by Puppet. [20:32:45] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review, 7Puppet: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1863574 (10hashar) I have rebased puppet on deployment-puppetmaster and ran puppet agent on it to update the...
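The project-dsh-generator.py approach YuviPanda describes above (generate a host list, feed it to pssh) can be sketched roughly like this. This is a hedged sketch, not the real script: the actual generator queries the OpenStack/LDAP APIs for instance names, which is stubbed out here with a hypothetical list, and the `<instance>.<project>.eqiad.wmflabs` naming follows the hostnames seen elsewhere in this log.

```python
# Sketch of a dsh-style host-list generator for use with pssh.
# The real project-dsh-generator.py pulls instance names from the
# OpenStack/LDAP APIs; here the lookup is replaced by a literal list.

def labs_fqdns(project, instances, domain="eqiad.wmflabs"):
    """Turn bare instance names into <name>.<project>.<domain> FQDNs."""
    return ["{}.{}.{}".format(name, project, domain) for name in instances]

if __name__ == "__main__":
    # Hypothetical instance list; the real script would query the API.
    hosts = labs_fqdns("tools", ["tools-worker-05", "tools-worker-07"])
    # One host per line is the format both dsh and pssh accept.
    print("\n".join(hosts))
```

Saved to a file, the output can be used as `pssh -h hosts.txt -t 10 -- uptime`, which mirrors the low-overhead `salt ... cmd.run` use case discussed earlier.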
[20:34:32] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review, 7Puppet: ldap-yaml-enc.py fails with host_info['puppetClass'] --> KeyError: 'puppetClass' - https://phabricator.wikimedia.org/T120817#1863581 (10chasemp) 5Open>3Resolved [20:34:41] chasemp: :-} [21:59:33] YuviPanda: is ldap still a bit sad? My SMW queries seem to be not right -- https://wikitech.wikimedia.org/wiki/Help:MediaWiki-Vagrant_in_Labs/Hosts [22:00:01] bd808: andrew would know except irccloud is having issues I guess [22:00:16] bd808: also http://tools.wmflabs.org/watroles/role/role::puppet::self :D [22:04:07] bd808: it should be sorted now [22:11:07] bd808, I don't think SMW talks directly to LDAP [22:15:25] Krenair: I suppose that's probably right. I wonder if maybe this was caused by the jobs being messed up for a while [22:48:57] nice /topic [23:08:14] labs hosts reject my ssh connects - could it be related to LDAP stuff? [23:08:59] chasemp: ^ can you take a look? [23:09:05] Coren: ^ you too if you're around [23:09:20] SMalyshev: what vm and what's your username? [23:09:31] smalyshev [23:09:40] db01.eqiad.wmflabs [23:10:04] can log in to bastion.wmflabs.org but not to my labs machines [23:10:25] SMalyshev: it is a client of wdqs-puppetmaster.wikidata-query.eqiad.wmflabs [23:10:37] can you update your puppet master to propagate new ldap settings I wonder? [23:10:40] looking to see here [23:11:04] although, old stuff isn't down yet [23:11:05] chasemp: no I can't since I can't access that either :) [23:11:14] so it would be weird if this didn't work for that reason I think [23:11:24] SMalyshev: :) [23:12:10] production access works fine so it's something in labs [23:12:57] hmm.
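For context on the ldap-yaml-enc.py failures discussed above: a Puppet external node classifier (ENC) is just an executable that Puppet calls with the node's certname and that prints a YAML document with a `classes` key (and optionally `parameters`). The sketch below is a minimal generic ENC, not the Wikimedia script; ldap-yaml-enc.py does the same job but resolves the node via LDAP, which is replaced here by a hypothetical lookup table.

```python
#!/usr/bin/env python
# Minimal sketch of a Puppet external node classifier (ENC).
# Puppet invokes the ENC with the node's certname as argv[1] and
# expects YAML with "classes" (and optionally "parameters") on stdout.
# ldap-yaml-enc.py looks the node up in LDAP instead of this table.
import sys

# Hypothetical classification data; the real script queries LDAP.
NODES = {
    "tools-worker-05.tools.eqiad.wmflabs": ["role::labs::instance"],
}

def classify(certname):
    """Return the YAML classification document for one node."""
    classes = NODES.get(certname, ["role::labs::instance"])
    # Emit YAML by hand to keep the sketch dependency-free.
    lines = ["classes:"] + ["  - {}".format(c) for c in classes]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    sys.stdout.write(classify(sys.argv[1]))
```

The T120817 KeyError above is the failure mode of this pattern: if the backend lookup returns a record without the expected key, the ENC crashes instead of emitting a default classification, so defensive `.get()` access matters.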
in fact I can access wdq-redirect.eqiad.wmflabs which is in the same group [23:13:56] well I think it's indeed an ldap thing not a you thing [23:14:03] and I'm not sure where to address so I'm looking [23:14:05] {'info': 'TLS: hostname does not match CN in peer certificate', 'desc': 'Connect error'} [23:15:27] interestingly enough https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=6978ff87-d776-45fd-9a2e-4412d3926add&project=wikidata-query&region=eqiad claims that host (wdq-beta) does not exist [23:15:32] also https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&instanceid=8978fa1c-c37a-4b69-bef8-6bf049d56aac&project=wikidata-query&region=eqiad for db01 [23:15:49] so something is broken in wikitech too... [23:18:43] wikitech is probably just pulling data from nova & ldap [23:20:04] Although nova knows wdq-beta exists [23:20:21] what is the error you're getting there SMalyshev? [23:33:30] it's related to old values in root@db01:~# cat /etc/ldap.yaml [23:33:36] but I'm trying to track back on why [23:33:56] really tho it's old values there and certificate issues on those old values [23:36:17] weird problem - all of a sudden I can't ssh into wikimetrics1.eqiad.wmflabs [23:36:23] I was able to just this morning [23:36:39] and I can still ssh into wikimetrics-staging1.eqiad.wmflabs (those are set up almost exactly the same) [23:37:03] milimetric: does it have a self hosted puppet master in the project? [23:37:09] yes [23:37:23] sorry, chasemp, yes [23:37:24] check to see last pull in [23:37:39] milimetric: ldap things changed and though it shouldn't be stale on the old boxes it appears it is [23:37:45] there are relevant and needed ldap changes I believe [23:37:55] SMalyshev: try again on db01.wikidata-query.eqiad.wmflabs [23:38:02] aha, so how can I login there to take care of it? [23:38:12] milimetric: :) which box is the puppetmaster?
[23:38:31] oh, chasemp, they're all their own puppetmasters [23:39:02] every vm in that project is a self puppet master? [23:39:05] it looks like wikimetrics1, limn1 don't work and wikimetrics-staging1 works [23:39:21] chasemp: no, I mean the ones that don't work, limn1 and wikimetrics1 [23:39:43] both are self hosted puppet masters then? [23:39:48] wikimetrics-staging1 is also its own puppet master, but that must've been updated today when we were doing deploys [23:39:48] chasemp: works now, thanks! [23:39:59] chasemp: yes, both are self-hosted puppet masters [23:40:06] SMalyshev: I didn't update every vm in the project but I did do that one and the master [23:40:13] and you guys will have to sort out the git things there a bit [23:40:16] no lost stuff but [23:40:22] it was wayyyyyyyyyy behind [23:40:24] and in a bad state [23:40:26] at the time [23:40:31] milimetric: ok [23:40:33] give me a sec here [23:40:46] np :) sorry us self-hosters always give you problems [23:41:03] we've had a task to fix it for a very long time :/ [23:41:15] * PopCornPanda waits for limn1 to become unrecoverable one of those days [23:41:45] PopCornPanda: that's not possible, I know the setup of that box by heart because I did it like 10 times manually [23:41:46] chasemp: yes, wdqs-puppetmaster works, but not wdq-beta :( [23:42:05] milimetric: sure, but if you can't ssh in... :) [23:42:18] delete / create instance [23:42:19] SMalyshev: it should check in in a few?
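The staleness problem being debugged here (clients still pointing at old LDAP servers via /etc/ldap.yaml, producing the TLS hostname-mismatch errors quoted earlier) could be checked mechanically along these lines. A hedged sketch only: it assumes a top-level `servers:` YAML list in the file, which is an assumption about the file's layout, and the server names in the example are hypothetical.

```python
# Sketch of the staleness check chasemp was doing by hand: read the
# LDAP server list out of /etc/ldap.yaml and flag any entry that is
# not in the expected (new) set. Parsing is done with plain string
# handling to avoid a PyYAML dependency; it only understands a simple
# top-level "servers:" block of "- host" items.

def stale_servers(yaml_text, expected):
    """Return servers listed in the file that are not in `expected`."""
    servers, in_servers = [], False
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("servers:"):
            in_servers = True
            continue
        if in_servers:
            if stripped.startswith("- "):
                servers.append(stripped[2:].strip())
            elif stripped and not line.startswith((" ", "\t")):
                in_servers = False  # left the indented servers block
    return [s for s in servers if s not in expected]
```

On a client this would run against `open('/etc/ldap.yaml').read()`; a non-empty result means puppet has not yet propagated the new settings to that box.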
I'll try to come back but I have some deeper cause stuff I need to look at [23:42:21] I'm sure you can recreate it, but that box will die sometime [23:42:27] yeah, true [23:42:32] I've faced that truth a while back [23:42:35] :) [23:42:59] but thanks to you the fab deploy tool works so it's pretty easy to recover [23:43:13] limn1 is one of the primary reasons I drove ebernhardson to make the shiny dashboard thing not be in ops/puppet [23:43:43] yeah, lessons learned [23:43:54] +1 [23:44:00] PopCornPanda: I'm interested in your thoughts on how I should puppetize dashiki dashboards [23:44:11] milimetric: what're they running on? [23:44:14] milimetric: uwsgi? [23:44:17] limn1 of course [23:44:26] milimetric: haha, no, I mean the tech stack :D [23:44:31] oh, no, they're static html sites that you build by passing wiki pages as configs [23:44:47] oh I see. so you just need a static file server? [23:45:07] basically. the deploy mechanism could do the building [23:45:11] +1 [23:45:18] and fabric seems easy enough to leverage for that [23:45:38] milimetric: there is the role::simplelamp which has a simple apache / mysql / php server, that I recommend using. it'll serve static files just fine [23:45:41] the static site config would really just need apache to point a proxy to a folder [23:45:46] and the php / mysql would just not be doing anything [23:45:54] you just put stuff in /var/www/ and be done with it [23:46:07] oh, PopCornPanda but it would need an arbitrary number of dashboards per instance [23:46:16] what does that mean? [23:46:21] different domain names? [23:46:23] milimetric: bit of git conundrum there on wikimetrics1 and I'm not sure how you want to resolve it [23:46:31] so I disabled puppet and fixed the relevant file manually [23:46:36] so you can kinda self serve there [23:46:38] can you try to login?
[23:46:40] so like edit-analysis.wmflabs.org would have a virtual host that points to /var/lib/dashiki/dist/compare-VisualEditorAndWikitext [23:46:56] (I didn't want to commit the password stuff and testing stuff but it's all kinds of conflicty) [23:46:57] * milimetric tries to login to wikimetrics [23:47:14] milimetric: ok, so that seems like a generic use case for 'serve multiple domains of static files off one server'. Does that sound right? [23:47:14] thx chasemp I logged in! [23:47:22] ok limn1 is in the same state [23:47:27] I'll go clean the git there [23:47:34] take a gander at wikimetrics1:/var/lib/git/operations/puppet [23:47:37] cool [23:47:39] but basically we just have the custom commit as HEAD [23:48:10] well I may have tripped over my own feet in trying to pull and created an unneeded rebase situation but there was a conflict there already [23:48:13] ++>>>>>>> d041010... ATENTION: DO NOT PUSH TO PRODUCTION! [23:48:23] so uh I'm gunna let you take a peek at that one :) [23:48:25] YuviPanda: yes, that generic description sounds good to me, I'm interested more in how I can leverage hiera to set up an arbitrary number of such things on the same instance [23:48:44] SMalyshev: rest of the project VMs should catch up here shortly [23:48:49] milimetric: yup, I'll have a solution for the generic case sometime this week. so you can add a host via hiera and it'll setup a folder and fab can push / symlink as it sees fit [23:48:51] yeah, I'll probably just reset [23:48:58] ooh, YuviPanda sounds fun :) [23:49:02] milimetric: can you open a bug? [23:49:03] YuviPanda: FYI stale /etc/ldap.yaml on self hosted things is causing ppl to look at old ldap [23:49:06] which should be up right [23:49:13] but it's throwing cert errors for mismatch hostname [23:49:17] YuviPanda: sure, lemme do this git cleanup first [23:49:36] chasemp: will keep in mind. limn1 / wikimetrics1 are special instances though, since they have massive commits on top that cause rebase conflicts...
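The generic case discussed above ("serve multiple domains of static files off one server", driven by hiera data) could be sketched as rendering one Apache VirtualHost per hiera entry. This is only an illustration of the idea, not YuviPanda's eventual puppet module: the domain/docroot pair is the example milimetric gave, the template is a bare-bones assumption, and in practice the rendering would be done by a puppet template rather than Python.

```python
# Sketch: take a hiera-style mapping of domain -> docroot and render
# one Apache VirtualHost stanza per entry, so an arbitrary number of
# static dashboards can live on one instance.

VHOST_TEMPLATE = """<VirtualHost *:80>
    ServerName {domain}
    DocumentRoot {docroot}
</VirtualHost>
"""

def render_vhosts(sites):
    """sites: dict of domain -> docroot (what hiera would supply)."""
    return "\n".join(
        VHOST_TEMPLATE.format(domain=d, docroot=r)
        for d, r in sorted(sites.items())
    )

if __name__ == "__main__":
    print(render_vhosts({
        "edit-analysis.wmflabs.org":
            "/var/lib/dashiki/dist/compare-VisualEditorAndWikitext",
    }))
```

Adding a dashboard then means adding one hiera entry; the deploy tool (fab, per the discussion) only has to push files and flip a symlink under the docroot.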
[23:49:42] usually they auto update and do fine.. [23:49:47] but yeah, I'll keep in mind chasemp [23:49:55] brb getting off train thingy [23:50:53] YuviPanda: fwiw ssh-key-ldap-lookup was a pleasantly well written surprise :) [23:50:54] so thanks [23:54:20] chasemp: ok, wdq-beta still not happy. Will wait for 10 mins [23:56:26] chasemp: I'm getting "sudo: ldap_start_tls_s(): Connect error" [23:56:30] SMalyshev: something else is up there [23:56:32] (when I try to sudo) [23:57:00] chasemp: ah, now it works! [23:57:17] milimetric: you probably need to puppet up on the master: run puppet agent --test on the master [23:57:24] and then puppet agent --test on any clients? [23:57:28] YuviPanda: these aren't massive commits, btw. In limn1's case it's more annoying, but wikimetrics1 really just adds a passwords file and includes it [23:57:34] chasemp: what should ldap.yaml say? [23:57:44] etc/ldap.yaml [23:57:49] chasemp: but I can't do any of that without sudo though [23:57:53] omitting the passwords :) [23:57:58] milimetric: :) [23:59:13] milimetric: puppet is borked on wikimetrics: No file(s) found for import of 'stages.pp' at /etc/puppet/manifests/site.pp:16 on node wikimetrics1.analytics.eqiad.wmflabs [23:59:19] let me peek here
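The refresh sequence chasemp describes above (agent run on the self-hosted puppetmaster first, so it serves a current catalog, then on each client) can be written down as an ordered command plan. A hedged sketch: the hostnames are illustrative, and actually executing the plan would go over ssh with sudo, which is deliberately left out here.

```python
# Sketch of the self-hosted puppetmaster refresh order: master first,
# then clients. Only builds the ordered (host, command) plan; running
# it would require ssh access and sudo on each host.

AGENT_CMD = "sudo puppet agent --test"

def refresh_plan(master, clients):
    """Return (host, command) pairs in the order they should run."""
    plan = [(master, AGENT_CMD)]
    plan.extend((c, AGENT_CMD) for c in clients)
    return plan

if __name__ == "__main__":
    for host, cmd in refresh_plan(
        "wikimetrics1.analytics.eqiad.wmflabs",  # illustrative master
        ["limn1.analytics.eqiad.wmflabs"],       # illustrative client
    ):
        print("{}: {}".format(host, cmd))
```

Ordering matters because a client agent run against a stale master just re-applies the stale catalog; that is exactly the /etc/ldap.yaml trap hit earlier in this log.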