[00:26:45] Quick Q: When I add or remove a puppet class from the puppet group list for an instance, how long should it take for this to be in effect on the puppet master? [00:32:17] cajoel: should be as soon as you run puppet again. [00:32:30] cajoel: you can traverse ldap yourself to triple check it's gone or not :) [00:32:34] unless ldap's broke [00:32:42] which happens once in a while [00:33:04] jeremyb: would you mind double checking the Class settings for the group? [00:33:10] maybe I should create a new group? [00:33:24] errr, i've not done much with the ENC [00:33:39] cajoel: did you check ldap yourself? [00:33:56] I did not, didn't even realize ldap was holding these relationships.. :) [00:34:14] wouldn't know where to start (which ldap server? which creeds?) [00:34:24] For the curious: ENC? [00:34:36] external node classifier [00:34:54] in this case ldap is the store but OSM is the ENC [00:35:25] cajoel: try e.g. ldaplist on bastion1 [00:35:43] Ah, okay. BTW, I never had trouble with the Configure page being different from what Puppet sees. [00:36:01] jkrauska@bastion1:~$ ldaplist [00:36:02] The search returned an error. [00:36:21] cajoel: well you need to give params :) [00:36:56] If there's one thing I hate more than ldap, it's command line ldap.. :) [00:37:18] hah [00:37:24] cajoel: Which instance are you working on? [00:37:30] flow1 [00:37:47] trying a brute force change of the group [00:38:12] gist: geoip is failing for me, and I'd like to remove it.. I thought I did, but it's still coming across on puppet runs... [00:39:11] cajoel: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000096d still lists "geoip". Where did you disable it? [00:40:35] https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup [00:41:17] That only customizes the UI. [00:41:53] erm... /me stabs ldap [00:42:33] cajoel: So you need to re-add the class to NovaPuppetGroup, then Configure, disable, then remove the class from NovaPuppetGroup (if you want). [00:43:17] yikes [00:43:28] hah [00:44:25] cajoel: errr, there's a mediawiki feature called flow... [00:44:36] scfc_de: what's scary is that I followed your instructions.. :) [00:45:06] also, are you doing anything with AS #s? [00:45:10] (meaning understood) [00:45:33] jeremyb: BGP AS #s? [00:45:53] ya [00:46:04] um, yes.. that's the intent of the project [00:46:18] tallying bytes / AS [00:46:36] from flow stats on routers [00:46:47] right, cool [00:46:58] cajoel: Jumping in late… the 'NovaPuppetGroup' page allows you to configure the set of puppet classes available in a project. [00:47:06] … which is a different question from 'which classes are on this instance.' [00:47:06] andrewbogott: I grok now [00:47:11] Ah, ok. [00:47:27] i've skipped grokking :) [00:47:52] what's not fun is that you can configure a set, delete it, and it's still applied on your instance.. :) [00:48:40] well, and it's not fun that I can't seem to use the geoip class on labs. [00:49:23] maybe it's a city vs. lite thing [00:49:30] and volatile repo [00:49:35] appears to be something with the volatile repo.. [00:49:36] cajoel: Indeed… theoretically that 'NovaPuppetGroup' page should be phased out and the set automatically generated… but that's pending the Great Puppet Refactor. [00:49:54] Error 400 on SERVER: Not authorized to call find on /file_metadata/volatile/GeoIP [00:50:00] right [00:50:08] so figure out how to make it use lite :) [00:51:02] class geoip::data::lite [00:51:11] can I get a $1? 
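
For anyone who wants to "traverse ldap yourself" as suggested above, here is a rough sketch using plain ldapsearch instead of ldaplist. The server URI, base DN and attribute names are assumptions from memory, not anything confirmed in this log; /etc/ldap.conf on a bastion has the real values.

    # Hypothetical sketch: ask LDAP which puppet classes/variables the ENC (OSM)
    # has recorded for an instance, e.g. flow1 (i-0000096d). Adjust the server,
    # base DN and attribute names to whatever /etc/ldap.conf on a bastion says.
    ldapsearch -x -H ldap://virt0.wikimedia.org \
        -b 'ou=hosts,dc=wikimedia,dc=org' \
        '(dc=i-0000096d.pmtpa.wmflabs)' puppetClass puppetVar

If a class still shows up in the puppetClass attributes after you thought you removed it, that is exactly the NovaPuppetGroup-vs-Configure confusion that gets untangled a few lines later.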
[00:51:30] no bitcoin for u [00:52:17] maybe we should drop an environment check and force lite for labs? [00:58:00] Hi all, any way to access the tool-labs replica dbs from a different labs project? [00:59:52] dschwen: Yes. But you need a) to copy /etc/hosts and iptables -t nat and b) accounts on LabsDB -- I'm not sure if the latter are automatically created for service groups in all Labs projects. [01:00:32] oh, I missed the nat part [01:00:50] (You can do without hosts and iptables, but than you have to connect to labsdb0815:4711.) [01:01:03] Files are in /data/project/.system on Tools. [01:03:47] ferm seems to be failing in labs too [01:03:53] anyone know if ferm was tested in labs? [01:04:11] I realize it's kinda redundant with security groups... but.. [01:04:50] an ssh tunnel would do the trick as well for me, but for some reason I can only go from tools-login to my project and not the other way. [01:05:04] I assume there is some ssh agent magic happening [01:05:28] with my project set up to access a global authorized_keys file [01:05:34] cajoel: failing = ? [01:06:08] root@flow1:/etc/ferm# service ferm start [01:06:09] * Starting Firewall ferm Error in /etc/ferm/conf.d/10_nrpe_5666 line 8: [01:06:10] chain INPUT [01:06:11] { [01:06:13] proto tcp dport 5666 [01:06:14] { [01:06:15] saddr $ INTERNAL <-- [01:06:16] no such variable: $INTERNAL [01:07:36] Q: If there a class that was recently added to the production puppet master, how long until it hits labs? Do they sync? [01:07:58] +is [01:08:21] sounds like a question for abartov [01:08:22] gah [01:08:26] andrewbogott: [01:08:37] * jeremyb has been pinging asaf a lot today :) [01:08:49] cajoel: Labs uses the same branch as production for puppet. [01:08:59] So as soon as it's been merged in gerrit it should be available for labs. [01:09:13] andrewbogott: cannot find module pmacct ... which I added today. [01:09:28] will double-check my work [01:09:38] are you using self-hosted puppet? [01:09:46] andrewbogott: but how often is it synced from gerrit? [01:10:10] I am using selfhoster. [01:10:13] selfhosted [01:10:32] Ah, well, in that case you aren't getting your puppet classes from upstream but rather from the local checkout on your instance... [01:10:35] that's what 'self-hosted' means :) [01:10:49] So you'll need to update your local checkout. One moment and I will find you a link [01:11:32] cajoel, item one here: https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster#FAQ [01:11:53] crap, just unchecked role::puppet::self , and then hosed the SSL [01:12:25] "CAUTION: There is currently no easy way to rollback from role::puppet::self to the central puppetmaster. It's not hard to manually force puppet to run against the central puppetmaster, but there is no programmatic way of removing packages, git clones etc. " [01:12:25] I can dump the self hosted now. [01:12:35] CAUTION indeed.. [01:12:38] If you don't want to use self-hosted, best to start a fresh instance. [01:12:42] If you can stand it [01:12:46] I can stand it [01:12:49] but not at 5pm [01:12:56] :) [01:13:05] thanks for the help. [01:13:05] cajoel: Re ferm, what's the content of /etc/ferm/conf.d/00_* [01:13:22] # Autogenerated by puppet. DO NOT EDIT BY HAND! [01:13:22] it's 5pm??! 
weird coast [01:13:22] # [01:13:23] # 10_nrpe_5666: [01:13:25] cajoel: Start your new instance now, then it'll be all ready and happy when you come in tomorrow [01:13:25] domain (ip ip6) { [01:13:27] table filter { [01:13:28] chain INPUT { [01:13:30] proto tcp dport 5666 { saddr $INTERNAL ACCEPT; } [01:13:31] } [01:13:32] } [01:13:32] } [01:14:14] it's pretty fast to load, etc.. I have one package I'm building by hand until faidon makes me a deb [01:14:19] so it's not completely automated yet... [01:14:22] tomorrow [01:14:28] thanks folks [01:14:39] cajoel: No, I mean 00_*. That's where INTERNAL should be defined. [01:14:54] there are no 00 files [01:15:01] so it's maybe an ordering issue with ferm [01:15:09] it downloaded the 10 file [01:15:18] and then tried to restart ferm (notify) [01:15:20] and barfed [01:16:36] cajoel: You're right. It's defined in base::firewall only, and that's a complicated topic :-). [01:18:26] scfc_de: so I need to add base::firewall at a minimum? [01:18:40] and ferm should probably make it a require ?? [01:20:24] I have never used it in practice, but it looks that way; better ask paravoid though, I think he wrote the module. [01:29:53] cajoel: Are you using labs to design a system that will eventually be outside of labs? [01:30:05] Because for long-term labs projects you don't need to worry much about firewalls, labs instances are firewalled by default. [01:39:46] andrewbogott: I think cajoel is gone, but today's/yesterday's discussion on -operations was that he wants to set up NAT and use ferm for that. [06:55:04] andrewbogott, Coren: virt11.pmtpa.wmnet has 100% full on /var/lib/nova/instances [06:55:29] instances are failing there [06:56:38] Hm, I take it the scheduler isn't smart enough to not send new instances there :( [06:56:45] it is [06:56:58] it was at 98% previously [06:57:03] but instances are just using space [06:57:10] this one, for instance: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000621 [06:57:29] Coren: this pretty directly affect you, based on that instance ;) [06:57:34] *affects [06:57:47] that is eating 170GB of disk space on that host [06:57:54] oh, so instance storage is overprovisioned you mean [06:58:38] yeah, we'd underprovision everything else otherwise [06:58:49] this is eating 50G: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000040d [06:59:39] 300G eaten by _base [07:00:04] normally I would pester whoever created ve-roundtrip, but… it's you :) [07:00:43] it's not :) [07:01:08] "instancecreator_username=Ryan Lane" [07:01:11] yeah [07:01:19] I don't see how that's accurate :D [07:01:29] yeah, has no underscore [07:01:35] clearly a plant [07:02:19] I'm going to cold-migrate that instance to virt10 [07:02:39] yuvipanda, you don't know who's the admin of that box do you? [07:02:50] Ryan_Lane: works for me [07:03:06] no, sadly, I don't. But I think gwicke_away would, if he isn't the admin himself [07:03:34] it's gwicke_away [07:03:44] the tools db needs to be moved badly [07:03:48] to real hardware [07:04:07] and we need to fix the issue with ephemeral base images bring raw rather than being qcow2 [07:04:29] and we need to fix it upstream. I put a hack into the puppet repo to replace the code that does it [07:04:50] but it doesn't seem to be working properly [07:04:56] Yeah, I thought you already fixed new images so they started out minimal... [07:04:57] oh :( [07:05:35] a bunch of hosts are about to hit 100%, btw [07:06:00] Yeah -- I'm not sure what to do about that other than nag people to clean things up... 
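
For the record, the per-instance numbers Ryan quotes come from looking at the compute host's instance store directly; a minimal sketch, assuming the stock /var/lib/nova/instances layout:

    # Minimal sketch (not from the log): see what is filling a compute host --
    # overall usage, the shared _base images, and the largest instance directories.
    df -h /var/lib/nova/instances
    du -sh /var/lib/nova/instances/_base
    du -sh /var/lib/nova/instances/*/ 2>/dev/null | sort -h | tail -n 20

The instance-ID directory names can then be matched back to their wikitech Nova_Resource pages, as done above for I-00000621 and I-0000040d.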
[07:06:18] Juggling only helps if there's space to juggle to [07:06:23] you can cold migrate instances from the full ones to the less full ones [07:06:27] virt12 is at 30% [07:06:31] virt10 is at 70 [07:06:42] virt7 is at 75 [07:07:12] it's obvious we need more hardware at this point :) [07:07:25] I was hoping we'd just move to eqiad and increase the number of hosts there [07:07:35] and then ship the tampa hosts to eqiad [07:07:54] yeah, me too. there are a lot more hosts in eqiad. [07:08:00] Now it's a race... [07:08:28] this is what I just ran, btw: root@virt0:~# ./cold-migrate ve-roundtrip2 virt10 visualeditor [07:08:46] after it finishes I'm going to manually delete that instance's directory from virt11 [07:09:15] past freeing up a little space to keep things running, I don't have time to do any more juggling tonight [07:09:59] I was writing code and my instance just stopped responding :( [07:09:59] heh [07:11:03] Is bastion1 on virt11? [07:11:30] dunno [07:11:43] if you can't ssh to it, then likely :) [07:12:06] It's working now, just was unhappy earlier. [07:12:47] Hm, nope, it's on 10 [07:24:05] andrewbogott: ok, virt11 has some freed up space [07:24:21] I think it's necessary to reboot all virt11 instances [07:28:36] Is there a salt wildcard to match virt11? [07:28:47] Or should I just do 'em by hand? [07:29:11] you need to reboot them via openstack [07:29:16] 'k [07:29:30] which unfortunately isn't super easy [07:30:23] one thing I do is get a list of instances from the database [07:30:46] well, both instance and project [07:31:05] 'nova list' can't do that? [07:31:12] nope. nova list is per tenant [07:33:41] ok. I sent outage reports to ops list and wikitech-l [07:33:58] I gotta run [07:34:44] 197 instances? That doesn't seem right... [07:34:51] Oh, I bet I'm getting deleted ones [07:37:01] Ah, 47, much better. [07:47:46] Ryan_Lane: um… 'nova --project bastion reboot bastion2' isn't working on virt11… would you expect it to? 
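
Because `nova list` only shows one tenant at a time, "get a list of instances from the database" plus a per-instance reboot is the workaround being described here; below is a rough sketch, run on the controller (virt0). The table and column names are from a Folsom/Grizzly-era Nova schema and may not match exactly, and credential handling is whatever your site does. As the next lines confirm, the reboot has to go through the API on the controller, not on the compute host itself.

    # Rough sketch: list (project, instance) pairs for everything scheduled on
    # virt11, then reboot each one through the API. Schema and credential
    # details here are assumptions.
    mysql -u nova -p nova -e "
        SELECT project_id, hostname
          FROM instances
         WHERE host = 'virt11' AND deleted = 0;"

    # with the usual OS_* credentials already in the environment:
    OS_TENANT_NAME=bastion nova reboot bastion2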
[08:13:31] !log bastion rebooted bastion2 due to virt11 issues [08:13:33] Logged the message, dummy [08:16:04] !log search rebooted solr-wlm2 due to virt11 storage failure [08:16:05] Logged the message, dummy [08:16:50] !log orgcharts rebooted orgchart [08:16:51] Logged the message, dummy [08:18:29] !log dns rebooted new-ns0 [08:18:30] Logged the message, dummy [08:18:34] !log nagios rebooted nagios-dev [08:18:36] Logged the message, dummy [08:20:14] !log nginx rebooted nginx-devunt [08:20:15] Logged the message, dummy [08:21:19] you have to run it on virt0 [08:22:24] Yep, I'm well into the tedious part now :) [08:23:00] !log account-creation-assistance rebooted accounts-puppetmaster [08:23:01] Logged the message, dummy [08:26:34] !log opensim rebooted opensim-grid1 [08:26:35] Logged the message, dummy [08:32:25] !log tools rebooted tools-db [08:32:27] Logged the message, dummy [08:42:26] !log metavidwiki rebooted metavidwiki [08:42:28] Logged the message, dummy [08:43:34] !log mobile rebooted mobile-varnish [08:43:35] Logged the message, dummy [08:48:02] !log math rebooted latexml-test [08:48:04] Logged the message, dummy [08:48:57] !log deployment-prep rebooted deployment-cache-text1 [08:48:58] Logged the message, dummy [08:52:27] !log snuggle rebooted snuggle-large [08:52:29] Logged the message, dummy [08:53:27] !log sartoris rebooted sartoris-target4 [08:53:28] Logged the message, dummy [08:54:53] !log tools rebooted tools-exec-09 [08:54:54] Logged the message, dummy [08:57:39] !log snuggle rebooted snuggle-redis [08:57:41] Logged the message, dummy [08:57:46] !log scrumbugz rebooted scrumbugz [08:57:48] Logged the message, dummy [08:58:51] sheesh [09:31:55] :D [12:41:35] @notify Coren [12:41:35] This user is now online in #wikimedia-dev. I'll let you know when they show some activity (talk, etc.) [12:42:16] labs: a/s/l? [12:42:22] eh, i mean, how's the migration status:) [13:21:21] mutante: migration delayed by mhoover having been too sick to work last week. [13:21:25] But some bits are coming together. [13:22:50] andrewbogott: cool! thx. i was asking because there is a change by Gage pending review [13:22:55] that is about monitoring virt0 [13:22:55] good morning andrew [13:22:59] and adding NRPE etc [13:23:13] and i though, how much sense does it make on virt0, vs the replacement right now [13:23:16] thought [13:24:06] but on the other hand, he reacted to outage due to disk space issue [13:24:58] andrewbogott: i mean, of course not cool that mhoover is too sick to work :? [13:27:11] !log tools tools-login: rm -f /var/log/exim4/paniclog (OOM) [13:27:12] Logged the message, Master [13:28:25] andrewbogott: https://gerrit.wikimedia.org/r/#/c/107424/ [13:35:37] mutante, what does installing nrpe on virt0 do? [13:35:48] Does it monitor virt0, or does it turn virt0 into a monitoring host? 
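
On the NRPE question: as mutante explains just below, nagios-nrpe-server is only the agent side -- it lets Icinga monitor virt0, it does not turn virt0 into a monitoring host. The daemon runs a fixed set of local checks (check_disk, check_dpkg, ...) on request, and with dont_blame_nrpe=0 it refuses remotely supplied arguments. A hedged sketch of verifying that on a Debian/Ubuntu host (package names and plugin paths may vary):

    # Hedged sketch: install the agent and confirm it only exposes hardcoded checks.
    sudo apt-get install nagios-nrpe-server nagios-nrpe-plugin nagios-plugins-basic

    grep '^command\[' /etc/nagios/nrpe.cfg        # the fixed check commands
    grep dont_blame_nrpe /etc/nagios/nrpe.cfg     # 0 = remote arguments disabled

    # run a check directly, then the same check through the daemon:
    /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -l
    /usr/lib/nagios/plugins/check_nrpe -H localhost -c check_disk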
[13:36:49] !log tools After reboot of tools-exec-09, all continuous jobs were successfully restarted ("Rr"); task jobs (1974113, 2188472) failed ("19 : before writing exit_status") [13:36:50] Logged the message, Master [13:38:26] andrewbogott: just get the nagios nrpe daemon installed there [13:38:32] I am not sure whether that includes ferm rules though [13:38:37] andrewbogott: install /etc/init.d/nagios-nrpe-server [13:39:03] andrewbogott: a service that listens and accepts commands from icinga to then execute checks locally [13:39:14] like check_disk , check_dpkg etc [13:39:19] there is a ferm::rule statement, but apparently that does not install ferm itself nor enable the ferm service [13:39:26] that can't be checked from remote, though they are in base module monitoring [13:39:32] so expected on everything pretty much [13:39:46] or there is no disk space monitoring [13:40:00] but also what hashar said, if the IP is public it needs ferm [13:40:17] though it's not as dangerous as executing random commands [13:40:30] we just hardcode the check commands and don't pass args over the net [13:40:42] so what you could do is make a check look OK when it's not .. [13:40:43] mutante: well that would be nice [13:40:57] virt0 most probably have Augeas managed iptables rules [13:41:04] so one would have to whitelist nrpe port [13:41:19] hashar: it should be that way, that's why we dont make generic NRPE commands [13:41:28] but have a separate for each process check etc [13:42:01] there is also http://blog.medin.name/blog/2012/12/02/securing-nrpe-with-certificate-based-authentication/ [13:42:54] yeah might be overkill for us though [13:43:03] since the checks are over a "private" network [13:43:57] ok, I'm a bit lost, but… sounds like that patch isn't ready to merge… would one of you add comments in gerrit? [13:44:25] * hashar ducks [13:44:52] heh, i already pasted some [13:45:01] copy/paste from IRC earlier [13:45:36] hashar: dont_blame_nrpe=0 [13:45:38] :) [13:45:44] it's a real option [13:45:50] # Values: 0=do not allow arguments, 1=allow command arguments [13:45:54] # *** ENABLING THIS OPTION IS A SECURITY RISK! *** [13:45:59] I'm looking into using ferm for NAT on Tools instances; what's the best way to change that default policy? [13:46:49] andrewbogott: don't worry then, just for the new server [13:47:06] andrewbogott: only drawback is not having disk_space monitoring on virt0 [13:47:14] but it didnt exist before [13:47:44] or it makes gage dive into ferm, which is also good [14:53:06] i'm confused about jsub [14:53:13] i'm submitting a job once per hour [14:53:28] but the last time it was executed as at 3:17am (UTC). [14:53:40] DanielK_WMDE_: What does qstat say? [14:53:50] nothing [14:54:28] it's in local-potd-feed's crontab. can you see that? [14:54:53] Moment. [14:55:44] if i submit the job manually, it seems to work [14:55:47] * DanielK_WMDE_ tries again [14:56:47] yea, works nicely, within less than a minute [14:58:18] /var/lib/gridengine/default/common/accounting shows three jobs having run yesterday afternoon and one this morning (3:17Z). [14:58:44] The crontab only is run at 3:17?! [14:58:55] "17 3 * * * ..." [14:59:16] For hourly, that should be "17 * * * * ..." [15:04:30] scfc_de: uh... o_O [15:04:51] :-) [15:04:52] scfc_de: who put that 3 there?! I swear it wasn't me ;) [15:05:04] scfc_de`: sorry for bothering you, i'll now go and hide in shame... [15:05:22] I've looked at the crontab first and I didn't notice it either :-). 
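
The one-character fix from that exchange, spelled out as crontab entries (the jsub invocation and script path are made-up placeholders, not copied from the tool):

    # runs hourly, at minute 17 of every hour:
    17 * * * *  jsub /data/project/potd-feed/bin/update-feed.sh
    # what was actually in the crontab -- once a day, at 03:17:
    # 17 3 * * *  jsub /data/project/potd-feed/bin/update-feed.sh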
[15:48:29] hm, the toolserver migration guide sais i should put the rewrite rules into the .htaccess in my home. really, in my home, not my ~/public_html? [15:49:04] (i guess this is a question for the toolserver channel...) [21:14:16] !log deployment-prep finished updating to elasticsearch 0.90.10 [21:14:17] Logged the message, Master [21:14:24] question for you all: how can i detect that a user of a labs instance is using shared storage for $HOME [21:23:18] drdee: Never tried, but in a shell, either parse "df $HOME", or traverse up $HOME, check if "mountpoint -q" succeeds and then look that up via mount? [21:23:37] ty, let me try that [21:25:25] scfc_de: output is: [21:25:27] Filesystem 1K-blocks Used Available Use% Mounted on [21:25:28] projectstorage.pmtpa.wmnet:/analytics-home 52428800 6831616 45597184 14% /home [21:25:36] so that means it's shared? [21:26:33] I *think* all devices with ":" should be remote; how rock solid does your test need to be? [21:26:53] 100% :) [21:28:44] http://unix.stackexchange.com/questions/72223/check-if-folder-is-a-mounted-remote-filesystem uses "df" as well, but only three upvotes. [22:37:21] So, I'm trying to puppetize the instances in my labs project and have no idea what I'm doing. [22:37:22] I [22:37:51] I've found quite a bit of docs on how production uses puppet, but no puppet for dummies on labs [22:42:08] I've heard about "selfhosting", but cannot find help on it [22:45:33] oh, found it https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [22:58:31] drdee: A device name with a ":" in it is guaranteed to be remote, afaik, but the absence of a ":" doesn't guarantee the converse (because bind mounts, etc) [23:13:03] Coren: ty [23:56:19] Coren: ping
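
On detecting a shared $HOME: the thread settles on "a device containing ':' is definitely remote, but the absence of ':' proves nothing". A slightly more direct check is to ask for the filesystem type of whatever $HOME is mounted on; a rough sketch, where the type names (nfs, gluster) are the usual ones for Labs-era shared homes rather than anything guaranteed:

    # Rough sketch: classify $HOME by filesystem type instead of device name.
    fstype=$(df -PT "$HOME" | awk 'NR==2 {print $2}')
    case "$fstype" in
        nfs*|*gluster*) echo "shared \$HOME ($fstype)" ;;
        *)              echo "probably local \$HOME ($fstype)" ;;
    esac

Like the ':' heuristic, this is still not bulletproof -- unusual or FUSE-wrapped filesystems can show up under generic type names -- which is in the same spirit as Coren's caveat about bind mounts.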