[00:00:43] Hmm... looks like CSP can handle paths. So yeah, no real advantage to separate domain from that perspective.
[00:00:59] csteipp: as for CSP, that's a good point. While CSP domain whitelisting itself is already a minor edge case (for the event when you're already breached), in this case it would not provide protection since tools would have to whitelist tools.wmflabs.org itself, where the same authors have access to subdirectories,
[00:01:05] Oh even better, I didn't know that
[00:01:59] alright
[00:02:02] so… things?
[00:02:02] err
[00:02:04] I meant
[00:02:08] tools-static.wmflabs.org/cdnjs?
[00:03:29] WFM
[00:03:59] csteipp: Krinkle thank you very much :)
[00:04:00] * YuviPanda merges
[00:09:07] Krinkle: is provisioning now :)
[00:09:17] Krinkle: I wonder if there’s an easy way to generate a cdnjs.org type page
[00:26:56] YuviPanda: provisioning how?
[00:27:08] YuviPanda: I don't know. Currently migrating some cvn things :/
[00:27:12] Krinkle: ah ok :)
[00:27:38] Krinkle: https://gerrit.wikimedia.org/r/#/c/205788/
[00:39:40] YuviPanda: Hm... size=>100%FREE
[00:39:45] How does that work :D
[00:39:49] Krinkle: lvm magic
[00:40:07] you can do XX%FREE
[00:40:15] of what
[00:40:20] of unallocated space
[00:40:25] by default only 20G is allocated
[00:40:35] and depending on the size of the VM you picked you can allocate rest as you see fit
[00:41:35] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/labslvm.pp
[00:41:39] Ah, it says there too
[00:42:04] (100%FREE being the most useful value)
[00:42:11] yeah
[00:42:17] you can’t really do 80% Free and 20% Free
[00:42:23] since the first 80% free changes the other 20% free
[00:42:24] :D
[00:42:25] you can actually
[00:42:26] hmm
[00:42:29] ?
[00:42:32] so 80% Free and then 100% Free
[00:42:32] I would imagine so
[00:42:34] No, what you can do is 80%FREE then 100%FREE
[00:42:34] will give you two volumes
[00:42:39] with 80 / 20
[00:42:46] so it depends on the order?
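A minimal sketch of the XX%FREE arithmetic described in the messages above, assuming a hypothetical 100G disk of which 20G is allocated by default. The disk size and the function name `allocate_percent_free` are illustrative and not taken from the log or from the labslvm.pp manifest; the only behaviour modelled is that each percentage applies to whatever is still unallocated at that moment.

```python
def allocate_percent_free(free_gb, percents):
    """Carve volumes one after another; each takes a percentage of what is still unallocated."""
    sizes = []
    for pct in percents:
        size = free_gb * pct / 100
        sizes.append(size)
        free_gb -= size
    return sizes, free_gb


if __name__ == "__main__":
    # Hypothetical VM: 100G disk, 20G allocated by default, so 80G unallocated.
    print(allocate_percent_free(80, [80, 100]))  # ([64.0, 16.0], 0.0) -> an 80 / 20 split
    print(allocate_percent_free(80, [100, 80]))  # ([80.0, 0.0], 0.0) -> the second volume gets nothing
```

Under these assumptions the order does matter: 80%FREE followed by 100%FREE yields the 80 / 20 split, while the reverse order leaves nothing for the second volume.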
[00:42:57] But after the disk is created and data exists and puppet runs
[00:43:17] ah, not free space, but free unallocated space
[00:43:21] you can do ordering via requires and other puppet mechanisms
[00:43:22] yep
[00:44:47] YuviPanda: So what happens if you create another 100%FREE mount
[00:44:50] that would be 0 bytes
[00:45:12] The resource would fail
[00:45:16] good
[00:45:17] Is quarry running?
[00:45:19] So you'd get a puppet error.
[00:45:27] Coren: And it supports a fixed value as well?
[00:45:46] Krinkle: Yes, provided there actually is at least that amount unallocated.
[00:45:51] Yeah
[00:45:54] and otherwise puppet error
[00:45:56] (as it should)
[00:46:01] Dispenser: hey!
[00:46:04] Dispenser: yup, seems to be...
[00:46:18] hmm, http://quarry.wmflabs.org/query/3291 just completed
[00:46:20] a minute ago
[00:46:22] so I guess it is?
[00:46:34] http://quarry.wmflabs.org/query/897 Hasn't worked for the last hour
[00:46:50] hit run again?
[00:46:59] (sometimes the query runner gets stuck)
[00:47:06] how much time does it usually take to run?
[00:47:11] tried that a few times
[00:47:16] YuviPanda: No indexing on .git I assume?
[00:47:24] Krinkle: ? for cdnjs?
[00:47:27] Should be long, 1743 pages in the category
[00:47:33] Shouldn't*
[00:47:34] YuviPanda: nginx dir index I mean
[00:47:38] autoindexing
[00:47:46] Not sure if that's excluded by anything, just FYI
[00:47:54] Krinkle: I’m still messing around with the nginx config...
[00:48:08] Krinkle: the way I do the other stuff means that just setting up a simple prefix isn’t as simple as I’d like it to...
[00:48:46] YuviPanda: Also can we get support so firefox's awesome bar can find my queries again?
[00:48:59] <YuviPanda> Dispenser: oh, yeah… been meaning to implement that...
[00:49:01] <YuviPanda> whoops.
[00:49:17] <YuviPanda> Dispenser: I’m going to log in and look at what your query’s doing. moment
[00:49:39] <Dispenser> Query status: running [897]
[00:50:06] <Dispenser> "Arcade_games"
[00:51:11] <YuviPanda> wow I’ve totally forgotten where things log to
[00:53:12] <YuviPanda> Dispenser: > OperationalError: duplicate column name: page_id
[00:53:13] <YuviPanda> hmm
[00:53:48] <YuviPanda> Dispenser: I think that’s from your query, for some reason
[00:54:19] <Dispenser> SELECT page.page_id, page.page_title, rd.page_id
[00:54:39] <Dispenser> AS "fuck1", AS "fuck2"
[00:55:43] <YuviPanda> does that work?
[00:55:52] <Dispenser> Yes
[00:55:54] <YuviPanda> Dispenser: haha
[00:55:58] <YuviPanda> Dispenser: I’m putting a fix in place now
[00:57:06] <YuviPanda> Dispenser: deploying now.
[00:59:52] <YuviPanda> Dispenser: can you give me the query that was hanging forever?
[00:59:54] * YuviPanda would like to test
[01:00:47] <Dispenser> http://quarry.wmflabs.org/query/897
[01:01:45] <Dispenser> Maybe you should just code an error message
[01:01:57] <YuviPanda> Dispenser: yeah but the query as it stands works...
[01:02:12] <YuviPanda> Dispenser: I just deployed a ‘fix’ that would show you the error message instead of just giving you ‘running’ forever
[01:02:16] <YuviPanda> do you have the query that was stuck in running?
[01:02:48] <Dispenser> Let me test
[01:03:24] <Dispenser> It's looking like it hangs
[01:03:58] <YuviPanda> Dispenser: ugh you’re right.
[01:04:06] <YuviPanda> Dispenser: I added <title>s tho
[01:04:11] * YuviPanda digs into why they’re hanging
[01:05:02] * Dispenser wishes <title> was the user's given query name
[01:05:10] <YuviPanda> Dispenser: it is...
[01:05:23] <YuviPanda> should be, at least.
[01:05:38] <YuviPanda> Dispenser: oh lol, stupid error
[01:05:38] * YuviPanda fixes
[01:07:48] <Dispenser> While you're around... http://quarry.wmflabs.org/query/899 had an issue where the newer runs were killed and it was still serving the old results. Maybe add a "completed on YYYY-MM-DD"
[02:52:44] <Negative24> YuviPanda: just a heads up, I would like to investigate T96484 tomorrow. Ideally with your help
[02:54:39] <Krinkle> !log cvn https://cvn.wmflabs.org now points to cvn-apache8 (UbuntuTrusty; Apache 2.4). The old cvn-apache5 (UbuntuPrecise; Apache 2.2) will be deleted shortly after archiving web logs to NFS.
[02:54:43] <labs-morebots> Logged the message, Master
[03:00:12] <Krinkle> !log cvn Archived cvn-apache5 access logs to /data/project/cvn-common/backup/cvn-apache5-accesslogs.tar.gz
[03:00:16] <labs-morebots> Logged the message, Master
[03:51:49] <Krinkle> YuviPanda: In case you didn't know, the site is here https://github.com/cdnjs/new-website
[03:51:56] <Krinkle> Not sure how configurable it is
[03:52:02] <YuviPanda> Krinkle: it seems a bit… complicated
[03:52:13] <YuviPanda> Krinkle: however, https://github.com/cdnjs/website :)
[04:19:23] <YuviPanda> Krinkle: new-website requires mongodb...
[04:19:30] <YuviPanda> old one is a static website.
[04:19:44] <YuviPanda> Krinkle: I wonder if I should just build my own :P
[04:20:06] <YuviPanda> Krinkle: think I’ll mirror https://github.com/googlefonts too
[04:20:12] <YuviPanda> that should cover most of the things people want to use CDNs for
[04:20:24] <YuviPanda> err https://github.com/google/fonts
[04:22:16] <YuviPanda> Krinkle: boo, http://brick.im/ is more perfect but hasn’t been updated in ages...
[04:23:13] <legoktm> inb4 archive.org.wmflabs.org
[04:24:05] <YuviPanda> :P
[04:33:54] <bd808> legoktm: you think that's a joke, but I've heard rumors...
[04:35:41] <YuviPanda> Krinkle: I’m just going to write my own, it looks like
[04:36:14] <Krinkle> YuviPanda: You might be able to re-use static's browser
[04:36:21] <YuviPanda> Krinkle: oh, hmm.
[04:36:51] <Krinkle> https://tools.wmflabs.org/static-browser/
[04:36:58] <Krinkle> The current index doesn't scale
[04:37:06] <Krinkle> but watch what happens when you click one
[04:37:10] <Krinkle> a src file
[04:37:17] <Krinkle> cancel/view/download/copy
[04:37:20] <Krinkle> that seems useful
[04:37:30] <YuviPanda> Krinkle: so my plan is to go the cdnjs/website way
[04:37:41] <Krinkle> And whatever dir scan logic you create static-browser could use as well
[04:38:01] <Krinkle> YuviPanda: How does that work?
[04:38:10] <YuviPanda> Krinkle: static html generation, basically
[04:38:22] <YuviPanda> you do whatever, dump one big html out.
[04:38:26] <YuviPanda> well, one big set of HTML things out :)
[04:38:40] <YuviPanda> although I might go my old way, which is to dump a large number of consistently named JSON things out
[04:38:43] <YuviPanda> and then use AJAX
[04:38:48] <YuviPanda> to request appropriate things
[04:39:00] <YuviPanda> so I’d dump a small index of just names that I’ll load, and then can load additional ones on request.
[04:39:08] <YuviPanda> so it’s a SPA + has an ‘API’+ :)
[04:39:14] <YuviPanda> and scales incredibly well :)
[04:39:29] <YuviPanda> I think you’ve to hit about a million files on one dir before you run into problems...
[04:39:35] <YuviPanda> and you can always split up the dirs
[04:40:17] <Krinkle> SPA?
[04:40:38] <YuviPanda> single page application
[04:40:42] <Krinkle> Right
[04:40:44] <YuviPanda> (not sock puppet account)
[04:41:08] <harej> or single purpose account
[04:41:28] <bd808> or a hot tub with bubbles
[04:45:07] <harej> or my alma mater, the School of Public Affairs
[04:51:18] <YuviPanda> Krinkle: hmm, definitely more complicated than I thought...
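The "small index of names plus consistently named JSON files" idea above can be sketched roughly as follows. This is only an illustration under assumptions: the `<root>/<library>/<version>/<file>` layout, the output file names, and the functions `build_index` and `demo` are hypothetical and are not how the tool discussed in the log was actually written.

```python
import json
import os
import tempfile


def build_index(root, out):
    """Write out/index.json (library names only) plus out/<name>.json per library."""
    os.makedirs(out, exist_ok=True)
    names = sorted(d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d)))
    for name in names:
        lib_dir = os.path.join(root, name)
        versions = {}
        for version in sorted(os.listdir(lib_dir)):
            version_dir = os.path.join(lib_dir, version)
            if os.path.isdir(version_dir):
                versions[version] = sorted(os.listdir(version_dir))
        # One consistently named detail file per library, fetched on request.
        with open(os.path.join(out, name + ".json"), "w") as fh:
            json.dump({"name": name, "versions": versions}, fh)
    # The small index that a single-page front end would load first.
    with open(os.path.join(out, "index.json"), "w") as fh:
        json.dump(names, fh)
    return names


def demo():
    """Build an index for a tiny made-up tree; return (index, detail for jquery)."""
    with tempfile.TemporaryDirectory() as tmp:
        root, out = os.path.join(tmp, "cdnjs"), os.path.join(tmp, "site")
        os.makedirs(os.path.join(root, "jquery", "2.1.3"))
        open(os.path.join(root, "jquery", "2.1.3", "jquery.min.js"), "w").close()
        build_index(root, out)
        with open(os.path.join(out, "index.json")) as fh:
            index = json.load(fh)
        with open(os.path.join(out, "jquery.json")) as fh:
            detail = json.load(fh)
    return index, detail


if __name__ == "__main__":
    print(demo())
```

A front end would then fetch `index.json` once and request `<name>.json` only when a library is opened, so the static files double as a read-only API.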
[04:59:53] <wikibugs> Labs, operations: Shinken down - https://phabricator.wikimedia.org/T96817#1226829 (yuvipanda) NEW
[05:01:10] <wikibugs> Labs, operations: Shinken down - https://phabricator.wikimedia.org/T96817#1226836 (yuvipanda) There seem to be two entries for these hosts on ldap? @Andrew is this possibly due to the stuff you've been doing about new virt* hosts?
[05:07:21] <wikibugs> Labs, operations, Patch-For-Review: Shinken down - https://phabricator.wikimedia.org/T96817#1226852 (yuvipanda) ^ is a temporary fix only, however.
[05:12:07] <shinken-wm> PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[06:07:14] <shinken-wm> PROBLEM - Host tools-bastion-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.44)
[06:08:05] <YuviPanda> I know, shinken-wm
[06:08:06] <YuviPanda> it’s ok
[06:09:46] <shinken-wm> RECOVERY - Host tools-bastion-02 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[06:10:13] <YuviPanda> see? it’s all good as new :)
[06:37:36] <shinken-wm> PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[06:53:56] <shinken-wm> PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[07:07:37] <shinken-wm> RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:18:57] <shinken-wm> RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0]
[10:17:27] <wikibugs> Labs, operations, Patch-For-Review, Shinken: Shinken down - https://phabricator.wikimedia.org/T96817#1227179 (Aklapper)
[11:11:14] <Qcoder00> Hi
[11:11:21] <Qcoder00> Doing some planned maintenance?
[11:11:34] <Qcoder00> I am getting very slow access ( from the UK)
[11:36:43] <eranroz> hi, s7.labsdb isn't replicated. it has lag of 3.5 days. the other databases have very small lag. anyone knows why?
[11:41:11] <Qcoder00> SUL finalization affecting ?
[11:42:50] <eranroz> but why only on s7? is there a special wiki there?
[12:12:03] <wikibugs> Labs, operations, Patch-For-Review, Shinken: Shinken down - https://phabricator.wikimedia.org/T96817#1227315 (Andrew) new instances have two ldap entries -- one with the ec2 id, and valid associated domains, one with the fqdn and invalid associated domains. Does shinken search for both dns?
[12:15:09] <wikibugs> Labs, operations, Patch-For-Review, Shinken: Shinken down - https://phabricator.wikimedia.org/T96817#1227316 (Andrew) Ah, yeah, since you're searching by instancename I probably broke things for you :( Those instances are there to test the new ldap host-entry generation and I presumed them to be har...
[12:46:48] <andrewbogott> hashar: ready for me to start the migration of deployment-prep?
[12:46:58] <hashar> andrewbogott: good morning!
[12:47:07] <hashar> andrewbogott: lets do!!
[12:47:21] <Coren> hashar: Speaking of deployment-prep... :-)
[12:47:29] <andrewbogott> !log cvn migrating cvn-app4 and cvn-app5 to labvirt1005 and labvirt1006, respectively. Seems to be working now.
[12:47:31] <labs-morebots> Logged the message, dummy
[12:47:49] <andrewbogott> hashar: ok, first up, the bastion.
[12:48:09] <andrewbogott> !log deployment-prep migrating to new labvirt nodes
[12:48:11] <labs-morebots> Logged the message, dummy
[12:48:24] <hashar> I was over paranoid yesterday I guess
[13:17:04] <Coren> YuviPanda: Do you have a random unrebooted precise instance around I can do a test on?
[13:19:43] <andrewbogott> hashar: ok, deployment-bastion is finished moving. Any ill effects?
[13:20:09] <andrewbogott> deployment-db1 is now in progress, deployment-db2 will start shortly.
[13:21:48] <andrewbogott> hashar, I’m going to go back to sleep for a bit. ping me here, or email, if you have any trouble.
[13:22:18] <andrewbogott> New migrations start once every 20 mins. The complete migration will take 17 hours.
[13:22:48] <Coren> andrewbogott: Speaking of; what is the maximum rate of instance reboot you feel safe about?
[13:23:27] <andrewbogott> Coren: no idea, really. Every 20 seems to be working ok
[13:25:37] <hashar> andrewbogott: seems good
[13:25:44] <hashar> andrewbogott: will poke you by email if something is weird
[13:25:47] <hashar> sleep well!
[14:10:29] <Coren> hashar: I'd really really prefer it if we spent some time today to make sure nothing will happen to deployment-prep when idmap is turned off. If you're too busy to do it, can you delegate someone to work with me for a couple hours to make sure all is good?
[14:12:39] <hashar> Coren: I can poke greg about it
[14:12:49] <hashar> maybe we can file a subtask and mark it as #Blocked-On-Releng
[14:12:59] <hashar> then figure out who knows about nfs / idmap and can sync with you
[14:13:13] <hashar> I remember we had some troubles a year or so ago when we migrated from pmtpa to eqiad
[14:13:20] <hashar> due to the nfs config being slightly different
[14:13:48] <Coren> hashar: Actually, moving the users to LDAP (which we did to fix that) means that for those users at least things are guaranteed to be okay. :-)
[14:19:27] <wikibugs> Labs, Blocked-on-RelEng, Labs-Q4-Sprint-2, Labs-Q4-Sprint-3, and 3 others: Schedule reboot of all Labs Precise instances - https://phabricator.wikimedia.org/T95556#1227614 (coren)
[14:19:53] <andrewbogott> hashar: btw, note that live migration doesn’t update the wiki instance pages, so they’ll still look like they’re on virt10xx hosts until I have a chance to force a refresh there.
[14:20:06] <wikibugs> Labs, Labs-Q4-Sprint-2, Labs-Q4-Sprint-4, Patch-For-Review, ToolLabs-Goals-Q4: Disable idmap entirely on Labs Precise instances - https://phabricator.wikimedia.org/T95555#1227628 (coren)
[14:20:23] <wikibugs> Labs, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, Labs-Q4-Sprint-4, ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1227630 (coren)
[14:22:03] <wikibugs> Labs, Labs-Q4-Sprint-2, Labs-Q4-Sprint-3, Labs-Q4-Sprint-4, ToolLabs-Goals-Q4: Do a rolling restart of Tool Labs precise instances - https://phabricator.wikimedia.org/T95557#1227645 (coren)
[14:35:08] <hashar> andrewbogott: seems the migration is going fine. The shinken monitoring we have complain from time to time but promptly come back with OK status
[14:35:31] <andrewbogott> hashar: great! db1 and db2 are done moving, which seemed the riskiest bits.
[14:35:42] <hashar> Coren: so for nfs idmap, I cant take it this week. But you can poke SF folks later today in #wikimedia-releng or poke greg
[14:36:00] <hashar> Coren: I dont mind helping, but that would be for next week
[14:36:10] <Coren> hashar: KK. I'll poke there then.
[14:37:16] <andrewbogott> Coren, note that you probably won’t be able to reboot an instance in ‘migrating’ state. So best keep an eye out for that if you use a scripted reboot, if you’re unlucky you might miss one
[14:37:41] <Coren> andrewbogott: No, with deployment-prep whatever happens will be very deliberate.
[14:37:51] <Coren> No automation there. :-)
[14:37:53] <andrewbogott> ‘k
[15:31:10] <andrewbogott> YuviPanda: your new autosigner code is already running some places, right?
[15:31:33] <andrewbogott> hm, he won’t be awake for hours
[15:34:19] <Coren> Well, he could be awake if he was strictly PST but he's been vacillating from effective timezone to effective timezone. :-)
[15:43:30] <harej> i'm permanently on west coast time :/
[15:45:26] <shinken-wm> PROBLEM - Host tools-webproxy-jessie is DOWN: CRITICAL - Host Unreachable (10.68.17.147)
[15:45:26] * Coren stares at the webproxy.
[15:46:54] <Coren> Wait, does that host even exist anymore?
[15:51:58] <andrewbogott> It doesn't
[15:52:03] <andrewbogott> I was just asking the same question
[15:52:09] <andrewbogott> No idea why shinken suddenly decided to care.
[15:52:13] * andrewbogott looks in ldap
[15:53:27] <andrewbogott> bah, it’s still in ldap for some reason. I’ll clean it up.
[15:53:37] * andrewbogott looks forward to switching from the old leaky code to his new leaky code
[16:07:51] <bd808> Coren: Will someone be running the reboot-if-idmap via a salt command at some point?
[16:13:08] <Coren> bd808: Probably not by salt because I want to be more deliberate about how many instances can reboot at a time and the rate at which they do, but yeah - it'll be run on all instances eventually.
[16:13:47] <bd808> *nod* I was just wondering how hard I should hunt for instances I may have helped spawn that need the reboot
[16:14:32] <Coren> bd808: The ones which you know where an ill-timed reboot can be problematic are the only ones you need to worry about.
[16:14:52] <bd808> perfect
[16:19:51] <valhallasw`cloud> Coren, is there more technical/background docs on tool labs other than https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Overview and https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Design ?
[16:36:43] <Coren> valhallasw`cloud: Much of it is spread out in places. What bits are you looking for?
[16:37:21] <grrrit-wm> (PS1) Subramanya Sastry: Return a unique list of channels (remove dupes). [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/205881
[16:38:05] <valhallasw`cloud> Coren: currently mainly on the mail system, but also just to get a generic overview of all the things
[16:38:46] <Coren> https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Architecture has good info, if a bit cursory, and so does https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin
[16:38:51] <grrrit-wm> (PS2) Subramanya Sastry: Return a unique list of channels (remove dupes). [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/205881
[16:39:45] <Coren> mail I think is the least documented one. It's mostly a stock exim with the config in puppet, but needs moar docs - open a phab task so I can put it on the board?
[16:39:56] <valhallasw`cloud> *nod*
[16:40:29] <valhallasw`cloud> also I need to figure out how the entire puppet/hiera/etc magic works, but that's probably documented on one of the labs-wide doc pages
[16:40:31] <Coren> The 'web server' documentation is also out of date now after the work by Yuvi.
[16:41:11] <valhallasw`cloud> I'm wondering whether we should really have docs generated from the same repo where all the puppet classes are
[16:42:34] <Coren> valhallasw`cloud: Interesting idea. I don't know how practical it is in practice - puppet describes state, and it's not clear how architectural notes / howtos reasonably fit in there.
[16:44:10] <valhallasw`cloud> in some sense, puppet (plus the host config on wikitech) is the actual architecture, and the docs are the ideas behind it
[16:45:02] <valhallasw`cloud> and, maybe more simply, a change in the actual arch passes through puppet, and can be held back there in code review if it's not accompanied by docs
[16:45:05] <valhallasw`cloud> also I hate mediawiki for docs =p
[16:45:16] <Coren> valhallasw`cloud: No, I can see the sense in it - I'm mostly thinking logistics/practical rather than conceptual.
[16:45:59] <Coren> valhallasw`cloud: We do something analogous with the toollabs repo where man pages are updated alongside the code.
[16:46:52] <Coren> valhallasw`cloud: It's just that the puppet manifest doesn't seem like a good landing place for documentation as a rule - there is no way to surface that cleanly in an easily accessible place (unlike man pages of a package, say)
[16:47:17] <valhallasw`cloud> Coren: no, it would need something that builds it into a web page (e.g. sphinx)
[16:47:28] <valhallasw`cloud> might even build it as manpage as well ;-)
[16:47:57] <valhallasw`cloud> man 5 toollabs
[16:47:57] <Coren> valhallasw`cloud: Sounds like a good free-time project for a volunteer admin. :-P
[16:48:25] <valhallasw`cloud> (sphinx can actually do that)
[16:49:55] <valhallasw`cloud> Coren: also, is there a technical reason our puppet manifests have to be in the operations/puppet repo?
[16:52:33] <Coren> Hm, because submodules suck hard? But mostly because we have no needless duplication (with the attendant desyncs) of classes, resources, etc.
[16:55:51] <valhallasw`cloud> Mhm. Yeah, there would be a risk that local manifests won't be generalized (but that's the case for any 'framework' that people write code for as well)
[16:56:08] <valhallasw`cloud> I thought there was maybe something that wikitech needs everything to be in ops/puppet, or something like that
[16:56:15] <valhallasw`cloud> the entire VM system is still sort of black magic to me
[16:58:27] <valhallasw`cloud> :/
[17:11:16] <wikibugs> Tool-Labs: Reduce amount of tools-local packages - https://phabricator.wikimedia.org/T91874#1228263 (scfc)
[17:15:54] <wikibugs> Tool-Labs: Document email setup - https://phabricator.wikimedia.org/T96884#1228278 (valhallasw) NEW
[17:16:12] <valhallasw`cloud> wait, why is tool-labs a tag
[17:16:18] <wikibugs> Tool-Labs: Reduce amount of tools-local packages - https://phabricator.wikimedia.org/T91874#1228285 (scfc)
[17:16:35] <valhallasw`cloud> chasemp, why did you change tool-labs to a tag?
[17:16:52] <valhallasw`cloud> (and not, say, project or team?)
[17:17:34] <chasemp> either it is a product of the labs cleanup ticket or a mistake I'm not sure
[17:17:40] <chasemp> must have been awhile ago as I don't recall?
[17:17:44] <valhallasw`cloud> https://phabricator.wikimedia.org/project/profile/539/
[17:17:52] <valhallasw`cloud> 16 apr
[17:18:06] <valhallasw`cloud> might just have been attempting to re-set to the original values
[17:18:07] <chasemp> ah I was correcting vandalism from Physicaladdress checked Is Sprint.
[17:18:11] <chasemp> yes
[17:18:32] <valhallasw`cloud> ok, then I'll move it back to project :-)
[17:19:34] <chasemp> it doesn't say "set this projects icon to from x"
[17:19:35] <chasemp> just to
[17:19:38] <chasemp> which is kind of annoying
[17:20:12] <valhallasw`cloud> *nod*. I just looked it up with https://www.mediawiki.org/w/index.php?title=Phabricator/Projects&action=history
[17:22:06] <valhallasw`cloud> ok, thanks for the info
[17:36:34] <bearND> YuviPanda: Are you the right person to ask to grant mholloway read-write access to caesium?
[17:43:14] <YuviPanda> bearND: access request! we have a process for it now :) https://wikitech.wikimedia.org/wiki/Requesting_shell_access
[17:43:20] <YuviPanda> see https://wikitech.wikimedia.org/wiki/Requesting_shell_access#Escalating_Existing_Shell_Access
[17:43:23] <Krenair> isn't that a prod host?
[17:43:29] <Krenair> this is -labs
[17:43:32] <YuviPanda> bearND: I saw your email - been a super hectic two weeks
[17:43:36] <YuviPanda> Krenair: *shrug* is ok
[17:45:16] <bearND> yeah, I'm not sure if this host is part of prod or not
[17:46:51] <YuviPanda> bearND: it is :)
[17:46:57] * YuviPanda groans
[17:47:02] <YuviPanda> hi andrewbogott / Coren
[17:47:19] <Coren> Heya YuviPanda
[17:47:20] <YuviPanda> andrewbogott: I’ve put a fix in place for now for the duplicate ldap entries bug
[17:47:27] <andrewbogott> YuviPanda: thanks
[17:47:29] <bearND> YuviPanda: ok, thanks. I'll ask mholloway to fill this out
[17:47:38] <YuviPanda> I hate this timezone. everyone’s awake way before I am and everyone is asleep way before I am
[17:47:38] <valhallasw`cloud> fhocutt: hey! how's life?
[17:47:44] <andrewbogott> Sorry about the duplicates — I tried to make them harmless but didn’t know about that code.
[17:47:56] <YuviPanda> andrewbogott: looking at your comments on the puppet script now
[17:48:58] <YuviPanda> Coren: and no, no random unrebooted precise instances...
[17:50:10] <Coren> YuviPanda: That's okay, deployment-prep is random enough. :-)
[17:56:46] <YuviPanda> > This looks great. Do you mind adding one more feature? Break out the code that purges certs with no match in ldap and wrap it with 'puppet cert list --all.' That will cleanup certs for deleted instances as well as invalid cert requests.
[17:56:55] <YuviPanda> andrewbogott: ^ I don’t know what you mean by ‘wrap it with’?
[17:57:23] <andrewbogott> YuviPanda: I’ll write it if you’ll test and debug it :)
[17:57:43] <YuviPanda> andrewbogott: ah sure :) but as another patch? :D
[17:58:09] <andrewbogott> 'k
[18:00:15] <fhocutt> valhallasw`cloud, it's interesting!
[18:01:06] <fhocutt> job opportunities, speaking, still settling in after a move
[18:01:08] <fhocutt> yourself?
[18:01:56] <wikibugs> Tool-Labs: Multiple queue runners on tools-mail - https://phabricator.wikimedia.org/T74867#1228380 (valhallasw) Okay, I think I know what's going on. Starting with the log message, ``` mainlog.1:2015-04-21 15:29:08 1Yka6m-0008I1-AH re-exec of exim (/usr/sbin/exim4) with -Mc failed: No such file or directory...
[18:02:24] <valhallasw`cloud> fhocutt: cool! I'm starting to put lots of time in writing my thesis, which (as you probably know ;-)) takes ages to get anywhere
[18:02:35] <valhallasw`cloud> fhocutt: and I've started doing some more stuff for tool labs
[18:03:05] <fhocutt> valhallasw`cloud, good luck with the thesis
[18:03:48] <YuviPanda> hi valhallasw`cloud
[18:03:54] <valhallasw`cloud> hey YuviPanda
[18:05:09] <YuviPanda> andrewbogott: I’m merging my change now :)
[18:05:16] <andrewbogott> ok
[18:05:24] <valhallasw`cloud> fhocutt: thanks. I'll get there, at some point ;-)
[18:07:11] <YuviPanda> andrewbogott: done. thanks for review :D If you wanna mess around with it I’ll happily help debug / test
[18:07:44] <YuviPanda> Coren: hey! If we wanted to, say, switch to jessie before middle of next month, do you think that’s doable?
[18:07:46] <YuviPanda> err
[18:07:50] <YuviPanda> switch NFS to jessie
[18:08:02] <YuviPanda> primarily because metrics collection should start around then
[18:08:06] <YuviPanda> :D
[18:08:15] <Coren> YuviPanda: I'm hoping for sooner than that, tbh.
[18:08:22] <YuviPanda> Coren: hah! :D great
[18:08:30] <YuviPanda> Coren: do you have a time in mind? this month even?
[18:09:12] <Coren> YuviPanda: Pending on how well the last round of idmap cleanup goes, I'm gunning for the 30th.
[18:09:22] <YuviPanda> wonderful
[18:09:44] <Coren> labstore1002 could do it now, really, but I'd rather not switch to it with idmap turned on.
[18:09:54] <YuviPanda> yeah
[18:09:56] <YuviPanda> totally
[18:10:01] <YuviPanda> so idmapd is the only blocker?
[18:10:11] * Coren nods.
[18:10:58] <Coren> And switching to Jessie is pretty much a blocker to snapshots and replication -- it *could* be turned on now, but given the checksum bottleneck on Precise that seems unwise.
[18:11:42] <Coren> We saw how little it takes for things to degrade into runaway failure.
[18:11:58] <YuviPanda> right
[18:12:20] <Coren> And snapshots introduce a bit of write amplification -- not much, but I don't want to chance it now.
[18:12:27] * valhallasw`cloud is confused. How does tools-mail even include the Exim4 manifest...
[18:12:49] <Coren> valhallasw`cloud: wikitech, IIRC
[18:13:10] <valhallasw`cloud> Coren: it includes exim::simple-mail-sender (probably because of the MTA role), but I can't find that manifest anywhere
[18:13:38] <Coren> Wait, you can't find the inclusion or the class itself?
[18:13:58] <valhallasw`cloud> exim::simple-mail-sender is listed on https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000000d1.eqiad.wmflabs
[18:14:08] <valhallasw`cloud> but not under 'configure'
[18:14:12] <valhallasw`cloud> and I can't find it in the puppet repo
[18:15:10] <valhallasw`cloud> ...because it was renamed to ::sender, but wikitech didn't get the memo? https://gerrit.wikimedia.org/r/#/c/138547/
[18:15:21] <YuviPanda> lol
[18:15:23] <Coren> Oh, eww. That sometimes happens when classes are moved around/renamed. Existing instances end up with old crap.
[18:15:25] <YuviPanda> so that’s been failing since forever?
[18:16:29] <valhallasw`cloud> I don't know? it's still getting Exim4 included somehow. Can I get puppet to dump configuration and figure it out from there?
[18:17:04] <YuviPanda> I think you can. moment
[18:17:05] <Coren> Part of the long list of "fundamental issues with wikitech"; the classes made available to instances are distinct from, and separate from what is configured on the instances, and distinct from and separate from what actually exists in puppet -- and there is no mechanism to sync them up.
[18:17:13] <YuviPanda> valhallasw`cloud: just run puppet agent -tv and see if it fails?
[18:17:58] <valhallasw`cloud> yeah, just works
[18:18:08] <valhallasw`cloud> but depending on the run it installs exim4-light or exim4-heavy :P
[18:18:11] <YuviPanda> valhallasw`cloud: hmm, look at /var/lib/puppet/client_data/catalog
[18:18:34] <valhallasw`cloud> which means that half the time tools-exim is running -light, so we probably can get away with just -light as well?
[18:18:37] <valhallasw`cloud> tools-mail*
[18:19:03] <YuviPanda> haha :)
[18:19:11] <YuviPanda> I suspect a fair bit of config might be unpuppetized
[18:20:04] <Coren> YuviPanda: Actually, no. tools-mail has no unpuppetized config that I know of.
[18:20:22] <YuviPanda> that we know of :D
[18:20:57] <Coren> YuviPanda: With 99% certainty in that case.
[18:21:10] <YuviPanda> that’s cool :D \o/
[18:21:11] <Coren> toollabs::mailrelay <-- is the class you want to look at, valhallasw`cloud
[18:21:28] <YuviPanda> valhallasw`cloud: thoughts on just, uh, recreating it as trusty? There was a ticket to move it anyway
[18:21:40] <valhallasw`cloud> Coren: yeah, I found that one, but it doesn't include Exim4 as far as I can see?
[18:21:46] <valhallasw`cloud> Exim4 as in the Exim4 manifest
[18:21:52] <valhallasw`cloud> i.e. https://github.com/wikimedia/operations-puppet/blob/21c72942dd7bf25dbe0759d2f867082e966bfb45/modules/exim4/manifests/init.pp
[18:21:55] <Coren> package{ 'exim4-daemon-heavy':
[18:21:55] <Coren> ensure => present
[18:21:56] <Coren> }
[18:22:44] <valhallasw`cloud> Coren: *nod*, but exim4 requires -light, which is causing them to be (un)installed continuously
[18:22:59] <Coren> Hm. At the time, -light didn't work right.
[18:23:19] <valhallasw`cloud> well, about half the time, we have -light running if I can believe puppet
[18:23:36] <Coren> I suspect the problem is exactly that 'standard' includes -light and the override in toollabs::mailrelay works only 50% of the time because order issues.
[18:24:02] <YuviPanda> you can make standard not include exim now with a param
[18:24:17] <Coren> valhallasw`cloud: You probably can and, iirc, the effect is that some of the outgoing mail would only be going out half the time. IIRC, the principal issue with -light is that it doesn't speak correctly to some spf servers.
[18:25:02] <Coren> But that was ages ago, and I'm not sure I remember why -light didn't work. For all I know, it works fine _now_
[18:25:04] <valhallasw`cloud> which 'standard'?
[18:25:24] <YuviPanda> valhallasw`cloud: there’s a puppet class called standard
[18:25:35] <Coren> valhallasw`cloud: It's probably worth trying to remove the entire -heavy stanzas and see if all is fine.
[18:26:29] <valhallasw`cloud> YuviPanda: worst manifest name ever
[18:26:34] <YuviPanda> :D
[18:26:42] <Coren> valhallasw`cloud: Nope.
[18:26:51] <valhallasw`cloud> grep standard => hundreds of comments
[18:26:53] <Coren> valhallasw`cloud: I've seen a puppet manifest with a class named 'manifest'
[18:26:57] <valhallasw`cloud> ...:(
[18:27:12] <valhallasw`cloud> and of course it's in site.pp :{
[18:28:11] <valhallasw`cloud> okay, and apparently that's hiera config
[18:28:44] <valhallasw`cloud> but I think our hiera config can only be specified on the level of all tools?
[18:29:06] <YuviPanda> valhallasw`cloud: you can do it per host in ops/puppet [18:29:32] <YuviPanda> valhallasw`cloud: look inside hieradata/labs/deployment-prep/ [18:29:41] <valhallasw`cloud> oh, there's some complicated order in what gets included? [18:29:52] <YuviPanda> kindof but not really [18:30:18] <YuviPanda> valhallasw`cloud: look at hieradata/labs/deployment-prep/host/deployment-mx.yaml [18:30:23] <valhallasw`cloud> gazillion locations of config, yay. [18:30:23] <YuviPanda> you need the same for tools-mail [18:30:40] <valhallasw`cloud> yeah, and how does one specify that? what's 'deployment-mx', a host name? a class? .... [18:30:43] <Coren> valhallasw`cloud: Puppet's ordering of parsing and inclusion is trivially simple: "effectively random" [18:30:53] <valhallasw`cloud> :D [18:31:06] <valhallasw`cloud> 'juts make sure nothing disagrees with eachother' [18:31:19] <Coren> valhallasw`cloud: The good rule of thumb with puppet is "if you depend on ordering, you fail" [18:31:25] <YuviPanda> valhallasw`cloud: hostname yeah [18:31:39] <andrewbogott> YuviPanda: https://gerrit.wikimedia.org/r/#/c/205897/ [18:31:54] <YuviPanda> andrewbogott: looking [18:31:54] <andrewbogott> I’m off for a bit — may be more than a bit if headache persists. [18:32:07] <valhallasw`cloud> YuviPanda: just tools-email? [18:32:19] <YuviPanda> valhallasw`cloud: tools-mail is the hostname no? [18:32:38] <valhallasw`cloud> yeah, but that was not what I was trying to figure out =p [18:33:23] * YuviPanda is confused? [18:33:26] <valhallasw`cloud> also in some places it's host/x and in some others hosts/x? [18:33:45] <valhallasw`cloud> whether it should be tools-mail, i-whatever, tools-mail.eqiad.wmflabs, etc [18:34:27] <YuviPanda> ah [18:34:29] <YuviPanda> tools-mail yes [18:34:50] <valhallasw`cloud> or it can be on role level, which is probably better [18:36:40] <Coren> role is very much better [18:37:02] <valhallasw`cloud> and there is no hiera config file on tools-mail.... 
[18:37:41] <YuviPanda> make one? [18:37:58] <YuviPanda> all these are documented somewhere, let me find link [18:38:22] <valhallasw`cloud> ....wut? [18:38:32] <valhallasw`cloud> yes, https://wikitech.wikimedia.org/wiki/Puppet_Hiera [18:39:18] <YuviPanda> whoops, doesn’t document the per host thing [18:39:20] * YuviPanda does [18:39:50] <valhallasw`cloud> it does partially, but it doesn't tell me how the data actually reaches the server [18:40:11] <valhallasw`cloud> and that's probably because it's merged at the puppet master or something? [18:40:45] <YuviPanda> done [18:40:49] <YuviPanda> yeah [18:41:11] <YuviPanda> let me find that [18:41:19] <valhallasw`cloud> or, even better, I can set exim4::variant = 'heavy' [18:41:46] <YuviPanda> valhallasw`cloud: https://github.com/wikimedia/operations-puppet/blob/production/modules/puppetmaster/files/labs.hiera.yaml hierarchy [18:42:01] * YuviPanda is in 3 conversations atm, ugh [18:42:07] <valhallasw`cloud> YuviPanda: thanks [18:43:21] <YuviPanda> valhallasw`cloud: so it does mwyaml (wikitech) first, and if it isn’t found there keeps going down ‘hierarchy' [18:43:56] <valhallasw`cloud> YuviPanda: right, so I need to make sure role is also processed somehow [18:44:21] <valhallasw`cloud> YuviPanda: also, why is 'labs' above private/instancename? 
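Aside on the lookup order discussed above: a minimal Python sketch of a first-match walk down a hierarchy, as YuviPanda describes it (wikitech first, then further down). The level names and their contents are hypothetical, not the real contents of labs.hiera.yaml; only the key exim4::variant comes from the conversation.

```python
# Sketch of a first-match hierarchy lookup. Levels are listed from highest
# priority to lowest; the names and data below are made up for illustration.
def hiera_lookup(key, hierarchy, default=None):
    """Return (value, source) from the first level that defines key."""
    for source, data in hierarchy:
        if key in data:
            return data[key], source
    return default, None


hierarchy = [
    ("wikitech", {}),                                   # checked first
    ("host/tools-mail", {"exim4::variant": "heavy"}),   # per-host override
    ("labs", {"exim4::variant": "light"}),              # generic fallback
]

value, source = hiera_lookup("exim4::variant", hierarchy)
print(value, source)
```

Because the first match wins, a generic level placed above a more specific one would shadow it, which is the ordering concern raised about 'labs' sitting above private/instancename.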
[18:44:42] <YuviPanda> valhallasw`cloud: labs.yaml above labs-private [18:44:47] <YuviPanda> so public overrides private [18:44:50] <YuviPanda> that seems wrong yeah [18:45:12] <valhallasw`cloud> YuviPanda: more importantly, labs is generic while private/instancename is specific [18:45:19] <YuviPanda> true [18:45:22] <YuviPanda> that needs to be fixed [18:45:29] <YuviPanda> it’s ok for now because well labs has no real private repo :D [18:45:34] <valhallasw`cloud> and private should probably override public on the same level, yes [18:45:34] <YuviPanda> and so there’s nothing there [18:48:55] <valhallasw`cloud> oh, but the role backend is a mess anyway it seems [18:49:08] <valhallasw`cloud> because it's below all the other ones [18:49:11] <YuviPanda> and not supported for labs :P [18:49:20] <YuviPanda> I’d say just put it on the per-host one for now [18:50:20] <valhallasw`cloud> no, I don't think we need the role one [18:50:35] <valhallasw`cloud> because there's also $classpath [18:50:43] <valhallasw`cloud> which is not anywhere in https://github.com/wikimedia/operations-puppet/blob/production/modules/puppetmaster/files/production.hiera.yaml [18:50:44] <valhallasw`cloud> ugh. [18:51:03] <YuviPanda> $classpath? [18:51:15] <YuviPanda> where do you see that? [18:51:35] <valhallasw`cloud> https://wikitech.wikimedia.org/wiki/Puppet_Hiera [18:51:48] <valhallasw`cloud> hieradata/${::site}/$classpath.yaml If you need to configure something differently per-site (so, eqiad or codfw) globally it should go here. But it should happen only for very general, base classes. 
$classpath is computed as in the puppet autoload mechanism - so foo::params::param would be searched inside hieradata/${::site}/foo/params.yaml as param [18:52:16] <YuviPanda> valhallasw`cloud: that’s just for prod [18:52:19] <YuviPanda> :) [18:52:21] <valhallasw`cloud> -_- [18:52:25] <YuviPanda> labs is an afterthought of course [18:52:25] <valhallasw`cloud> yay documentation [18:52:30] <YuviPanda> only the things under the Labs section apply [18:52:37] <YuviPanda> valhallasw`cloud: there’s a specific Labs part [18:53:07] <valhallasw`cloud> subsections of subsections of subsections [18:53:23] <YuviPanda> yup [18:53:28] <YuviPanda> labs is an afterthought, etc :) [18:59:13] <Krenair> MariaDB [metawiki_p]> select max(rev_timestamp) from revision; [18:59:13] <Krenair> +--------------------+ [18:59:14] <Krenair> | max(rev_timestamp) | [18:59:14] <Krenair> +--------------------+ [18:59:15] <Krenair> | 20150419000905 | [18:59:15] <Krenair> +--------------------+ [18:59:16] <Krenair> 1 row in set (0.00 sec) [18:59:18] <Krenair> Coren, ^ [18:59:33] <Krenair> When I checked earlier that was 20150418210147 [18:59:41] <Krenair> So it's moving but still a few days old [19:00:03] <Coren> Krenair: Did you check more than 3 hours ago? [19:00:12] <Krenair> yes [19:00:16] <Coren> Krenair: (I.e.: is it catching up or lagging more?) [19:00:24] <Krenair> it would've been more than 6 hours ago [19:00:26] <Krenair> it's catching up [19:00:49] <Coren> Krenair: I think someone had a table lock for a few days and Sean kicked it - but it takes time to catch up. [19:00:55] <Krenair> alright [19:02:07] <shinken-wm> RECOVERY - Puppet failure on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [19:07:09] <wikibugs> 10Tool-Labs: Multiple queue runners on tools-mail - https://phabricator.wikimedia.org/T74867#1228707 (10scfc) IIRC @mark switched to `exim-heavy` for proper LDAP lookups, but this was never followed up because the `exim-light` package came in the way. 
So IMHO we should switch back to `exim-light` which is all t... [19:09:06] <valhallasw`cloud> it seems grrrit-wm is broken again [19:09:07] <valhallasw`cloud> fun fun fun [19:09:26] <valhallasw`cloud> anyway, YuviPanda / Coren, patch @ https://gerrit.wikimedia.org/r/#/c/205910/ [19:09:59] <YuviPanda> valhallasw`cloud: it isn’t broken, it just goes to -operations :) [19:10:00] <wikibugs> 10Tool-Labs, 5Patch-For-Review: toollabs::bastion uses $ldap::role::config::labs::ldapconfig without including ldap::role::config::labs - https://phabricator.wikimedia.org/T96266#1228746 (10scfc) 5Open>3Resolved a:5scfc>3yuvipanda [19:10:10] <valhallasw`cloud> oh, I'm just blind [19:10:12] <valhallasw`cloud> that explains [19:10:31] <YuviPanda> valhallasw`cloud: merged for you [19:10:34] <YuviPanda> do a puppet run? [19:10:34] <valhallasw`cloud> <3 [19:10:49] <valhallasw`cloud> sec [19:11:13] <wikibugs> 10Tool-Labs, 5Patch-For-Review: toollabs::bastion uses $ldap::role::config::labs::ldapconfig without including ldap::role::config::labs - https://phabricator.wikimedia.org/T96266#1228753 (10yuvipanda) @scfc Ouch, I ddin't see this bug nor your patch. Apologies :( [19:11:28] <valhallasw`cloud> Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[exim4-daemon-heavy] is already declared in file /etc/puppet/modules/exim4/manifests/init.pp:33; cannot redeclare at /etc/puppet/modules/toollabs/manifests/mailrelay.pp:24 on node i-000000d1.eqiad.wmflabs [19:11:30] <valhallasw`cloud> ORLY [19:11:45] <Coren> I had tried to -1 the patch [19:11:53] <Coren> But Yuvi jumped the gun and merged too fast. 
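Aside on the "Duplicate declaration" error above: Puppet allows a given resource (type plus title) to be declared only once per catalog. The Python model below is a rough sketch of that rule and not Puppet's implementation; the Catalog class and its declare method are hypothetical names, while the file paths are the ones from the error message.

```python
# Toy model of a catalog that refuses a second declaration of the same
# (type, title) pair. Hypothetical classes, for illustration only.
class DuplicateDeclaration(Exception):
    pass


class Catalog:
    def __init__(self):
        self.resources = {}

    def declare(self, rtype, title, where, **params):
        key = (rtype, title)
        if key in self.resources:
            first = self.resources[key]["where"]
            raise DuplicateDeclaration(
                f"{rtype}[{title}] is already declared in {first}; "
                f"cannot redeclare at {where}"
            )
        self.resources[key] = {"where": where, "params": params}


catalog = Catalog()
catalog.declare("Package", "exim4-daemon-heavy",
                "modules/exim4/manifests/init.pp:33", ensure="present")
try:
    catalog.declare("Package", "exim4-daemon-heavy",
                    "modules/toollabs/manifests/mailrelay.pp:24", ensure="present")
except DuplicateDeclaration as err:
    print(err)
```

The second declaration fails even though both ask for the same state, which is why the package stanzas had to be removed from toollabs::mailrelay rather than kept alongside the exim4 class.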
[19:12:01] <YuviPanda> ouch [19:12:06] <Coren> "That's toollabs::mailrelay and you need to remove the related package{} stanzas from there (otherwise you'll get multiple definitions)" [19:12:20] <YuviPanda> alright, I’m going to let Coren handle this [19:12:32] <YuviPanda> Coren: am reverting [19:12:49] <YuviPanda> reverted [19:12:54] <valhallasw`cloud> puppet stupidity *sigh* [19:13:02] <Coren> :-) Poop occurs. :-) [19:13:29] <valhallasw`cloud> and then I should just explicitly include exim4 in mailrelay? [19:13:58] <Coren> valhallasw`cloud: If you really want to - but we generally don't explicitly include things included in standard [19:14:15] <valhallasw`cloud> hm, okay [19:14:18] * YuviPanda leaves to office [19:14:35] <Coren> (And yes, having the config mingled in with hiera is... well, I'm not a fan) [19:15:25] <YuviPanda> alternatives welcome :) [19:15:49] <shinken-wm> PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [19:16:18] <valhallasw`cloud> maybe I can just do class { 'exim4': variant => 'heavy' } ? [19:16:27] <Coren> YuviPanda: That ship sailed long ago. :-) I just didn't realize at the time how opaque that would make some bits of configuration. [19:16:44] <Coren> valhallasw`cloud: It's too late by then because standard already included it. [19:16:59] <valhallasw`cloud> oh, right [19:17:17] <Coren> valhallasw`cloud: Just axe the package {} stanzas, and maybe put in a comment that "relies on exim being heavy from hiera... blah blah" [19:19:18] <valhallasw`cloud> https://gerrit.wikimedia.org/r/205914 then [19:19:30] <valhallasw`cloud> ugh trailing space [19:19:57] <YuviPanda> valhallasw`cloud: Coren https://gerrit.wikimedia.org/r/#/c/205915/ [19:20:32] <valhallasw`cloud> YuviPanda: except I'm not sure if we want to kill the entire mail class [19:20:47] <YuviPanda> valhallasw`cloud: you want to kill the entire mail sender class no? 
[19:21:02] <YuviPanda> valhallasw`cloud: anyway, killing the mail sender is a must [19:21:02] <valhallasw`cloud> no, I just want it to use -heavy instead of -light [19:21:15] <YuviPanda> hmm [19:21:31] * YuviPanda contemplates on how to do this cleanly [19:22:14] <Coren> YuviPanda: His fixes does it. [19:22:18] <YuviPanda> yeah [19:22:25] <YuviPanda> I think [19:22:33] <YuviPanda> toollabs::mailrelay should not rely on standard doing things [19:22:35] <wikibugs> 10Tool-Labs: Set up ganglia/icinga for tools-mail exim paniclog - https://phabricator.wikimedia.org/T96898#1228811 (10valhallasw) 3NEW [19:23:07] <valhallasw`cloud> YuviPanda: mm, you want to disable the standard inclusion and then have it explicitly in mailrelay? [19:23:09] <Coren> YuviPanda: That's not a realistic expectation. 80% of our manifests rely on standard or base in some manner. [19:23:21] <YuviPanda> valhallasw`cloud: yup [19:24:13] <YuviPanda> valhallasw`cloud: this also allows easier customization in the future without having to hack around [19:24:25] <YuviPanda> Coren: in this case I think it totally is - it’s a mail host, it should manage its own mail things [19:24:29] <YuviPanda> that’s how we do it in prod [19:24:33] <YuviPanda> and for deployment-mx too [19:24:36] <YuviPanda> and it’s fairly trivial [19:24:44] <Coren> Yeah, okay, that's not insane. 
[19:25:38] <valhallasw`cloud> so we just add class { 'exim4': queuerunner => 'queueonly', config => template("mail/exim4.minimal.${::realm}.erb"), }, basically [19:25:55] <valhallasw`cloud> and we actually overwrite that config already [19:26:12] <valhallasw`cloud> so that would actually clean up a bit [19:26:15] <valhallasw`cloud> ok, let me do that [19:26:39] <YuviPanda> :D [19:27:59] <wikibugs> 10Tool-Labs: Set up graphite/icinga for tools-mail exim paniclog - https://phabricator.wikimedia.org/T96898#1228844 (10yuvipanda) [19:28:30] <valhallasw`cloud> YuviPanda: so many names that have nothing to do with what the tool does :P [19:28:57] <YuviPanda> valhallasw`cloud: :D [19:31:41] <valhallasw`cloud> oh yes this is going to be much better [19:34:19] <valhallasw`cloud> 3 files changed, 8 insertions(+), 46 deletions(-) [19:34:24] <valhallasw`cloud> that's the sign of a good change ;D [19:34:37] <valhallasw`cloud> https://gerrit.wikimedia.org/r/#/c/205914/ [19:35:15] <valhallasw`cloud> I'm not sure about the require (and unsure how to test as well) [19:36:11] <valhallasw`cloud> thanks for suggesting that, YuviPanda [19:40:14] <valhallasw`cloud> YuviPanda: wat. the stats are collected by diamond, then passed to graphite and finally warnings are sent by icinga? [19:40:26] <valhallasw`cloud> and diamond is configured via puppet [19:40:31] <valhallasw`cloud> my brain hurts [19:40:49] <shinken-wm> RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [19:51:36] <YuviPanda> valhallasw`cloud: shinken not icinga but yesb:) [19:52:02] <valhallasw`cloud> oh, icinga is the prod monitoring tool? 
[19:52:14] <YuviPanda> Yeah [19:52:19] <YuviPanda> and is badly puppetized [19:52:35] <YuviPanda> Shinken is for labs and hopefully for prod in the future [19:53:25] <wikibugs> 10Tool-Labs: Set up diamond/graphite/shinken for tools-mail exim paniclog - https://phabricator.wikimedia.org/T96898#1228913 (10valhallasw) [19:53:40] <valhallasw`cloud> YuviPanda: and where are the shinken alerts set up? on the shinken host, right? [19:53:53] <YuviPanda> Yeah [19:53:55] <YuviPanda> Well [19:54:00] <YuviPanda> Puppet had the config itself [19:55:55] <valhallasw`cloud> I think it makes sense to extend the diamond collector for exim [19:56:45] <valhallasw`cloud> because it just does queue size now, and it would make sense to extend it to stuff like number of frozen messages, oldest message in queue, etc [19:56:50] <valhallasw`cloud> newest message in queue [19:57:19] <YuviPanda> Yeah totally [19:57:40] <YuviPanda> We can just copy the upstream one locally and add things to it and then contribute it back [19:57:45] <YuviPanda> It is fairly clean python [19:58:23] <valhallasw`cloud> yup [19:58:43] <valhallasw`cloud> testing diamond stuff sucks though :P [19:58:58] <valhallasw`cloud> oh well [19:59:09] <valhallasw`cloud> as does exim's queue output [19:59:39] <YuviPanda> valhallasw`cloud: haha true :) I usually just run it locally on the host from my homedir and make it 'print' things [20:00:06] <valhallasw`cloud> and then I have to figure out how to disable exim's warning mails :P [20:00:19] <YuviPanda> That :l [20:00:33] <YuviPanda> we also need to move it to trusty and make it redundant [20:00:40] <YuviPanda> One host going down shouldn't take it down [20:00:58] <valhallasw`cloud> it's email so it's fine :p [20:01:38] <YuviPanda> Dunno :p [20:09:06] <hashar> Coren: andrewbogott: YuviPanda: the Puppet modules for OpenStack are now an official project! :) https://review.openstack.org/#/c/172112/ [20:09:28] <Coren> Not sure if gusta. 
[20:09:42] <Coren> In general, external puppet modules are... mitigated in their usefulness. [20:10:15] <andrewbogott> hashar: yeah, we tried to adopt them a while ago and they were… difficult. [20:10:20] <andrewbogott> But it’s good that they’re taking it seriously [20:10:39] <andrewbogott> what I really want is for OpenStack support for puppet, not puppet support for OpenStack :) [20:11:02] <hashar> they are getting a technical leader apparently https://www.mail-archive.com/openstack-dev@lists.openstack.org/msg49290.html [20:11:24] <hashar> and doing a lot of acceptance tests [20:11:30] <hashar> so maybe one day we can revisit it :) [20:12:50] <andrewbogott> YuviPanda: did you say you want to move tools over to the new dns? [20:13:08] <YuviPanda> andrewbogott: I filed a bug, yeah [20:13:16] <andrewbogott> when are you thinking? [20:13:22] <hashar> have you found a solution for the DNS split horizon? [20:13:25] <YuviPanda> Dig tools-webproxy-02 doesn't work [20:13:30] <YuviPanda> andrewbogott: ^ [20:13:31] <hashar> aka yield different IP for wmflabs.org. entries [20:13:38] <YuviPanda> Ping does but dig doesn't [20:14:06] <valhallasw`cloud> 3m 1.9K 1Yl0j9-12345-GH <> *** frozen *** [20:14:06] <valhallasw`cloud> tools.abcdefg@tools.wmflabs.org [20:14:22] <valhallasw`cloud> YuviPanda: ^ that's the format exim -bpr gives you >_< [20:14:24] <YuviPanda> andrewbogott: is there no way at all to make public floating IPS hittable from labs] [20:14:32] <valhallasw`cloud> and that's their suggested output 'for further processing' [20:14:51] <andrewbogott> YuviPanda: is this related or are you changing the subject? [20:15:11] <YuviPanda> andrewbogott: related because we would need split DNS too [20:15:20] <YuviPanda> Or have public IPS work from inside labs [20:15:51] <andrewbogott> I need more context for this. [20:17:00] <YuviPanda> andrewbogott: give me 30mins? 
On way to office [20:17:15] <YuviPanda> andrewbogott: basically if you dig tools.wmflabs.org or any DNS assigned to floating ip [20:17:22] <YuviPanda> You get back the floating IP by default [20:17:34] <YuviPanda> But since you can't reach floating IPS from inside labs [20:17:51] <andrewbogott> ah, ok, I remember — this is hacked up in the dnsmasq config [20:17:57] <YuviPanda> We have dnsmasq aliases for them now. So if you hit tools.wmflabs.org a hard coded private IP is returned [20:17:57] <YuviPanda> Yesh [20:18:07] <YuviPanda> The correct solution is to have public IPS be routablr [20:18:45] <YuviPanda> In the absense of which we need to be able to hack something up so internal requests get a different IP them external onrd [20:18:48] <YuviPanda> Ones [20:18:50] <andrewbogott> is this related to that ndots issue? [20:19:12] <YuviPanda> No that is just dnsmasq not following the DNS spec [20:19:34] <andrewbogott> ok [20:20:24] <Krinkle> andrewbogott: I'm getting connection failure from cvn instances. This has never happened in many months so I assume this is related to the migration? [20:20:33] <Krinkle> shinken-wm: PROBLEM - Host cvn-app5 is DOWN: CRITICAL - Host Unreachable (10.68.16.170) [20:20:33] <Krinkle> 21:19 shinken-wm: RECOVERY - Host cvn-app5 is UP: PING OK - Packet loss = 0%, RTA = 478.85 ms [20:21:03] <andrewbogott> Krinkle: is it going down and coming back a lot? [20:21:15] <andrewbogott> I’ve no idea if that’s related; I don’t know why it would be [20:21:46] <Krinkle> andrewbogott: It happened 5 times in the past 12 hours. And never in the past 6 months. [20:21:56] <andrewbogott> ok [20:22:10] <wikibugs> 10Tool-Labs, 5Patch-For-Review: Multiple queue runners on tools-mail - https://phabricator.wikimedia.org/T74867#1229129 (10scfc) You're way too fast for me. We **don't** use `exim-heavy`. IMHO https://gerrit.wikimedia.org/r/#/c/164366/ needs to be reverted. Nothing more. 
[20:22:12] <Krinkle> both app4 and app5 [20:22:15] <Krinkle> But not the apache server. [20:22:20] <andrewbogott> Krinkle: did you experience an actual outage, or just get the notice from shinken? [20:22:22] <Krinkle> Maybe related to it being an old instance type or something like that [20:22:37] <Krinkle> Right now only shinken. I haven't investigated yet and can't right now. [20:22:54] <Krinkle> if not related to the old instance type, perhaps the labsvirt it is on is the issue. [20:23:13] <andrewbogott> there really shouldn’t be anything happening, unless the network is saturated by the work of migrating other things [20:25:13] <Krinkle> andrewbogott: Did those instances migrate yet? [20:25:17] <andrewbogott> yep [20:25:53] <andrewbogott> 21:19 must be… your local time? [20:26:22] <andrewbogott> There was definitely some kind of crazy network storm a few minutes gao. [20:26:23] <andrewbogott> ago [20:27:36] <andrewbogott> YuviPanda: if you have a bug about labs instances reaching public labs IPs, please link? Otherwise I’ll make one [20:28:05] <YuviPanda> Still on phone tho making way to office for meeting I forgot I had [20:28:16] <YuviPanda> andrewbogott: make one we can merge if needed [20:28:58] <YuviPanda> Being able to use floating IPS internally also gives us a lot of freedom for redundancy planning [20:36:07] <Krinkle> andrewbogott: Ah, yeah, local time. [20:36:13] <Krinkle> andrewbogott: Just got another one, this time "FLAPPINGSTART" [20:36:15] <Krinkle> whatever that is? [20:37:40] <Krinkle> And I'm now seeing actual issues too in the channels. I/O unavailable. and network timeout. 
[20:52:25] <wikibugs> 10Quarry, 6Analytics-Kanban: it would be useful to run the same Quarry query conveniently in several database - https://phabricator.wikimedia.org/T95582#1229295 (10ggellerman) [20:56:24] <wikibugs> 10Tool-Labs, 5Patch-For-Review: Multiple queue runners on tools-mail - https://phabricator.wikimedia.org/T74867#1229314 (10valhallasw) According to @coren, we also need exim-heavy for outgoing SPF e-mails. If we want to use -light (which is fine with me), the last patchset is still useful, as it cleans up the... [20:58:38] <valhallasw`cloud> YuviPanda, does diamond run as root, or does it at least have read rights in /var/log/exim? [20:59:22] <valhallasw`cloud> hm, no, clearly runs as 'diamon' [20:59:25] <valhallasw`cloud> diamond* [20:59:32] <valhallasw`cloud> so I have to get file contents through sudo....? [21:17:57] <valhallasw`cloud> YuviPanda: https://github.com/BrightcoveOS/Diamond/commit/689a56e6121582a48ba772f8e3d6a6383573ab17 [21:18:11] <valhallasw`cloud> ['a', 'b', 'c'].extend([...]) = None :-p [21:18:17] <valhallasw`cloud> clearly no-one uses the code ;-) [21:20:51] <valhallasw`cloud> YuviPanda: but my code is working \o/ [21:22:06] <valhallasw`cloud> except that there's no paniclog, and I'm not sure how to detect the difference between 'weird read error' and 'no paniclog' [21:22:32] <valhallasw`cloud> especially through sudo [21:22:50] <YuviPanda> valhallasw`cloud: sorry was in a meeting. 
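Aside on the extend() remark above: list.extend mutates the list in place and returns None, so using its return value as the new list silently gives None. A short Python illustration follows; the variable names are arbitrary.

```python
base = ['a', 'b', 'c']
result = base.extend(['d'])   # extend changes base in place
print(result)                 # the return value is None, not the list
print(base)                   # base itself now holds four items

# When a new list is needed as an expression, concatenate instead.
combined = ['a', 'b', 'c'] + ['d']
print(combined)
```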
[21:22:56] <valhallasw`cloud> np [21:23:13] <YuviPanda> valhallasw`cloud: there’s a puppet stanza in toollabs::mailrelay that adds sudo for particular commands to diamond [21:23:15] <YuviPanda> it runs as diamond [21:23:36] <valhallasw`cloud> yeah, so it can't read log files without sudo [21:23:48] <YuviPanda> yeah and you’ve to basically sudo cat to read files [21:23:57] <YuviPanda> see the minimalpuppetagent.py file in ops/puppet [21:24:00] <YuviPanda> it’s super ugly [21:24:22] <valhallasw`cloud> yeah, I'm sudo cat'ing now [21:26:41] <valhallasw`cloud> YuviPanda: anyway, ssh tools-mail, cd /home/valhallasw/eximparser, run diamond -f -l --skip-pidfile -c diamond.conf [21:28:20] <YuviPanda> valhallasw`cloud: Coren is there an archive of all bots approved to run on enwiki? [21:28:24] <YuviPanda> or just all bot approval requests? [21:29:15] <YuviPanda> anomie: ^ [21:29:18] <YuviPanda> valhallasw`cloud: looking [21:29:20] <Coren> YuviPanda: Just the BAG requests, and the record isn't 100% either [21:29:42] <YuviPanda> Coren: yeah but I only see current requests on https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval [21:29:54] <YuviPanda> aha! [21:29:54] <YuviPanda> https://en.wikipedia.org/wiki/Category:Wikipedia_bot_requests_for_approval [21:29:55] <YuviPanda> found it [21:30:25] <valhallasw`cloud> YuviPanda: I had to parse silly date and size formats :{ [21:31:32] <andrewbogott> Krinkle: hypothetical fix to a possible cause of the issue you mentioned: https://gerrit.wikimedia.org/r/#/c/205978/ [21:32:55] <YuviPanda> valhallasw`cloud: :D [21:33:05] * YuviPanda sshs to look [21:37:52] <YuviPanda> valhallasw`cloud: +1 :D [21:37:54] <YuviPanda> put it in puppet! 
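Aside on the "silly date and size formats": a rough Python sketch of parsing one header line of the queue listing, modelled only on the sample line pasted earlier in the channel (age, size, message id, sender, optional frozen marker). This is not valhallasw's collector code; parse_header is a hypothetical helper, and treating K and M as powers of 1024 is an assumption.

```python
import re

# Seconds per age unit and bytes per size suffix (the 1024 base is assumed).
AGE_UNITS = {"m": 60, "h": 3600, "d": 86400}
SIZE_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2}

HEADER = re.compile(
    r"^\s*(?P<age>\d+)(?P<age_unit>[mhd])\s+"
    r"(?P<size>\d+(?:\.\d+)?)(?P<size_unit>[KM]?)\s+"
    r"(?P<msgid>\S+)\s+(?P<sender><[^>]*>)"
    r"(?P<frozen>\s+\*\*\* frozen \*\*\*)?\s*$"
)


def parse_header(line):
    """Parse a queue header line; return None for lines that are not headers."""
    m = HEADER.match(line)
    if m is None:
        return None
    return {
        "age_seconds": int(m.group("age")) * AGE_UNITS[m.group("age_unit")],
        "size_bytes": int(float(m.group("size")) * SIZE_UNITS[m.group("size_unit")]),
        "msgid": m.group("msgid"),
        "sender": m.group("sender"),
        "frozen": m.group("frozen") is not None,
    }


print(parse_header("3m 1.9K 1Yl0j9-12345-GH <> *** frozen ***"))
```

Recipient lines such as the address that follows the header do not match the pattern and come back as None, so a caller can attach them to the most recent header. Counting frozen messages or finding the oldest one then reduces to iterating over the parsed dictionaries.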
[21:38:04] <valhallasw`cloud> YuviPanda: now we just have to wait for the panic log to be nonempty again ;D [21:38:12] <valhallasw`cloud> First bed and sleep [21:38:29] <YuviPanda> : [21:39:04] <YuviPanda> valhallasw`cloud: \o/ [21:40:40] <YuviPanda> valhallasw`cloud: wanna stick around for merging / testing your tools-mail patch or should I let it be for a while? [21:40:54] <valhallasw`cloud> YuviPanda: can we do it tomorrow? [21:41:02] <YuviPanda> valhallasw`cloud: sure! [21:41:18] <valhallasw`cloud> good :-) and maybe we should wait for scfc to comment [21:41:26] <YuviPanda> yeah [21:41:26] <valhallasw`cloud> as we might want to use -light after all [21:41:34] <valhallasw`cloud> (not sure what the difference is, really) [21:41:50] <YuviPanda> me neither [21:48:46] <andrewbogott> YuviPanda, hashar: https://phabricator.wikimedia.org/T96924 [21:49:03] <wikibugs> 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1229482 (10yuvipanda) [21:49:21] <YuviPanda> andrewbogott: \o/ yes. [21:50:35] <YuviPanda> andrewbogott: I think a solution to that is a blocker for migrating otols [21:50:36] <YuviPanda> *tools [21:50:36] <wikibugs> 10Tool-Labs, 5Patch-For-Review: Multiple queue runners on tools-mail - https://phabricator.wikimedia.org/T74867#1229495 (10scfc) It might be that we //will// need `exim-heavy` for SPF, but //at the moment// there are no SPF checks in `modules/toollabs/templates/exim4.conf.erb`. Your latest patchset in https:/... 
[21:51:16] <andrewbogott> YuviPanda: yeah [21:51:51] <wikibugs> 6Labs, 10Tool-Labs: Move tools to designate - https://phabricator.wikimedia.org/T96641#1229496 (10yuvipanda) [21:51:53] <wikibugs> 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1229497 (10yuvipanda) [21:52:00] <YuviPanda> andrewbogott: https://phabricator.wikimedia.org/T96642 is the other one [21:53:05] <andrewbogott> YuviPanda: ok, and… the designate behavior is right or wrong? [21:53:23] <YuviPanda> andrewbogott: hmm, I do not know. I know it’s different from dnsmasq. [21:53:40] <andrewbogott> should ndots be 3 for designate since hosts have.an.extra.wmflabs ? [21:53:54] <YuviPanda> but ping does work. [21:53:56] <YuviPanda> oh hmm [21:55:02] <YuviPanda> andrewbogott: not sure. [21:55:16] <YuviPanda> definitely needs someone who knows more about DNS than I do... [21:55:57] <andrewbogott> YuviPanda: or you could try it [21:56:07] <YuviPanda> oooh, that’s an option! [21:56:08] * YuviPanda tries [21:56:15] <YuviPanda> hmm, I deleted the instance with it on tho :| [21:56:19] <YuviPanda> do you have a test in stance I could use? [21:56:54] <andrewbogott> I think util-abogott is using pdns [21:58:57] <YuviPanda> andrewbogott: no different [22:01:04] <YuviPanda> andrewbogott: it might just be valid behavior, though. [22:01:06] <YuviPanda> andrewbogott: ping works [22:01:19] <YuviPanda> I guess dig doesn’t care about ndots? [22:01:34] <YuviPanda> and dnsmasq is just… a ‘eh whatever’ kind of place? [22:08:36] <wikibugs> 6Labs: allow routing between labs instances and public labs ips - https://phabricator.wikimedia.org/T96924#1229535 (10hashar) I originally reported it as {T39985}, in short that is due to NAT. Some more details at https://rt.wikimedia.org/Ticket/Display.html?id=4824#txn-147723 I originally fixed it using iptab... [22:08:51] <hashar> andrewbogott: are you using Neutron on labs? 
[22:09:12] <andrewbogott> YuviPanda: oh yeah, now that I think of it I wouldn’t expect dig to work, since it doesn’t go through resolv.conf [22:09:18] <andrewbogott> hashar: no, nova-network [22:09:38] <hashar> ahh [22:09:40] <YuviPanda> Oh [22:09:43] <YuviPanda> I didn't know that [22:09:45] <hashar> first google hit is "Deprecation of Nova Network" :D [22:09:51] <YuviPanda> But why does it work on dnsmasq [22:09:55] <andrewbogott> YuviPanda: although, as you say, it works on dnsmasq. So… ??? [22:09:58] <andrewbogott> Does dig foo.projectname work? [22:10:23] <andrewbogott> hashar: it’s died and come back to life more times than <blasphemous analogy> [22:10:33] <hashar> :-D [22:10:35] <YuviPanda> Haven't tried yet - came for lunch [22:10:48] <YuviPanda> I think it is ok to close it as won't fix tho [22:11:04] <hashar> for the task "allow routing between labs instances and public labs ips" , I guess the best would be to reach out to nova-network folks and ask for help / review [22:11:07] <andrewbogott> YuviPanda: yeah, until we have a reason to care, let’s not care :) [22:11:08] <YuviPanda> I think it was because dnsmasq was also the dhcp server [22:11:14] <YuviPanda> andrewbogott: yesh [22:11:22] <andrewbogott> hashar: if, in fact, we aren’t intentionally preventing it. [22:11:24] <andrewbogott> Which I bet we are [22:12:20] <hashar> the NAT might not be applied on the internal interface for ingress packets [22:12:29] <hashar> so they would end up being routed to the default route / internet [22:12:42] <hashar> with a source of private instance IP and destination of the public IP [22:13:16] <hashar> then bounce back on the external network address but with an internal private IP which the router probaqbly just plain reject [22:13:28] <hashar> I havent dealt with such issues for a decade though :/ [22:14:12] <andrewbogott> hashar: I’m not clear on why the hack was done in dnsmasq though — wouldn’t it be just as easy for you to have a local hosts file? 
[22:14:44] <hashar> then you have to puppetize the hosts file [22:14:53] <hashar> and get it applied on every single instances that needs that [22:15:00] <hashar> brandon trick was to get it centrally [22:15:08] <hashar> so any project would benefit from the aliased entries [22:15:26] <hashar> be it instances in tools, integration, beta cluster or whatever else [22:15:42] <hashar> the typical example being the webproxy where you code refers to foobar.wmflabs.org [22:15:55] <andrewbogott> sure [22:16:02] <hashar> and stall when running it on your instance until you figure out to do an hosts hack [22:16:08] <andrewbogott> but… *shrug* puppet is easy, and every instance is already puppetized. It could be done labs-wide. [22:16:19] <hashar> yeah [22:16:33] <hashar> but then why bother with flat hosts files when you have a central dns resolver ? :] [22:16:35] <andrewbogott> Anyway, that’s probably better than hacking the resolver, but better yet would be to just have routing work. [22:16:47] <hashar> yup [22:16:49] <hashar> +2 on routing [22:17:11] <hashar> if you get some contacts with nova-network folks, that would be a good use case for them to look at [22:19:50] <Coren> Welp, deployment-prep reboot went without a hitch. Yeay. [22:22:28] <hashar> \o/ [22:22:40] <hashar> I am off for real. Have to drive tomorrow [22:22:43] <hashar> see you! [22:23:40] <shinken-wm> PROBLEM - Puppet failure on tools-webgrid-07 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [22:24:58] <YuviPanda> andrewbogott: tools-webproxy-01.tools works too :) [22:25:19] <andrewbogott> with dnsmasq you mean? [22:25:22] <YuviPanda> andrewbogott: no, without [22:25:28] <YuviPanda> pinging, I meant [22:25:30] <andrewbogott> oh [22:25:32] <andrewbogott> huh [22:25:44] <andrewbogott> oh, pinging, it should. That seems well defined by resolv.conf [22:25:47] <YuviPanda> yeah [22:25:53] <YuviPanda> I guess we should just ignore dig.. 
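Aside on ping versus dig: ping resolves names through the system stub resolver, which expands short names using the search and ndots settings in resolv.conf, while dig by default queries the name as typed and only applies the search list when run with +search. The Python sketch below lists the names a stub resolver would try, following the rules in resolv.conf(5); candidate_names is a hypothetical helper and the search domain used in the example is an assumption.

```python
def candidate_names(name, search, ndots=1):
    """Names a stub resolver would try, in order, for the given short name."""
    if name.endswith("."):
        return [name]                      # already absolute, no expansion
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded       # enough dots: try as-is first
    return expanded + [absolute]           # too few dots: search list first


search = ["eqiad.wmflabs"]                 # assumed search domain
print(candidate_names("tools-webproxy-02", search))
print(candidate_names("tools-webproxy-01.tools", search))
```

With a higher ndots value, a name like tools-webproxy-01.tools would be tried against the search list before being tried as an absolute name.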
[22:26:35] <andrewbogott> ok, I’m out for the night. [22:26:51] <Coren> YuviPanda: Don't ignore it, just don't misuse it. Dig really only should be use to query specific nameservers for fqdns. :-) [22:26:58] <andrewbogott> fyi, the migration is causing network spikes each time an instance copies. I set a throttle setting but I’m not yet clear on if it’s observed. [22:27:04] <andrewbogott> Worst case I think we just live with it. [22:27:14] <YuviPanda> :) fair enough [22:27:17] <Coren> It's a DNS debugging tool, not a resolution debugging tool. :-) [22:27:33] <YuviPanda> I’m going to do the tools-dev switchover soon [22:27:39] <YuviPanda> Coren: can you add me as mod for labs-l? [22:27:51] <Coren> YuviPanda: Sure. I thought you already were. [22:27:53] <YuviPanda> or just subscribe labs-announce to labs-l so labs-announce can email labs-l and not have it be rejected? [22:28:09] <Coren> I shall do both. What email addy are you subscribed with? [22:28:57] <YuviPanda> Coren: yuvipanda@gmail.com [22:29:35] <Coren> Wait, why am I the only labs-l admin? andrewbogott_afk, you are also volunteering. :-) [22:29:52] <YuviPanda> :D [22:33:58] <Coren> Hm. I realize that I don't actually *know* the list admin password anymore. [22:34:07] <KTC> lol [22:34:40] <Coren> Not a good idea to rely on my browser. Ima change it and send it to you guys. [22:36:55] <Magog_the_Ogre> How does Mediawiki render SVGs into PNGs? Is this something I could do automatically without uploading an image? [22:37:20] <Coren> Magog_the_Ogre: It does a render for the thumbnails. [22:37:38] <YuviPanda> Magog_the_Ogre: it uses rsvg IIRC. you can use rsvg directly [22:37:45] <Coren> Magog_the_Ogre: And I *think* the IE6 support will actually use those too [22:37:58] * Coren has to walk the dogs. [22:38:28] <Magog_the_Ogre> Coren, do you guys actually still have to program for IE6?! 
[22:38:46] * Magog_the_Ogre does the sign of a cross over his heart and mumbles a prayer [22:43:53] <Coren> Magog_the_Ogre: It's reduced-feature support but yeah - there is still a sizeable install base, especially in the developping world. [22:45:29] <Magog_the_Ogre> ohhhhh [22:45:36] <Magog_the_Ogre> I work for a company with a US-only footprint [22:45:51] <Magog_the_Ogre> our statistics show IE6 support at something like .02% [22:48:12] <Coren> IIRC, our last numbers show IE^ at about .25% globally, but as high as 15-20% in some countries. [22:49:00] <Coren> But it's still dropped from the "Grade A" support so it has a reduced featureset (no js, some css gone) [22:49:37] <Coren> The Analytics people could tell you more if you are interested, I suppose. [22:50:36] <Coren> IIRC we only dropped IE6 support for jquery stuff ~ a year ago. [22:52:23] <Magog_the_Ogre> I love reading about that stuff [22:52:46] <Magog_the_Ogre> someone providing me the analytics data from our company was a demographics geek's wet dream [22:53:43] <shinken-wm> RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [23:01:18] <YuviPanda> Coren: we don’t serve JS to IE6 tho [23:13:43] <wikibugs> 6Labs, 6operations, 5Patch-For-Review, 7Shinken: Shinken down - https://phabricator.wikimedia.org/T96817#1229705 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Alright, so shinkengen handling duplicate hostnames is a thing it has to do during the DNS migration :)