[09:07:42] vgutierrez, did you get stuff under /etc/acmecerts showing up as dirs that should not be?
[09:07:54] you're right :/
[09:25:29] Traffic, Operations, Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (ema) We're currently using swift-rw (eqiad only) as the origin server for upload cache misses. Thumb traffic can however be served active/activ...
[09:42:34] godog: shall ATS serve thumb traffic A/A? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/498028/
[09:50:04] Traffic, Operations, Performance-Team, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles) Open→Stalled
[09:50:10] Traffic, Operations, Performance-Team, Patch-For-Review: Normalize thumbnail request URLs in Varnish to avoid cachebusting - https://phabricator.wikimedia.org/T216339 (Gilles) Open→Stalled
[09:50:13] Traffic, Operations, Performance-Team, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles)
[10:24:11] damn... the expectedfail annotation in the get_file_metadata() test masked the issue :(
[10:24:25] ema: yeah looks good!
[10:35:43] Acme-chief: acme-chief >0.13 generates wrong metadata for the endpoint used in file based deployment layout - https://phabricator.wikimedia.org/T218862 (Vgutierrez) p:Triage→High
[11:20:38] Krenair: fixed in https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/498046 and tested successfully with a puppet client for both scenarios (file and directory)
[11:29:12] vgutierrez, approved
[11:29:19] thx
[11:29:21] vgutierrez, you know in puppet 5 this API is documented as being able to return application/json
[11:29:45] we're returning yaml right now
[11:30:03] yeah but it'd be easy to change to json
[11:30:08] indeed
[11:30:11] not sure how easy it'd be to change to pson
[12:05:45] godog: I now see slightly more connections to codfw than to eqiad on cp-ats-codfw, as expected
[13:39:57] vgutierrez or ema, I'm looking at the puppet diff for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/496858/ and concerned about this diff:
[13:40:02] https://www.irccloud.com/pastebin/myaXO4m6/
[13:40:32] Does it matter that I'm adding a public IP endpoint on a host that otherwise only has private ones? lvs1006 is itself on a public IP, so it /should/ work… but I don't like being the one big outlier there
[13:41:21] the lvs sits in all the VLANs, so it shouldn't be an issue AFAIK
[13:41:29] ok
[13:41:37] Ready for me to merge that, or do you have other concerns?
[13:43:22] maybe bblack can take a final look at the CR
[13:44:36] ok
[13:55:32] yeah there are some issues with that CR, but I need to finish some coffee to nail it all down
[13:55:42] thanks!
[14:27:49] andrewbogott: updated; there's still one part I can't quite figure out, but I could fix the rest
[14:35:18] bblack: updated. It's the role::lvs::realserver::pools: bit that's unclear?
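
As a hedged aside on the role::lvs::realserver::pools question just raised: one way to see what that hiera key actually resolves to for a given host is Puppet's lookup tool. The sketch below assumes it is run somewhere the node's facts and hiera data are available (e.g. a puppetmaster); the node name is a placeholder, not one of the actual LDAP replicas.

    # Minimal sketch: inspect how the realserver pools hiera key resolves
    # for a node. The hostname below is a placeholder.
    sudo puppet lookup role::lvs::realserver::pools \
        --node ldap-replica-placeholder.eqiad.wmnet --render-as yaml
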
[14:35:29] I'll try to re-read that code too
[14:44:02] Acme-chief: acme-chief >0.13 generates wrong metadata for the endpoint used in file based deployment layout - https://phabricator.wikimedia.org/T218862 (Vgutierrez) Open→Resolved
[14:51:58] hm, the pool defined in role/common/scb.yaml implies that the way I had it before was right
[15:01:20] andrewbogott: yeah I'm not sure either way tbh, it's hard to find an exactly identical example, but I'll look again after a couple meetings
[15:08:22] I confirmed (via reading code and testing with the compiler) that it's only used to look up the service IP, which is only defined for ldap-ro. So I'm pretty sure it's right now.
[15:59:32] ok
[16:00:20] my other concern is just I have no idea about how we're restricting access to this, I kind of assume either router ACLs or ferm rules on the LDAP hosts, which I guess will still work through LVS. But I assume even though we're allocating a public IP to this, that's mostly for wmcs reachability and not actually for truly public access
[16:01:05] if it's router ACLs we might need to edit them to include this new LVS service IP too
[16:04:36] As far as I know there aren't router ACLs, just firewall rules.
[16:05:18] And yeah, I don't expect it to be a public service, although reliable read-only ldap could be useful for some prod services outside of cloud things.
[16:05:42] Do you think I should add ferm rules on the lvs servers themselves as well? Or is it enough to have them on the ldap servers?
[16:59:46] Traffic, Horizon, Operations, Upstream, cloud-services-team (Kanban): Horizon Designate dashboard not allowing creation of NS records - https://phabricator.wikimedia.org/T204013 (GTirloni)
[17:11:22] andrewbogott: we don't have ferm on LVS at all, but if it's got ferm rules on the service hosts, those ferm rules should still work
[17:15:49] bblack: cool
[17:16:01] Are you around for a bit to help with aftershocks if I merge it now?
[17:20:09] hopefully no aftershocks, but yes. let me stare at it just a little more.
[17:20:14] 'k
[17:21:17] what's with the conftool-data/discovery/services.yaml stuff added?
[17:21:49] that was an attempt to resolve a puppet issue… it didn't help but seemed right anyway. Am I misunderstanding what that's for?
[17:22:07] that's for discovery-dns, but I don't think we're doing that for LDAP here, just LVS within a DC
[17:22:49] ok, I'll remove — we can worry about that later if I ever build a cluster in codfw
[17:23:17] done
[17:23:45] the idea with discovery-dns is yeah, we'd have LVS services in both DCs, and a central ldap hostname that does a/a or a/p between the two.
[17:24:43] So if we ever had VMs in codfw it would save on custom images. But, we can cross that when we come to it.
[17:24:50] (probably a/a in this case, so that eqiad clients hit eqiad and codfw clients hit codfw, but it fails over if we disable one side)
[17:28:51] running it through puppet compiler to see (PS10)
[17:29:26] https://puppet-compiler.wmflabs.org/compiler1002/15266/
[17:31:18] looks ok I think
[17:31:29] we'll have to step through the pybal stuff manually for restarts
[17:31:52] yeah, the diff on the ldap server is pretty much unreadable but the lvs server side seems ok.
[17:31:58] Tell me more about stepping through the pybal stuff?
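
On the earlier ferm question (whether rules are also needed on the LVS hosts themselves): since LVS forwards packets with the client source address intact in the usual direct-routing setup, the existing rules on the LDAP replicas should keep matching, which is the point made above. A rough way to double-check on a replica, assuming the standard LDAP ports rather than anything taken from these specific hosts:

    # Show the iptables rules ferm would generate, without applying them,
    # and confirm the LDAP ports are covered (389/636 assumed).
    sudo ferm --noexec --lines /etc/ferm/ferm.conf | grep -E 'dport (389|636)'
    # Compare with what is currently loaded in the kernel:
    sudo iptables -S | grep -E '389|636'
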
[17:32:25] basically: (1) merge the change (2) run puppet on the ldap replica hosts (3) run puppet on the affected lvses (4) carefully step through manual pybal restarts on lvs1005 + lvs1002
[17:33:09] pybal won't pick up the config change without a restart, so we restart the backup host 1005 first and see that it even works there correctly, and then restart 1002 (which will blip all LVS traffic over to lvs1005 briefly until it's fully back)
[17:34:05] I can take care of 3/4 once 1/2 is done
[17:34:14] whenever you're ready :)
[17:34:20] ok. So that sounds like 2 and 3 don't have to happen in a particular order, since it's a no-op until 3/4?
[17:35:04] it's basically a no-op on the LVS side until 4, but it's easier to revert back out if we can see problems in 2 before we do 3
[17:35:13] going to disable puppet on the affected lvses first
[17:35:30] And the affected lvs servers are just 1005 and 1002?
[17:35:33] yeah
[17:35:51] done
[17:36:03] ok, you beat me to it
[17:36:11] so merge when ready, confirm the ldap replicas puppetize ok
[17:36:21] yep, have to rebase and then I'll merge
[17:40:15] the patch applied cleanly, with this:
[17:40:17] https://www.irccloud.com/pastebin/oAxxzpq5/
[17:40:28] Which looks right to me, I think
[17:41:08] what was 10.2.2.17?
[17:41:38] I think that's from an earlier erroneous version of this same patch
[17:41:43] it's weird that was in the diff
[17:41:47] that's restbase's IP heh
[17:41:56] ok
[17:42:16] trying lvs1005
[17:44:36] andrewbogott: can you pool them in confd?
[17:45:17] Mar 21 17:44:52 lvs1005 pybal[10169]: [config-etcd] INFO: connected to etcd://conf1004.eqiad.wmnet:4001/conftool/v1/pools/eqiad/ldap-ro/ldap-ro/
[17:45:20] Mar 21 17:44:52 lvs1005 pybal[10169]: [config-etcd] INFO: connected to etcd://conf1004.eqiad.wmnet:4001/conftool/v1/pools/eqiad/ldap-ro/ldap-ro-ssl/
[17:45:23] Mar 21 17:44:52 lvs1005 pybal[10169]: [config-etcd] ERROR: failed: [Failure instance: Traceback (failure with no frames): : 404 Not Found
[17:45:33] seeing this spam over and over, assuming it's from lack of having any backends pooled
[17:45:36] I don't immediately know how to pool a host but I can look it up
[17:45:54] if you have root shells on the two ldap replica hosts, just type "pool"
[17:46:00] ok :)
[17:46:12] seems happy
[17:46:31] still getting nothing from confd, hmmm
[17:46:46] ok, I got a timeout before, now I get "Can't contact LDAP server" instead
[17:46:49] so the IP is live at least
[17:47:26] kinda
[17:47:37] there's no backends pooled and I can't find it in confd
[17:47:47] do you still have the output from the bottom of puppet-merge, when confd stuff happened?
[17:49:31] because the live confd data doesn't seem to know about the ldap-ro cluster at all
[17:49:32] here's the whole output:
[17:49:34] https://www.irccloud.com/pastebin/LokYvGok/
[17:49:51] not the agent run, the puppet-merge on the puppetmaster
[17:50:07] oh, ok
[17:50:09] after it syncs up the actual puppet stuff, at the bottom of the output it deals with creating new confd keys, etc
[17:50:30] I don't think I have that anymore
[17:50:39] I'll log in and force a manual confd sync
[17:50:43] I think I know how to do that...
[17:51:17] that did quite a bit! Must've missed it before
[17:51:33] ok
[17:51:54] "pool" didn't return any error before? in any case try again now
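
A hedged sketch of the restart/pool/verify sequence being worked through here. "pool" is the wrapper mentioned above; the confctl selector fields, the pybal unit name, and the ldapsearch target are illustrative assumptions, not the exact values used on these hosts.

    # On the backup LVS host first: restart pybal so it reads the new
    # service definition, then check its recent logs (unit name assumed).
    sudo systemctl restart pybal
    sudo journalctl -u pybal --since '5 min ago'
    # On each LDAP replica: mark the host pooled (the conftool wrapper
    # mentioned above).
    sudo pool
    # From a host with conftool access: confirm the new cluster exists and
    # the replicas are pooled (selector fields are illustrative).
    sudo confctl select 'dc=eqiad,cluster=ldap-ro' get
    # Back on the LVS host: the realservers should now appear behind the
    # service IP.
    sudo ipvsadm -L -n
    # And a quick client-side check against the new endpoint; the hostname
    # is a placeholder, and querying the root DSE avoids needing a base DN.
    ldapsearch -x -H ldaps://ldap-ro-placeholder.eqiad.wmnet -b '' -s base
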
[17:52:31] it has more to say this time
[17:52:38] https://www.irccloud.com/pastebin/IePBN5cT/
[17:52:40] seems promising
[17:52:51] ah there we go
[17:52:56] yep, the service works now
[17:53:18] ok, doing lvs1002
[17:55:51] andrewbogott: try connecting to the new service IP?
[17:56:06] (with an ldap client I guess)
[17:56:31] ldapvi (my ldap-editing tool of choice) seems happy with both ldap: and ldaps:
[17:56:41] I'm going to hack a VM to use that endpoint and see if I can still log in :)
[17:57:11] ok, let me know if something goes wrong, otherwise I assume we're done with this LVS part
[17:57:17] seems like!
[17:57:37] thanks! We may wind up adding a third ldap backend but I'm hoping two is enough.
[17:58:08] ok
[18:02:09] pointing a client VM to the new endpoint seems to work just fine
[20:33:57] this equinix/telia circuit saga is driving me crazy
[20:33:58] :)
[20:34:41] Definitely a downside of paying for the X-connect ourselves
[20:53:02] netops, Cloud-Services, Operations, cloud-services-team (Kanban): Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (GTirloni)
[22:10:48] FYI, ping offload is 100% up and running in codfw: https://grafana.wikimedia.org/d/000000513/ping-offload
[23:07:42] nice :)
[23:15:47] * jynus is disappointed the new service is not called pingoid
[23:16:20] Is it written in node?
[23:24:25] any idea what's up with jenkins on https://gerrit.wikimedia.org/r/c/operations/puppet/+/498264 ?
[23:40:13] XioNoX:
[23:40:14] 23:21:41 Class 'icinga::monitor::traffic' is already defined at /srv/workspace/puppet/modules/icinga/manifests/monitor/traffic.pp:4; cannot redefine at /srv/workspace/puppet/modules/icinga/manifests/monitor/traffic.pp:10
[23:40:31] it took me several minutes of staring to find that in the middle of the log, but it's there heh
[23:40:41] bblack: thx, I also asked on another channel
[23:40:51] I missed that line and saw the tox errors lower down
[23:43:09] all green now
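
On that Jenkins failure: the error points at the same manifest declaring icinga::monitor::traffic twice (lines 4 and 10 of traffic.pp), which typically comes from a block duplicated during a rebase or copy-paste. A quick local sanity check before re-pushing, as a sketch:

    # Two matches here would mean the class header (and likely its body)
    # was duplicated within the file.
    grep -n '^class ' modules/icinga/manifests/monitor/traffic.pp
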