[06:56:18] volans: let me know if you want to do es1012 decommission with me or you have enough data to troubleshoot yesterday's errors
[07:45:46] marostegui: I think there are still errors, I tried yesterday evening and it failed for an analytics node :(
[07:47:49] :-(
[08:34:59] elukey: the analytics node probably deserved it
[08:35:26] kormat: for sure
[08:35:58] it ran for a long time java and kerberos, not sure if there is a worse destiny for bare metal hw
[08:39:17] marostegui: yeah it's a bit weird if you could wait a bit we'll try to fix it
[08:39:26] volans: sure
[08:39:39] it manifests itself as the same error we fixed on Tue. but must be different :)
[09:31:36] I am a little confused, why we have two global ipv4/ipv6 addresses on gerrit1001/gerrit2001 ?
[09:32:12] ahhh "scope global deprecated"
[09:32:38] TIL 'deprecated'
[09:35:02] elukey: ?
[09:36:24] XioNoX: on gerrit1001, if I run `ip addr`, I see
[09:36:26] inet 208.80.154.136/26 brd 208.80.154.191 scope global eno1
[09:36:26] valid_lft forever preferred_lft forever
[09:36:26] inet 208.80.154.137/32 scope global eno1
[09:36:26] valid_lft forever preferred_lft forever
[09:36:26] inet6 2620:0:861:2:208:80:154:137/128 scope global deprecated
[09:36:29] valid_lft forever preferred_lft 0sec
[09:36:31] inet6 2620:0:861:2:208:80:154:136/64 scope global
[09:36:34] valid_lft 2591997sec preferred_lft 604797sec
[09:36:55] elukey@gerrit1001:~$ dig gerrit.wikimedia.org +short
[09:36:55] 208.80.154.137
[09:37:31] elukey: it is set on alias addresses so that packets aren't mistakenly sourced from that interface
[09:37:33] https://github.com/wikimedia/puppet/blob/production/modules/interface/manifests/alias.pp#L21-L22
[09:37:39] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/gerrit.pp#L25-L28
[09:38:10] jbond42: ah, cute
[09:38:28] there is an explanation under https://tools.ietf.org/html/rfc4862#section-2
[09:39:32] jbond42: the confusing part for me was related to the fact that DNS records for gerrit1001/gerrit2001 point to .136 for example, but gerrit.w.o returns .137
[09:39:59] elukey: that makes sense
[09:40:01] I prev added firewall rules for the analytics filters and didn't add .137, so people are still not able to checkout from say stat1008
[09:40:05] I have heard about Gerrit?
[09:40:23] nono me asking questions about ips hashar, nothing broken :)
[09:40:42] I am not even sure why we have services IP for the Gerrit service to be honest
[09:40:44] kormat: it doesn't to me yet :)
[09:41:01] elukey: i think that is correct most traffic you want to come from the host address however the gerrit process is likely bound directly to the alias so gerrit packets will be sourced from there (i'm making some assumptions now)
[09:41:05] elukey: clients only ever access gerrit as gerrit.wm.o. so they only care/know about the virtual IP
[09:41:08] or why the hosts have a public IP given the publicly exposed services are on the public IP service IP
[09:41:19] but I guess that is a side track / slipping path :]
[09:42:59] jbond42: yep you are right! tcp6 0 0 208.80.154.137:29418 :::* LISTEN 4333/java
[09:44:15] kormat: yep makes more sense now, but it is still a little weird for me
[09:44:20] can we have gerrit behind a LVS IP?
[09:44:36] (I know the answer)
[09:47:10] kormat: just to understand more the use case (I am slow I know), what is the benefit in this case, as opposed to say gerrit binding on the preferred address .136 (the one related to gerrit1001.w.o) ?
[09:47:41] elukey: in that case you'd have to update the DNS entry for gerrit.wm.o when you change which gerrit node is active
[09:47:54] which leads to a period of instability as the DNS change propagates
[09:48:57] kormat: mmm ok but then I'd expect the same addr listed on gerrit2001, and I don't find it
[09:49:06] XioNoX: yeah we should be able to move it behind LVS IP, I don't see why it would not be possible? There might even be a task about it
[09:49:17] elukey: i imagine that the puppet role adds/removes the service IP based on which node is master
[09:49:26] there is one also to drop the 29418 port for Gerrit ssh and change to port 22 (the default for ssh)
[09:49:51] kormat: all right that was the missing part, thanks, now the dots are joined, thanks :)
[09:49:54] hashar: the reason is that we can't do LVS cross-DC
[09:50:04] elukey: np :)
[09:50:12] so the LVS would not help here, unless we have multiple hosts within the same DC to failover to
[09:50:20] kormat: elukey: the vip/deprecated address on gerrit is in a different subnet so it would need a DNS update
[09:50:59] i suspect the reason is more likely along the lines of what hashar said i.e. so that gerrit can listen on vip:ssh and we can access the server via the host:ssh
[09:51:08] volans: but even with a single host, we could have the service IP carried on our LVS frontend rather than on the gerritXXXX hosts? Maybe that will align it to the standard of how things are usually done for service ip
[09:51:24] even if there is a single host
[09:51:33] that depends if gerrit should depend on LVS or not, the usual circular dependency stuff
[09:52:02] jbond42: in the above paste from elukey it looks to be the same subnet
[09:52:04] one sure thing is that if the current setup is adhoc / an exception, I am more than happy to have it aligned with whatever is the standard
[09:52:08] or at least make it simpler
[09:52:19] volans: it would help to get rid of a VIP on a host subnet ;)
[09:52:35] totally and manually managed dns records that are exception now
[09:52:42] and marked specifically in netbox to achieve that
[09:52:54] so I'm in for whatever solution help us get rid of this :)
[09:53:02] kormat: gerrit2001: 2620:0:860:4: gerrit1001: 2620:0:861:2:, if they were the same address we would need to update routing and BGP announcements to failover the prefix
[09:54:09] https://phabricator.wikimedia.org/T165631 is the relevant task
[09:54:18] ahh. you meant the VIP on gerrit1001 isn't in the same subnet as the main IP on gerrit2001
[09:54:37] yes sorry just noticed that confusion
[09:55:52] paravoid: thx, was looking for one
[09:56:07] it's not impossible to use the same VIP in both DCs, but it does require a bunch of careful routing
[09:56:37] no, we don't do that
[09:56:45] paravoid: sure, and i get why :)
[09:57:40] kormat: it would require tunneling
[09:58:24] XioNoX: really? can't it be done by adding a static host route ~everywhere? (note: i'm not saying this is a _good_ idea ;)
[10:00:24] kormat: if there is only 1 backend, sure :)
[10:00:59] XioNoX: you simply write a bash script to update the static route everywhere when you change backend. easy peasy!
[10:02:04] kormat: I'm glad you're in the DB team
[10:02:07] ;)
[10:02:08] :D
[10:03:02] s/bash script/cookbook/
[10:05:18] anyway I am in favor of cleaning up the current gerrit IP config via T165631 (thx Faidon to have found that task)
[10:05:19] T165631: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631
[10:06:05] although I have no idea what amount of work is required on the infra or what the exact solution will be, if there is an interest to move forward you can count on me to help with the Gerrit context
[10:07:52] https://netbox.wikimedia.org/dcim/devices/2138/ "Purchase date April 8, 2019" so maybe in 5 years?
[10:08:11] what does that have to do with this :)
[10:08:39] paravoid: if we get rid of one of the gerrit nodes, the whole issue goes away! ;)
[10:09:00] * kormat should consider a career in management
[10:09:04] clearly!
[10:09:18] I guess we could do codfw first
[10:09:22] then fail over?
[10:09:39] XioNoX: decom gerrit in codfw, failover, then decom it in eqiad?
[10:09:48] haha
[10:09:52] i.. should stop "contributing"
[10:10:07] or keep the public IPs to start with and put a LVS in front of the existing public host?
[10:10:20] add a new LVS IP, add the realserver IP to the gerrit node, have gerrit listen on it, change DNS, wait for $TTL, deprecate the in-subnet IP
[10:10:28] see if any issues, then convert one to be a private backend
[10:10:39] yeah, that's a good idea too
[10:10:55] depends how far we want to go yeah
[10:11:03] I guess the 80/20 rule applies here :)
[10:18:40] XioNoX: would the final setup be then LVS + private subnet IP in eqiad for gerrit1001, and same set up for 2001 in codfw? Then use DNS to point gerrit.w.o to eqiad or codfw
[10:19:14] elukey: yeah, to point to the public VIP
[10:19:22] LVS VIP
[10:19:40] yep yep got it thanks :)
[10:20:25] go for it
[10:21:04] XioNoX: kormat made me feel bad about routing so I had to demonstrate that even Analytics SREs are as good as Data Persistence ones at the end
[10:21:14] haha :D
[10:21:18] :D
[10:25:05] and why isn't lists.wikimedia.org behind LVS? :)
[10:25:20] paravoid: got a task for that one too?
[10:27:20] that one is more special
[10:27:28] it needs to originate outbound emails
[10:30:31] can't it originate it from the LVS VIP as well? as it should be configured on the loopback?
[10:31:55] it could originate packets, not connections, i.e. the reverse path of the connection wouldn't work
[10:33:24] imagine a SYN, src::51234 -> dst gmail:25, sent out directly; the SYN/ACK would land on the lvs box with src gmail:25, dst :51234
[10:34:13] if there is only 1 backend, the LVS can forward it to the one, no?
[10:34:14] 51234 being any random port, different for every flow
[10:35:03] and does it need to originate packets from the VIP or its public host IP is fine too?
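A toy model of the return-path problem described at 10:31 to 10:34 may help. It is only a sketch under stated assumptions: the load balancer is reduced to a lookup table of virtual services keyed by (IP, port), and the addresses, the `ToyLoadBalancer` class and the backend name are all hypothetical. It is not how LVS itself is implemented.

```python
from typing import Dict, NamedTuple, Optional, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


class Packet(NamedTuple):
    src: Endpoint
    dst: Endpoint
    flags: str


# Hypothetical documentation-range addresses, not real production IPs.
VIP = "198.51.100.10"       # service IP announced by the load balancer
REMOTE_MX = "203.0.113.25"  # some remote mail server


class ToyLoadBalancer:
    """Forwards only packets addressed to a configured (VIP, port) service."""

    def __init__(self, services: Dict[Endpoint, str]) -> None:
        self.services = services

    def route(self, pkt: Packet) -> Optional[str]:
        # No matching virtual service means the packet is not forwarded.
        return self.services.get(pkt.dst)


lb = ToyLoadBalancer({(VIP, 25): "lists-backend"})

# Inbound mail: a client connects to VIP:25, which is a configured service.
inbound = Packet(src=("192.0.2.7", 40000), dst=(VIP, 25), flags="SYN")

# Outbound mail: the backend sends a SYN directly, using the VIP as source
# and an ephemeral source port. The load balancer never sees this packet.
outbound_syn = Packet(src=(VIP, 51234), dst=(REMOTE_MX, 25), flags="SYN")

# The reply is addressed to VIP:51234 and therefore lands on the load balancer.
reply = Packet(src=outbound_syn.dst, dst=outbound_syn.src, flags="SYN/ACK")

print(lb.route(inbound))  # lists-backend
print(lb.route(reply))    # None
```

In this model the SYN/ACK is lost because the ephemeral port changes with every flow, so no fixed service entry can match it.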
[10:36:16] it has to originate them from /a/ public IP, that should be in various DNS records
[10:37:30] inbound and outbound mail on the same IP is easier, but not a requirement
[10:39:46] ok
[10:40:27] we don't have *that* many odd records: https://netbox.wikimedia.org/search/?q=Keep+manual+&obj_type=
[10:41:48] heh
[10:41:54] the bulk should be moved to LVS really
[10:42:45] let's do gerrit next week? should be simple I think?
[12:00:33] wasn't there something special with gerrit?
[12:00:42] i remember that being discussed at length years ago, but don't recall what the issue was
[12:48:25] well there's special ports and protocols
[12:48:44] also, it's not easy to push changes if we can't reach it
[12:50:21] but I think that's true of more things than we like to admit, anyways
[12:52:06] well a) indeed and b) push comes to shove, you can always ssh -L to the box
[14:06:27] moritzm: thanks for that long-needed update to pwstore, sounds like it's going to remove a bunch of headaches. w00t!
[14:06:41] +1
[14:06:43] 🎉
[14:08:50] I think many other people using it will share the keyserver pain, so will try to get this merged as well
[14:56:34] moritzm: isn't that newish keyserver at keys.openpgp.org?
[14:58:04] yes, that's the one running the new type of server
[14:58:27] your email says keys.openpgp.*net*
[14:59:01] and it's probably missing a link for more information at the end
[14:59:22] oh, will send a followup correcting this
[15:00:08] thanks
[15:05:41] hi all over the past $SOME_MONTHS i have been working on a spec_helper to enable people to write simple spec tests without having to understand too much about puppet-rspec, fixtures, dependencies etc etc and think i have finally got to that place and would welcome some input. the following is an example of how to test the pki role. this spec test will compile the entire pki role taking care of
[15:05:47] the ugly hacks and allow one to try and just ...
[15:05:50] ... create a small subset of test i.e. does puppet compile, is the pki service running (in the catalogue)
[15:05:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/642423/2/modules/role/spec/classes/pki_spec.rb
[15:06:57] As it compiles the entire catalogue it takes a bit longer to run and if we tested every role in this manner then CI could take a significant amount of time so that's something to keep in mind
[15:07:15] puppet: even when you win you lose
[15:07:21] :D lol
[15:22:44] moritzm: yeah that's an elegant solution, much appreciated
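Coming back to the `scope global deprecated` question from 09:32: the flag appears on an address whose preferred lifetime has run out (`preferred_lft 0sec` in the paste), which is why it is not picked as a source address. A minimal sketch for spotting such addresses in saved `ip addr` output follows. The helper name `deprecated_addresses` is made up for illustration, and the sample text is the paste from 09:36.

```python
import re

# Sample taken from the `ip addr` paste in the log above.
SAMPLE = """\
    inet 208.80.154.136/26 brd 208.80.154.191 scope global eno1
       valid_lft forever preferred_lft forever
    inet 208.80.154.137/32 scope global eno1
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:2:208:80:154:137/128 scope global deprecated
       valid_lft forever preferred_lft 0sec
    inet6 2620:0:861:2:208:80:154:136/64 scope global
       valid_lft 2591997sec preferred_lft 604797sec
"""

# Matches address lines such as "inet6 <addr>/<len> scope global deprecated".
ADDR_RE = re.compile(r"^\s*(inet6?)\s+(\S+)\s+(.*)$")


def deprecated_addresses(text):
    """Return the addresses whose line carries the 'deprecated' flag."""
    found = []
    for line in text.splitlines():
        match = ADDR_RE.match(line)
        if match and "deprecated" in match.group(3).split():
            found.append(match.group(2))
    return found


print(deprecated_addresses(SAMPLE))
```

On this sample only the IPv6 alias ending in `:137/128` is reported; the IPv4 alias `.137/32` is not flagged in the paste.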