[03:50:26] netops, Operations: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (CDanis)
[04:02:52] netops, Operations: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (CDanis) @wiki_willy Any chance this is somehow related to {T229243}?
[04:09:02] netops, Operations: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) @CDanis - I just checked with our 3rd party contractor and he says it shouldn't have been affected from the work he was doing. Although, he was working in the racks from 1:45-4:00 UTC, and If it...
[04:14:25] netops, Operations: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (CDanis) Still alerting, unfortunately.
[04:22:36] netops, Operations: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) Alright, I'm asking him to go back to the datacenter to check all the connections on mr1-eqsin.
[05:34:07] netops, Operations: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) a:wiki_willy
[05:37:28] netops, Operations, ops-eqsin: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) Open→Resolved Cable between mr1-eqsin p4 <---> asw-0603-eqsin p23 looks like it accidentally got bumped by the contractor during the server install. Called him back and he...
[05:37:47] netops, Operations, ops-eqsin: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (Marostegui) We just got all the recoveries: ` [07:23:15] <+icinga-wm> RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 16%, RTA = 231.87 ms [07:23:17] <+icinga-wm> RECOVERY - Host c...
[05:38:29] netops, Operations, ops-eqsin: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (Marostegui) Ha! @wiki_willy was faster!
[05:40:13] netops, Operations, ops-eqsin: mr1-eqsin down since ~01:50 UTC - https://phabricator.wikimedia.org/T229778 (wiki_willy) @Marostegui - Ha, we tied. =)
[08:36:28] Acme-chief, Horizon: Unable to create DNS zone traffic.wmflabs.org. in Horizon - https://phabricator.wikimedia.org/T229783 (Vgutierrez)
[08:36:56] Acme-chief, Horizon, cloud-services-team (Kanban): Unable to create DNS zone traffic.wmflabs.org. in Horizon - https://phabricator.wikimedia.org/T229783 (aborrero) p:Triage→Normal a:aborrero
[08:38:23] Acme-chief, Horizon, cloud-services-team (Kanban): Unable to create DNS zone traffic.wmflabs.org. in Horizon - https://phabricator.wikimedia.org/T229783 (Krenair) It's expected that zones can't be created in this manner (there's a script that ops can run to do it properly), but there shouldn't be an...
[08:39:35] Acme-chief, Horizon, cloud-services-team (Kanban): Unable to create DNS zone traffic.wmflabs.org. in Horizon - https://phabricator.wikimedia.org/T229783 (Krenair) One thing to check is whether there is a proxy by that name. If there is, you should be careful trying to make a zone in its place
[08:40:46] Acme-chief, Horizon, cloud-services-team (Kanban): Unable to create DNS zone traffic.wmflabs.org. in Horizon - https://phabricator.wikimedia.org/T229783 (aborrero) For the record: https://wikitech.wikimedia.org/wiki/Help:Horizon_FAQ#Can_I_create_a_new_DNS_domain/zone_for_my_project,_or_records_under_...
[08:40:59] Acme-chief, Horizon, cloud-services-team (Kanban): Unable to create DNS zone traffic.wmflabs.org. in Horizon - https://phabricator.wikimedia.org/T229783 (Vgutierrez) before trying to do anything I've checked with a simple DNS query that traffic.wmflabs.org. is available right now: ` vgutierrez$ host...
[08:45:28] Acme-chief, Horizon, cloud-services-team (Kanban): Unable to create DNS zone traffic.wmflabs.org. in Horizon - https://phabricator.wikimedia.org/T229783 (aborrero) Open→Resolved This should be done now: ` root@cloudcontrol1004:~# wmcs-makedomain --project traffic --domain traffic.wmflabs.org...
[08:56:30] Acme-chief, Horizon, cloud-services-team (Kanban): Create a service account to manage traffic.wmflabs.org. from acme-chief - https://phabricator.wikimedia.org/T229786 (Vgutierrez)
[09:30:13] Traffic, Discovery-Search, Elasticsearch, Operations: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (Vgutierrez) I've seen the same behaviour configuring the ncredir LVS service as it's using two ports (80/443). Same happens wi...
[10:03:54] Acme-chief, Horizon, cloud-services-team (Kanban): Create a service account to manage traffic.wmflabs.org. from acme-chief - https://phabricator.wikimedia.org/T229786 (aborrero) p:Triage→Normal For the WMCS team meeting, needs discussion: how to better handle this. I'm not aware of the curren...
[10:33:51] ema: if you have a minute I've a couple of CRs for you ;)
[10:33:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/527170 and https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/527173
[10:53:56] volans: do we really have replacements for all those scripts? I didn't know :)
[10:54:44] ema: AFAICT yes there is a cookbook for each of them. Just the reimage ones are still pending on me
[10:55:52] volans: happy times, +1
[10:56:52] thanks!
[10:58:48] Traffic, netops, Operations, IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (jbond) Thanks brandon, I'll take a look at removing the SLAAC addresses from puppet this week. One of them, at least, was added by me and was what led...
[12:35:16] Traffic, Discovery-Search, Elasticsearch, Operations: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (Mathew.onipe) About cloudelastic resolving to icinga1001, I had jbond help me see where cloudelastic.wikimedia.org resol...
[12:53:22] Traffic, Discovery-Search, Elasticsearch, Operations: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (BBlack) So, yes, cloudelastic is correct in DNS for normal lookups. The issue is that the icinga check defines the virtual ho...
[12:56:33] Traffic, Discovery-Search, Elasticsearch, Operations: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (Mathew.onipe) @BBlack yea yea.. I've missed your musings on complex systems. Thanks. I will make a patch
[12:56:53] Traffic, Discovery-Search, Elasticsearch, Operations: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (Mathew.onipe) p:Triage→Normal
[13:26:08] Traffic, Discovery-Search, Elasticsearch, Operations: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (Mathew.onipe) Sadly, I don't think this will work as the host param will not be unique and icinga does not seem to handle that...
[13:48:41] vgutierrez hi, around?
[13:48:48] hi paladox
[13:48:53] what do you need?
[13:49:20] vgutierrez i'm wondering if you would be able to help add gerrit-replica to the dns/acme please? (we are renaming gerrit-slave to gerrit-replica)
[13:49:47] sure
[13:49:53] thanks!
[13:50:17] do you have a phab task in place for that change?
[13:50:20] (just to link the commits)
[13:50:36] vgutierrez oh nope, though i can create one if you want?
[13:50:59] yup, go ahead please
[13:52:18] Traffic, Gerrit, Operations, Release-Engineering-Team: Rename gerrit-slave to gerrit-replica - https://phabricator.wikimedia.org/T229822 (Paladox)
[13:52:19] vgutierrez https://phabricator.wikimedia.org/T229822
[13:53:24] thx
[13:56:57] paladox: https://gerrit.wikimedia.org/r/#/c/528138/ --> that will get a certificate valid for gerrit/gerrit-slave/gerrit-replica, and after everything is moved to gerrit-replica, we can get rid of gerrit-slave
[13:57:20] thanks! (i guess the dns change has to go first for that acme change to work?)
[13:57:48] nope
[13:58:02] acme-chief uses dns-01 challenges to validate the hostnames
[13:58:06] oh
[13:58:12] great!
[13:58:21] so as long as acme-chief is able to create the proper TXT record to make LE happy, it's enough
[13:58:45] :)
[13:59:54] I've got this dns change vgutierrez https://gerrit.wikimedia.org/r/#/c/operations/dns/+/527657/
[14:00:18] i'm just not sure about https://gerrit.wikimedia.org/r/#/c/operations/dns/+/527657/3/templates/0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa
[14:00:57] hmm maybe I'd go for the two-step approach
[14:01:08] 1. add gerrit-replica, 2. Remove gerrit-slave
[14:01:51] yeh
[14:02:11] vgutierrez fixed the change so it only adds gerrit-replica :)
[14:02:19] cool
[14:02:21] i'm just not sure about https://gerrit.wikimedia.org/r/#/c/operations/dns/+/527657/4/templates/0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa
[14:02:26] why not?
[14:02:44] whether i can use 7.0.1.0.3.5.1.0.0.8.0.0.8.0.2.0 for both gerrit-slave and gerrit-replica
[14:03:57] AFAIK it shouldn't be a problem
[14:04:06] ah, ok
[14:04:07] thanks!
[14:04:23] but it looks like our dns checks aren't happy
[14:04:47] yeh, fixed
[14:05:36] cool :)
[14:06:30] let me get the acme-chief CR merged and check that everything goes as expected and then I'll merge the DNS one
[14:06:59] yay, thanks! :)
[14:07:45] oh, don't forget to add Bug: T229822 in the dns commit message
[14:07:46] T229822: Rename gerrit-slave to gerrit-replica - https://phabricator.wikimedia.org/T229822
[14:07:53] ah
[14:07:54] * paladox does
[14:08:19] done
[14:11:05] X509v3 Subject Alternative Name:
[14:11:05] DNS:gerrit-replica.wikimedia.org, DNS:gerrit-slave.wikimedia.org, DNS:gerrit.wikimedia.org
[14:11:18] the new cert will be received in cobalt/gerrit2001 on the next puppet run
[14:11:21] :)
[14:11:26] yay, thank you!
[14:16:02] $ host gerrit-replica.wikimedia.org
[14:16:02] gerrit-replica.wikimedia.org has address 208.80.153.107
[14:16:02] gerrit-replica.wikimedia.org has IPv6 address 2620::860:4:208:80:153:107
[14:16:40] paladox: all good, ping me when you are ready to clean up the old DNS records / SNI
[14:16:48] ok, thanks!!
[15:58:39] hello sukhe!
[15:59:03] hi ema!
[16:04:34] \o/ welcome sukhe :D
[16:18:00] vgutierrez: thanks :D
[17:57:13] hello traffic folks!
[17:58:02] I need some help with https://phabricator.wikimedia.org/T229861
[18:01:13] curl -6 https://cloudelastic.wikimedia.org:8643 returns a 'no route to host' error
[18:01:30] this is done from deploy1001
[18:03:43] I suspect https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/hieradata/common/lvs/configuration.yaml#179-180 should be named `cloudelastic4` and `cloudelastic6` resp.
[18:07:09] i added a comment there with my suspicion, basically others are defined with something not-ipv6, such as textlb6 `2620:0:860::ed1a::1`. That's not actually an ipv6 address, i'm not sure what it is
[18:07:21] so i suspect cloudelastic also needs that not-ipv6 thing
[18:08:19] 2620:0:861:ed1a::1 is an ip address, it's routed to a loopback on [at least some of] the lvs servers
[18:10:18] however the other addresses in the 2620:0:861:1:: prefix don't appear to be lvs addresses so that is the thing that looks odd to me as well
[18:19:10] onimisionipe: i can't confirm it's "no route to host"
[18:19:19] what i actually get (now) is "connection refused"
[18:19:26] cloudelastic.wikimedia.org has IPv6 address 2620:0:861:1:208:80:154:241
[18:19:49] mutante: hmm, i still get no route to host from mwmaint1002
[18:19:49] so that looks to me like the service is not listening on that address
[18:20:14] mutante: also logging into the individual cloudelastic* machines, we can see they don't have an lo:LVS configured for the ipv6 address, only their own ipv6 address
[18:21:06] i wonder if it matters where it's requested from then...
[18:23:01] mutante: where did you see connection refused? i'm getting icmp destination unreachable from the router
[18:23:29] jbond42: from my laptop directly
[18:25:01] strange, i get !H from 2001:470:0:1c0::2 if i trace it from mine
[18:26:49] if it's too much trouble we don't actually need ipv6. It just seemed like ipv6 has been around a few decades now and should generally be supported
[18:28:07] no no we should have ipv6, i'm just not familiar with the setup, i'm sure it's likely a simple fix to someone who knows it well
[18:30:47] yeah
[18:30:57] the DNS there is wrong, I think, for the public cloudelastic IPv6
[18:31:18] (maybe the host side too, didn't look)
[18:31:30] note that the other load-balanced services use names like:
[18:31:36] ncredir-lb.codfw.wikimedia.org.
[18:31:46] always including the -lb and a data center
[18:31:47] they need "normal" IPv6 as well, for their e.g. cloudelastic1001.wikimedia.org hostnames, using add_ip6_mapped and in DNS for those names, which I think they have IIRC
[18:31:56] but this one is just called "cloudelastic.wikimedia.org"
[18:32:08] but separately, cloudelastic.wikimedia.org's public IPv6 should be in an LVS range, yes
[18:32:15] mutante: that part's fine
[18:33:27] bblack: yes it looks like cloudelastic1001.wikimedia.org has the correct ipv6 addr (2620:0:861:1:208:80:154:8) and correct dns
[18:33:46] both in the DNS repo and the puppet repo ( hieradata/common/lvs/configuration.yaml ), what's been configured as 2620:0:861:1:208:80:154:241 should instead be
[18:34:33] i'm guessing the lvs range (at least for eqiad) is 2620:0:861:ed1a::?
[18:34:42] I can do some patches quick, since I'm familiar with the layout
[18:34:42] yes
[18:35:07] bblack: please cc me on the changes as i'm curious
[18:35:09] ok
[18:35:18] thanks
[18:42:08] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/528215/ + https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/528216/
[18:42:36] I'm sure there will be a few minor disturbances in the force while rolling those out, but I can try to quickly run the agent after on all the affected things/hosts.
[18:43:39] historically the LVS "subnet" for each datacenter was split into 4 chunks, at this point it's really more like 2 in functional practice
[18:44:07] and we've never really mapped 1:1 between the octets of the IPv4 and IPv6 for whatever reasons, but I did for the last octet (visually, not numerically) in this case
[18:44:44] but the important thing is it has to be in the LVS subnet for the site, and in the correct half of it (the high-traffic1/text half or the high-traffic2/upload half) for all the router<->LVS stuff to work out right.
[19:42:08] ebernhar|lunch: IPv6 should be working now, does in my basic testing!
[20:26:09] bblack: indeed, thanks!