[05:13:28] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[05:35:06] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[05:50:39] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[06:07:40] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[06:26:35] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[06:33:18] Acme-chief, Traffic, Operations: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez) >>! In T234131#5587004, @MoritzMuehlenhoff wrote: > @Vgutierrez I created a 2.6.1-3+deb10u2, it's in my home on acmechief1001. Let's deploy this on acmechief* hosts on Monday befor...
[06:56:49] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[07:03:25] <_joe_> hi, what is the status of substituting varnish backend with ATS on the text cluster?
[07:04:04] <_joe_> rationale being I want to add TLS termination to all services on k8s
[07:04:49] you need ema for that
[07:05:11] but IIRC he replaced varnish-be with ats-be on cp4027 and cp4028 last week
[07:05:32] so right now we have 3 cache nodes with ats-be on text (1077, 4027 and 4028)
[07:05:50] <_joe_> ok so at least a couple months out
[07:05:59] hey!
[07:06:12] yeah so I'm currently working on text@ulsfo
[07:06:25] morning ema :)
[07:06:26] <_joe_> ema: my main interest is text@eqiad/codfw
[07:06:44] _joe_: those are the last clusters in the list, see https://phabricator.wikimedia.org/T227432
[07:06:55] <_joe_> to give you context - my goal is to add TLS termination this quarter to all services on k8s
[07:07:15] <_joe_> but I won't also add LVS endpoints for that until all their consumers can switch
[07:08:25] so, all nodes use directly the origins in eqiad/codfw, including the already converted cache nodes in ulsfo
[07:08:54] so we really can't remove TLS support from the origins
[07:09:06] <_joe_> I wasn't proposing this
[07:09:13] <_joe_> but take for instance blubberoid
[07:09:25] <_joe_> I'm not sure if it supports TLS at the moment
[07:09:36] <_joe_> but it should, as it's (AIUI) exposed publicly
[07:09:50] it should indeed
[07:10:32] <_joe_> ema: you had a task with all backend services listed, right?
[07:10:42] Traffic, Operations: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[07:10:55] _joe_: yes, https://phabricator.wikimedia.org/T210411
[07:11:10] (looks like blubberoid is missing on that task)
[07:11:22] <_joe_> ema: I'm not sure it's publicly exposed
[07:11:25] <_joe_> but I think so
[07:12:14] right, so we need to add TLS support to it, and it's great to hear you're planning on doing so :)
[07:12:14] <_joe_> it is indeed
[07:12:46] we'll need the LVS endpoints as well for ATS to use it though
[07:12:51] <_joe_> ema: yeah but in general I'd like to do a switchover from http to https, while until we have varnish in place, I'll need to /add/ it
[07:13:12] <_joe_> ema: exactly why I wanted to avoid duplicating the effort, but oh well
[07:14:37] _joe_: yeah I'm afraid we're gonna need both HTTP and HTTPS services in LVS during the transition
[07:15:01] <_joe_> yes, at least for things that pertain to varnish/ats
[07:15:12] <_joe_> for
services not exposed publicly, we can go the easier route
[07:15:23] _joe_: so if I understand this right, all k8s services are gonna get TLS support in one shot?
[07:16:39] is there a way to get the list of services currently on k8s?
[07:17:57] vgutierrez: hi! How's ats-tls going? :)
[07:18:21] so.. upload's pretty awesome.. eqsin and ulsfo are 100% migrated to it
[07:18:34] right now I'm moving one node per DC on text
[07:18:47] <_joe_> ema: sure, but no, they will be added one by one
[07:18:59] <_joe_> that's this quarter's goal of mine
[07:19:05] <_joe_> at least part of it
[07:19:23] next step would be tuning the timeouts across layers to handle the lack of buffering requests in ats-tls VS nginx
[07:19:41] <_joe_> vgutierrez: patches welcome
[07:19:44] <_joe_> :D
[07:20:13] _joe_: nice :)
[07:22:24] _joe_: how do I find out which services are on kubernetes?
[07:22:45] <_joe_> ema: one way is to look at operations/deployment-charts
[07:23:07] <_joe_> the helmfile.d directory has the config for all services deployed in staging and prod
[07:23:21] <_joe_> another way is to list services from the kube clusters ofc
[07:24:25] <_joe_> kubectl get services --all-namespaces
[07:24:28] are there any services defined in (eg) eqiad but not in codfw?
[07:24:33] <_joe_> (with the right credential files)
[07:24:40] <_joe_> not that I know of
[07:25:15] ok, I was wondering why there's a distinction between eqiad and codfw in https://github.com/wikimedia/operations-deployment-charts/tree/master/helmfile.d/services
[07:26:32] <_joe_> those are the two clusters
[07:26:37] <_joe_> they are managed separately
[07:26:54] <_joe_> but yes the fs hierarchy could be better
[07:28:42] <_joe_> the reason why it's so duplicated is one of the helmfile plugins we use has a bug with symlinks that is still unresolved
[07:29:36] from a quick check on that list, it seems that only blubberoid and cxserver are behind varnish/ats (and both are currently accessed via plain HTTP)
[07:30:41] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[07:37:16] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4029.ulsfo.wmnet'] ` The log can be found in `/var/log/wm...
[07:39:52] <_joe_> ema: yes, and they're also called internally from other subsystems
[07:53:34] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[08:06:48] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[08:13:20] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4029.ulsfo.wmnet'] ` and were **ALL** successful.
[08:17:31] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[08:28:21] https://blog.cloudflare.com/experiment-with-http-3-using-nginx-and-quiche/
[08:28:56] enabling QUIC is kinda tricky...
[08:29:16] legit UDP traffic on port 443 is scary
[08:31:52] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[08:33:52] on the software side.. ATS already ships QUIC support on their master branch :)
[08:34:25] nice
[08:43:42] Traffic, Operations, Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[12:28:01] anyone up for reviewing an exciting LVS change?
[12:28:21] akosiaris: perhaps you? :)
[12:28:23] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/544856/
[12:31:22] oh, I forgot to define the service /o\
[12:32:41] or maybe that's gone now? we used to have service definitions under conftool-data/service/, but now I can't find those
[12:33:19] ah yes, see d9f8348
[12:34:06] how do we specify default values now? For example, setting the default weight
[12:36:09] this would perhaps explain why ats-be hosts now all get weight=0 instead of weight=100, which used to be the default
[12:36:30] _joe_: your presence is requested ^
[12:36:38] not sure what happens now with the default weight
[12:38:14] <_joe_> the service definitions are gone, yes
[12:39:14] <_joe_> and you will have to define the weight manually for now, the nodes get added in status=inactive so they have no effect until you change the weight manually
[12:40:11] ok, good to know!
[12:40:59] <_joe_> heh sorry Chris and I were supposed to send an email to ops@ when we did the release but we forgot (or maybe I said I'd do it)
[12:43:38] not a problem at all, I thought we (I) somehow screwed something up during the various refactorings :)
[13:23:21] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (ema)
[13:25:42] <_joe_> ema: we should also remove the non-ssl services soon-ish wherever possible
[13:26:05] _joe_: that's true
[13:26:20] let me create a task so we don't forget
[13:26:25] <_joe_> but I guess the ones you're migrating all interact with varnish
[13:26:28] <_joe_> great
[13:27:47] they do, yes, but after the ATS migration is done we should really get rid of the plain-http services (at least those that don't have naughty internal clients using plain http)
[13:34:51] Traffic, Operations: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (ema)
[13:34:59] Traffic, Operations: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (ema) p:Triage→Normal
[13:35:47] so I guess we could start by removing swift :)
[13:36:17] what can go wrong
[13:37:36] that'll teach it to swift! *g*
[13:51:55] _joe_: except for icinga and registry[12]00[12], I don't see any clients connecting to swift via plain http
[13:52:29] <_joe_> ema: not sure if the registries can use TLS, I'll check
[13:52:42] _joe_: thanks!
[13:54:36] <_joe_> ema: are you using envoy for all the new services you add TLS termination to, correct?
[13:54:57] _joe_: correct
[13:55:21] <_joe_> perfect thanks :)
[14:49:35] Traffic, Commons, MediaWiki-File-management, Multimedia, and 7 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (fgiunchedi)
[15:58:35] Traffic, Operations, ops-ulsfo: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (RobH) 1) reseat the hot swap nic, should reset 2) unplug the ps1, leaving ps2 powered, to reset the nic 3) reset with the reset button, will have to reconfigure the entire pdu (non-ideal, these ha...
[15:59:27] Traffic, Operations, ops-ulsfo: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (RobH) Mainly I'd like @bblack buy in on a date/time for me to do this work, since option 2 requires #traffic approval imo. (It would cause them work if any of the systems fail.)
[16:34:08] Traffic, Operations, Pybal: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (ema) >>! In T224570#5471336, @MoritzMuehlenhoff wrote: > More generally speaking: Are the pybal-test* servers still used for testing/developing? Is there a specific reason they are in prod...
[16:54:03] Traffic, DNS, Operations, Toolforge, cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (RobH) >>! In T235303#5575619, @RobH wrote: >>>! In T235303#5575089, @Andrew wrote: >> @robh shou...
[16:57:16] Traffic, DNS, Operations, Toolforge, cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (RobH) Andrew CC'd. > Hanna, > Normally we email Doneva for this, but her auto-reply advises she...
[17:00:55] Traffic, DNS, Operations, cloud-services-team (Kanban): Update authoratiative nameservers for the wmcloud.org domain to point to Designate - https://phabricator.wikimedia.org/T235630 (Andrew) ` Thank you Rob and Hanna! While we're at it, we'd also like the 'wmcloud.org' domain changed in the sam...
[17:33:47] Traffic, DNS, Operations, cloud-services-team (Kanban): Update authoratiative nameservers for the wmcloud.org domain to point to Designate - https://phabricator.wikimedia.org/T235630 (Andrew) ` Hi Rob and Andrew, The DNS has been updated for the two domains to the two nameservers listed below...
[17:34:12] Traffic, DNS, Operations, Toolforge, cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (Andrew) ` Hi Rob and Andrew, The DNS has been updated for the two domains to the two nameser...
[17:42:31] Traffic, DNS, Operations, cloud-services-team (Kanban): Update authoratiative nameservers for the wmcloud.org domain to point to Designate - https://phabricator.wikimedia.org/T235630 (Andrew) Open→Resolved
[17:42:41] Traffic, DNS, Operations, Toolforge, cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (Andrew) Open→Resolved
[18:23:35] Traffic, Operations, ops-ulsfo: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (RobH) Ok, I'm onsite and going to attempt the following on ps1-22-ulsfo: 1) unplug all the data/serial/network connections (leave all power in place) 2) unseat and re-seat the NIC which may power...
[18:34:52] Traffic, Operations, ops-ulsfo: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (RobH) Summary of work: * confirmed in docs that the pro2 will indeed allow hot swap of its network card (the older pro1 will not) * scheduled work with @bblack for #traffic cooperation (no impact...
[18:35:01] Traffic, Operations, ops-ulsfo: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (RobH)
[18:36:01] Traffic, Operations, ops-ulsfo: ps1-22-ulsfo & ps1-23-ulsfo - https://phabricator.wikimedia.org/T235911 (RobH) Open→Resolved
[21:02:27] Traffic, Operations, Performance-Team: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (Krinkle)
[21:13:54] Traffic, Operations, Performance-Team: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (ori) Still doesn't work for me.
[23:09:37] Traffic, Operations, Performance-Team: Can't load flame or coal graphs on performance.wikimedia.org (HTTP 502) - https://phabricator.wikimedia.org/T236102 (colewhite) p:Triage→Normal