[05:34:52] Traffic, Operations, Patch-For-Review: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (Vgutierrez) After some discussion with upstream developers, https://github.com/apache/trafficserver/pull/5888 has been submitted and it's been included in https://gerrit.wiki...
[05:35:46] Traffic, Operations, Patch-For-Review: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (Vgutierrez) p:Triage→Normal
[08:16:28] netops, Operations: Review switches ACL to connect from tools-bastion to dbproxy1018 - https://phabricator.wikimedia.org/T231418 (Marostegui)
[08:16:48] netops, Operations: Review switches ACL to connect from tools-bastion to dbproxy1018 - https://phabricator.wikimedia.org/T231418 (Marostegui)
[08:16:54] netops, Operations: Review switches ACL to connect from tools-bastion to dbproxy1018 - https://phabricator.wikimedia.org/T231418 (Marostegui) p:Triage→Normal
[08:22:38] _joe_: hi! Did you create the cert for docker-registry.discovery.wmnet with puppet-ecdsacert?
[08:23:01] we ran into https://phabricator.wikimedia.org/T231388 yesterday due to the fact that docker-registry.wikimedia.org is missing from SAN in the certificates
[08:23:31] I can re-create the certificates with cergen, or we can follow whatever procedure you find best, but SAN needs to be updated
[08:23:46] maybe that was handled by Fabian
[08:24:51] my bet is that it was created with the utils/create_ecdsa_cert script
[08:29:46] <_joe_> yes we can blame fabian for that, yay!
[08:30:08] <_joe_> ema: please recreate it, yes. I think it was created using the script directly, yes
[08:30:35] <_joe_> (I'm mostly off today but should be responsive to further pings)
[08:33:41] i see the news from Reddit is actually true. you added wikipedia.org to Brave rewards
[08:33:56] pretty big for fundraising potentially
[08:34:07] and lots of positive comments
[08:34:23] _joe_: thank you, re-creating it with cergen then
[08:34:36] just saw in git log of the DNS repo when looking for other stuff
[08:35:21] and yea, zero.wikipedia.org is gone, so no need to redirect wikipediazero.org
[08:35:31] re: https://gerrit.wikimedia.org/r/c/operations/dns/+/532879
[08:36:57] vgutierrez: ^
[08:37:08] hmmm
[08:37:19] wikipediazero.org is an alias for wikimediafoundation.org though
[08:37:41] and we still own the domain... I could park it instead of redirecting it
[08:39:59] hmm, i think i would say park it and then we ping James
[08:40:09] and then Chuck to drop it :p
[08:40:33] but really the way that is easier should be fine
[08:42:05] Wikimedia-Apache-configuration, Commons, SDC General, WikibaseMediaInfo, and 5 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (Lucas_Werkmeister_WMDE) > Also statement URIs - like https://www.wikidata.org/entity/statement/L40053-5aa77d7a-4c9e-ba1c-255b-3c...
[08:43:14] sometimes I wish there was a cumin alias for a group of hosts, then I find that the alias is there, and I am happy
[08:43:18] today is one of those days
[08:43:31] lol
[08:47:37] alias pink_unicorns ?
[08:49:02] ema: instead of puppetizing ports.conf we could also use puppet to unload the httpd ssl/tls modules. after all ports.conf already has "IfModule" around Listen 443
[08:49:25] mutante: +1
[08:50:39] ack, using httpd::mod_conf is a bit cleaner
[09:03:20] not even needed, just stopped loading it and then can manually remove/restart https://gerrit.wikimedia.org/r/c/operations/puppet/+/532948
[09:07:01] ema: miscweb1001 now has envoy listening on 443, not httpd anymore
[09:07:17] a2dismod ssl and restarting both services
[09:08:19] mutante: \o/
[09:08:48] i will change miscweb2001 to match it and not use 444
[09:11:21] thank you
[09:27:34] Traffic, Operations: STL file downloads result in ERR_SPDY_PROTOCOL_ERROR - https://phabricator.wikimedia.org/T231422 (Gilles)
[09:33:23] Traffic, Operations: cergen fails signing CSR - https://phabricator.wikimedia.org/T231423 (ema)
[09:33:29] Traffic, Operations: cergen fails signing CSR - https://phabricator.wikimedia.org/T231423 (ema) p:Triage→High
[09:37:22] Traffic, Operations, serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (ema) A proper fix for this issue is blocked on cergen bug T231423. I am going to disable TLS between ATS and eqiad's docker-registry for the time being. cp1075 is also in eqiad, so...
[09:41:01] Traffic, Operations: Cannot download STL files due to network error - https://phabricator.wikimedia.org/T231422 (Gilles)
[09:41:35] <_joe_> ema: I think it has to do with the move to puppet 5, possibly
[09:42:03] <_joe_> seems like the ruby util that does some horrible monkey-patching of puppet internals wasn't updated to match the upgrade
[09:43:14] Domains, Traffic, DNS, Operations, Patch-For-Review: Could not reach wikipedia from domain wikipedia.fi - https://phabricator.wikimedia.org/T230470 (Vgutierrez) Open→Resolved a:Vgutierrez https is happy now after adding wikipedia.fi as part of non-canonical-redirect-6 in https://g...
[09:43:33] _joe_: when did the upgrade happen? I've created a cert for grafana with cergen on August 15 and that worked fine
[09:44:18] Domains, Traffic, DNS, Operations, Patch-For-Review: Could not reach wikipedia from domain wikipedia.fi - https://phabricator.wikimedia.org/T230470 (Vgutierrez) oh, it also works for https://wikipedia.fi: ` willikins:~ vgutierrez$ curl https://wikipedia.fi -o /dev/null -v 2>&1 |fgrep -i Locat...
[09:46:14] <_joe_> ema: I am not sure it has, hence the "possibly" :P
[09:46:25] <_joe_> and no, it has not, apparently
[09:47:45] _joe_: I see that we're currently using registry2001, any reason to use that instead of registry1001?
[09:48:52] <_joe_> yes, swift replication gives a new meaning to "eventually" consistent
[09:50:22] godog: I'm getting issues trying to render the last 24 hours of data for the ncredir overview dashboard: https://grafana.wikimedia.org/d/zCYRtYvWz/ncredir-overview?orgId=1&from=now-24h&to=now
[09:51:19] vgutierrez: yeah, if you hover on the red exclamation point it says why
[09:51:44] too many samples?
[09:52:31] indeed, probably best to turn that query into a recording rule, I'm also checking the samples limit we have now
[09:52:54] ack
[09:53:05] --query.max-samples=10000000
[09:54:21] default is 50M btw, iirc we lowered it when we had some queries/dashboards OOMing prometheus
[11:28:34] Traffic, Operations, Patch-For-Review: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (Vgutierrez) Open→Resolved a:Vgutierrez
[11:36:57] Traffic, Operations, Patch-For-Review: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594 (Vgutierrez) Open→Resolved
[11:36:59] Traffic, Operations: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez)
[11:39:02] Traffic, Operations: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[11:39:14] Traffic, Operations: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez) p:Triage→Normal
[12:24:45] re: brave.com, that was Partnerships working on that, we just provided the DNS tech support, I just didn't want to say anything since they often aren't very public with their work, but I guess the news is out on reddit :)
[12:55:18] ema: dunno what is up with cergen there, mind if I poke? Can I try to gen the cert like you do without breaking anything?
[12:55:32] ottomata: hi! :)
[12:55:53] ottomata: feel free to poke around, that would be great. You can find my yaml file under /tmp on puppetmaster1001
[12:55:57] danke
[12:56:04] thank you!
[12:59:04] ottomata: I've left puppet disabled on docker_registry_ha::registry hosts as a precaution
[13:00:09] ottomata: but feel free to re-enable it if you need
[13:07:19] _joe_: it seems that most of the functionalities of role/profile::docker::registry have been moved to docker_registry_ha, and I don't see docker::registry being used anymore. Should we get rid of it or is there a reason to keep it around?
[13:10:25] ema: do you know if there is an easy way to see some puppet master (and api server?) logs? journalctl -u puppet-master doesn't have anything
[13:11:35] ottomata: maybe under /var/log/apache2/puppet* ?
[13:12:14] ah ha, great, thank you
[13:12:32] np! They seem to be busy logfiles, wow
[13:12:36] yeah!
[13:27:25] ah, well, now it's not only eqiad and codfw! Origin servers need to accept connections from any cache node
[13:39:25] Domains, Traffic, DNS, Operations: Could not reach wikipedia from domain wikipedia.fi - https://phabricator.wikimedia.org/T230470 (Zache) seems to work, thanks
[13:41:15] grr, still not sure what is going on. puppetmaster seems to accept the CSR, but then does nothing with it...
[13:42:14] ottomata: if it helps, I did manage to create new certificates on August 15
[13:43:17] ema: which certificates?
[13:43:57] ottomata: modules/secret/secrets/certificates/certificate.manifests.d/grafana.certs.yaml
[13:44:02] k
[13:44:08] that worked fine
[13:44:12] that was the one i was comparing with too
[13:45:18] the CSR is identical except for the expected details
[13:45:45] yar, it is hard to debug this stuff, i have to set up a puppetmaster elsewhere etc...
[13:45:46] yargh
[13:46:13] * ottomata wonders again why we use the puppet ca for these certs :p
[13:51:29] ema: if i would edit only the 2 files trafficserver/backend.yaml and cache/text_ats.yaml and not touch cache/text.yaml would that make a difference right now? is cache/text.yaml still used for miscweb at this point?
[13:52:09] i think i want to merge another change that just moves one service at a time, like "switch over racktables" and then check that and then move on later
[13:53:57] mutante: cache/text.yaml is used by all text nodes not yet converted to ATS (all of them except for cp1075 at this point)
[13:54:25] so yeah for now you need to change that too I'm afraid
[13:54:41] alright
[13:56:48] ottomata: AIUI we use the puppet CA because it exists and we've yet to come to consensus on an alternative
[13:57:51] aye, but we could use a local custom root ca for cergen certs...probably would be simpler and less hacky
[13:58:01] buuut ¯\_(ツ)_/¯
[14:05:39] ottomata: there was a PKI session at the SRE summit ;) I've an action item to push mgmt to allocate time for it :-)
[14:05:59] :)
[14:07:31] ema: i'd like to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/533024 instead of that previous one, using https and discovery record (even in text.yaml?)
[14:08:18] mutante: yeah I've seen that. Replied on the CR. Varnish resolves hostnames once and for all at VCL load time, so DNS discovery makes little sense
[14:08:41] other than that the patch looks good!
[14:08:51] i saw, amending and thanks!
[14:09:47] ema: for ATS, I am more happy to have integration.wikimedia.org switched to it
[14:10:07] it has some reasonable traffic from tech savvy people so that might help :]
[14:11:22] hashar: thanks! We're switching all sites to ATS in one go though, not one by one :)
[14:11:48] (one cache server at a time of course, so we'll be in a mixed varnish backend/ats backend situation for a bit)
[14:15:55] ah ha hm.
[14:15:58] 'Server Error: this master is not a CA'
[14:16:30] wat
[14:16:39] that is when trying to get a cert via the http api
[14:17:05] e.g. curl https://puppetmaster1001.eqiad.wmnet:8140/production/certificate/grafana.discovery.wmnet
[14:17:27] how about the others?
[14:17:34] other certs? it seems to be the same
[14:17:42] oh other puppetmasters?
[14:18:13] yeah other puppetmasters
[14:18:35] same from 2001
[14:18:46] i think that's the only other 'frontend'
[14:21:18] interesting
[14:21:50] iirc the http API i'm using is pretty legacy
[14:21:54] no 'v3' or anything
[14:22:02] just //
[14:22:15] but, don't know what would have changed
[14:22:16] i am adding the new director miscweb to cp* servers and switching racktables over. in the past when there were separate cache::misc servers i would force puppet run on them all but now that it's just cache::text i do nothing and let puppet handle it and wait 30 min
[14:23:19] (besides manually running it on about 2 random ones)
[14:23:28] mutante: you still can force the puppet run if you like, up to you!
[14:24:18] at least i see a bug in cergen, this should be reported earlier when attempting to send the CSR :)
[14:24:19] ottomata: so, I think some things changed puppetmaster-side in the apache configuration recently
[14:24:23] oh?
[14:25:10] ema: ok, optional is good. as long as it's not important to push config change at once
[14:25:11] j.bond added some stuff around canarying a buster puppetmaster, and also had to do some monkeypatching of puppet json in lua
[14:25:40] "monkeypatching of puppet json in lua"
[14:25:53] let's say that out loud a few times
[14:26:34] https://tools.wmflabs.org/bash/help
[14:26:51] that's bash as in IRC quotes, not as in shell
[14:28:43] i think it might be related. is 8140 a proxy port?
[14:28:51] 8141 responds with a cert.
[14:29:45] yeah 8140 is a proxy port
[14:29:54] I also see this
[14:30:04] 50-puppetmaster1001-eqiad-wmnet.conf: RewriteRule ^/puppet-ca/.*$ https://puppetmaster1001.eqiad.wmnet:8141%{REQUEST_URI} [P,QSA]
[14:30:17] yeah, but the request isn't to /puppet-ca
[14:30:19] so I suspect we're just missing a rewrite rule to always round /production/certificate paths to 8141 on the proper-master
[14:30:22] yeah.
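Editor's sketch of the routing problem discussed above, offered as a rough model and not as the deployed Apache configuration. It assumes the only CA-related rule on the 8140 frontend is the `RewriteRule ^/puppet-ca/.*$` quoted at 14:30:04, and shows why the legacy request from 14:17:05 never reaches port 8141. The `LEGACY_CERT` pattern, the `route` helper, and the `legacy_rule` switch are hypothetical names introduced here for illustration; the actual fix that was merged is not shown in the log.

```python
import re

# Pattern copied from the RewriteRule quoted in the log (14:30:04).
PUPPET_CA = re.compile(r"^/puppet-ca/.*$")

# Hypothetical pattern for legacy "/<environment>/certificate/..." paths.
# This is an assumption for illustration, not the rule that was deployed.
LEGACY_CERT = re.compile(r"^/[^/]+/certificate/.*$")

FRONTEND_PORT = 8140  # proxy port, per the log
CA_PORT = 8141        # port that "responds with a cert", per the log


def route(path: str, legacy_rule: bool = False) -> int:
    """Return the port a request path would be handled on in this model."""
    if PUPPET_CA.match(path):
        return CA_PORT
    if legacy_rule and LEGACY_CERT.match(path):
        return CA_PORT
    return FRONTEND_PORT


legacy_path = "/production/certificate/grafana.discovery.wmnet"

# "/puppet-ca/v1/certificate/ca" is an illustrative path under /puppet-ca/.
print(route("/puppet-ca/v1/certificate/ca"))   # 8141
print(route(legacy_path))                      # 8140, stays on the frontend
print(route(legacy_path, legacy_rule=True))    # 8141 once a legacy rule exists
```

In this model the second call corresponds to the 'this master is not a CA' error: the request is answered by the frontend because no rule forwards it to the CA port.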
[14:30:24] i think so too
[14:30:24] s/round/route/
[14:30:34] ema: ok, running on cp-text_eqiad and realizing it's not more than 7 actually, nice
[14:31:33] yup, cp1075 recently left the band :)
[14:31:50] hm but
[14:31:51] ProxyPassMatch ^/([^/]+/certificate.*)$ https://<%= @master %>:8141
[14:31:54] is first
[14:31:55] hopefully there will be a reunion tour eventually!
[14:32:06] oh that is in the template?
[14:32:14] still grokiung
[14:32:17] groking*
[14:32:28] ottomata: where do you see that? I don't see it in the config files on puppetmaster1001
[14:33:01] just in the erb
[14:33:06] reading as to why not on 1001
[14:33:19] web-frontend.conf.erb
[14:33:48] I get 0 matches in the puppet repo for git grep 'ProxyPassMatch.*certificate'
[14:33:50] what am I missing
[14:34:48] ottomata: you're looking at an old revision
[14:34:57] that line was removed in d88038d9
[14:40:00] sorry was afk for a min
[14:40:02] yeah was pulling
[14:40:14] so sounds like that got lost
[14:40:28] when switching to rewrite rules?
[14:41:04] yup
[14:41:26] actually all of the legacy api ProxyPassMatch rules did
[14:42:17] yeah, I'm less keen on restoring all of them
[14:42:26] ok
[14:51:18] ottomata: ok the change is live in eqiad, can you try again?
[14:51:42] k
[14:52:49] cdanis: ema it worked!
[14:53:12] \o/
[14:53:16] ema: i added your docker-registry.discovery.wmnet cert manifest, and generated your certs
[14:53:20] they are uncommitted in /srv/private
[14:53:25] can you commit them?
[14:53:46] i did modify the manifest while debugging, but only to make the first alt name match the common name
[14:53:52] if this is a problem we can regenerate.
[14:54:05] thanks for the help ema!
[14:55:24] well thank you ottomata! committing the certs
[14:59:25] ottomata: out of curiosity how hard would it be to make cergen use the /puppet-ca API?
[15:00:16] i'm not sure, it has been a while so I don't fully remember why we couldn't use the new one. but there was some reason
[15:00:17] i remember trying
[15:00:28] it might have to do with wanting to use SANs and other extensions?
[15:00:39] we already do some pretty hacky stuff to be able to use those with puppet ca
[15:00:56] i.e.
[15:00:57] https://github.com/wikimedia/cergen/blob/master/ext/puppet-sign-cert
[15:00:58] ahh, okay
[15:01:15] oh wow
[15:01:30] wow is the correct reaction.
[15:02:11] sorry, I'm volansing now, but is 'PuppetCertifcateSignError' our typo or theirs?
[15:02:16] LOL
[15:02:43] hahaha
[15:02:44] class PuppetCertifcateSignError < StandardError
[15:03:06] blame may show my name, but that doesn't mean I have to take credit :p
[15:03:10] :D
[15:03:27] i'm making a patch to get a better error about this bug anyway, will fix.
[15:03:40] (won't generate new .deb right now tho...will happen next time we need to)
[15:11:17] cdanis: thanks very much!
[15:11:36] 💖
[15:12:51] had to switch to gnome-terminal to see it, worth every millisecond <3
[15:13:26] some day i'll convince people other than _joe_ of the joy of weechat+glowing-bear
[15:13:49] 💩
[15:13:52] O:)
[15:14:11] vgutierrez: 💔
[15:23:04] cdanis: mind doing a quick review?
[15:23:05] https://gerrit.wikimedia.org/r/c/cergen/+/533040/
[15:26:21] +1'd
[15:27:18] ty
[15:31:30] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[15:31:59] racktables is now moved to stretch and using https with envoy on the backend
[15:32:07] the other 2 on krypton i'll do tomorrow
[15:34:00] hashar: ema: i think hashar's comment would be more about the contint1001 box on https://phabricator.wikimedia.org/T210411 i guess
[15:34:26] doc and integration and envoy on contint1001
[15:35:10] doc.wikimedia.org is now on doc1001.eqiad.wmnet :]
[15:41:53] ah :) that ticket should be updated i guess, they all supposed to start talking https
[17:46:59] Traffic, netops, Operations: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (ayounsi) Open→Resolved Confirmed working.
[18:57:01] netops, Operations: Review switches ACL to connect from tools-bastion to dbproxy1018 - https://phabricator.wikimedia.org/T231418 (ayounsi) `lang=diff [edit firewall family inet filter labs-instance-in4 term labsdb-tcp4 from destination-address] 10.64.37.28/32 { ... } + 10.64.37.27/32; [ed...
[18:58:18] netops, Operations: Review switches ACL to connect from tools-bastion to dbproxy1018 - https://phabricator.wikimedia.org/T231418 (ayounsi) Open→Resolved a:ayounsi
[21:56:13] Traffic, Commons, MediaWiki-File-management, Operations, and 2 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (CDanis) Fortunately this occurrence seems to be quite rare. On each Swift frontend host, I: * grepped today's logs for GETs that resul...
[22:14:37] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Jclark-ctr) Host moved cmjohnson. advised to move out if row B in to 10G racks leave 1 in B ` host...
[22:30:14] netops, Analytics, Analytics-Kanban, Operations, ops-eqiad: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[23:52:44] Traffic, Operations: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 (Mholloway)
[23:56:27] Traffic, Operations: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 (Mholloway)