[03:51:23] Traffic, Operations: Enforce POST size limit on ats-tls - https://phabricator.wikimedia.org/T236755 (Vgutierrez) Open→Resolved p:Triage→Normal
[03:51:29] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[03:56:33] Traffic, Operations: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (Vgutierrez) Open→Stalled I'm marking this task as stalled, it will be resolved as soon as T231627 is completed
[03:56:36] Acme-chief, Traffic, Operations: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (Vgutierrez)
[04:50:17] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[05:05:32] Traffic, Operations, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[09:13:29] Traffic, Operations: ats-be on the text cluster is experiencing broken connections - https://phabricator.wikimedia.org/T236988 (MoritzMuehlenhoff) p:Triage→Normal
[10:59:57] netops, Operations: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (jcrespo) This is currently affecting backups from analytics1029 and an-master1002 FYI T236406#5630631 CC #Analytics @Ottomata @elukey .
[11:18:14] Traffic, Operations: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (ema)
[11:20:41] moritzm: it seems that the issue I've reported here on Friday about cp5011 is quite widespread ^
[11:29:36] netops, Operations, Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (elukey) Thanks a lot Jaime! @akosiaris if the change looks good I can update cr1/cr2 manually (or I can use homer if already available!)
[11:31:13] ema: from reading systemd-special(7), lldpd.service should use network-online.target if we want it to postpone startup until the network interface is fully set up
[11:32:18] not sure if that's a good idea per se
[11:54:26] Traffic, Operations, Patch-For-Review, Puppet, User-jbond: Serve volatile uri from local site - https://phabricator.wikimedia.org/T235427 (jbond) The volatile URI has been moved both puppetmasters
[11:58:03] Traffic, Operations: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (MoritzMuehlenhoff) The lldpd unit only depends on network.target, but network-online.target, per systemd-special(7) lldpd.service only the latter will postpone startup until the...
[12:19:43] netops, Operations, Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (akosiaris) Left a minor comment, namely let's keep helium around for a bit more. > (or I can use homer if already available!) I don't know tbh. I filed the task explicitly bec...
[12:33:35] netops, Operations, Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (Volans) While you wait for @ayounsi I can maybe fill some gap. Homer is already a thing and Arzhel is using and testing it, but it doesn't have yet proper documentation for a w...
[12:41:13] moritzm: the problem is that pretty much everything starting around that time fails to access the network, not only lldpd
[12:41:26] and that includes nrpe, which does use network-online.target
[12:42:12] well not really, nagios-nrpe-server.service has After=nss-lookup.target in its unit
[12:42:35] but I imagine that nss-lookup requires network-online
[12:46:55] Traffic, Analytics, Analytics-Kanban, Operations, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (JAllemandou) Thanks for the fast loop over format @Nuria and @BBlack. Indeed having a single field named `TLS` formatted as describe...
[12:48:05] yeah, that might be intentional:
[12:48:08] looking at the docs, network-online.target is for services which strictly need a configured network connection to start
[12:48:58] and lldpd starts just fine without the network interfaces up (but it can only really do its job when the network is up)
[12:49:24] so I guess this is an explicit call by upstream, the systemd unit is shipped by them, not Debian
[12:49:39] or maybe it's an oversight, probably best to report upstream?
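For reference, the ordering change being weighed here could be expressed as a systemd drop-in rather than by editing the unit shipped upstream. This is only a minimal sketch: the drop-in file name is hypothetical, nothing in this log shows such a change being applied, and whether lldpd should wait for network-online.target at all is the open question in the discussion.

```ini
# Hypothetical drop-in: /etc/systemd/system/lldpd.service.d/network-online.conf
# Assumption: we want lldpd.service to start only after the network is
# considered configured, instead of merely after network.target.
[Unit]
# Wants= pulls network-online.target into the boot transaction;
# After= only orders lldpd behind it. The two are normally used together.
Wants=network-online.target
After=network-online.target
```

A drop-in like this takes effect after `systemctl daemon-reload`. It would not help if the link drops again after network-online.target has already been reached, since the ordering dependency is satisfied at that point.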
[12:51:24] interesting
[12:52:00] https://github.com/vincentbernat/lldpd/issues/241 is a little inconclusive
[12:52:26] turns out the Debian maintainer is also upstream
[12:53:04] network-online.target is probably more correct
[12:53:52] because, without the network being up, no client can query lldpd anyway, so while this maybe saves a few milliseconds of startup time, it doesn't seem really correct
[12:53:59] right
[12:57:23] so my understanding so far of the issue is that at boot: (1) network link goes up (2) network card gets configured (3) network-online.target is reached (4) link goes down (5) services configured to start After=network-online.target fail to start properly (6) link comes back up
[12:58:53] and this happens on systems with different network cards, eg db1075 (Tigon3) and cp3059 (BCM57412)
[12:59:05] so perhaps it'
[12:59:30] so perhaps it's something related to the switches?
[13:02:12] Traffic, netops, Operations: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (ema)
[13:19:22] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5012.eqsin.wmnet'] ` The log can be found in `/var/log/wm...
[13:38:33] netops, Operations, Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (elukey) To keep archives happy: cr1-eqiad: ` elukey@re0.cr1-eqiad# show | compare [edit firewall family inet filter analytics-in4 term bacula from destination-address]...
[13:43:48] netops, Operations, Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (akosiaris) Open→Resolved a:akosiaris ` akosiaris@an-master1002:~$ telnet -4 backup1001.eqiad.wmnet 9103 Trying 10.64.48.36...
Connected to backup1001.eqiad.wmnet. E...
[13:46:07] netops, Operations, Patch-For-Review: Update router ACLs for newer bacula hosts - https://phabricator.wikimedia.org/T237016 (jcrespo) For the record, the other host affected: ` analytics1029:~$ telnet -4 backup1001.eqiad.wmnet 9103 Trying 10.64.48.36... Connected to backup1001.eqiad.wmnet `
[14:07:06] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5012.eqsin.wmnet'] ` The log can be found in `/var/log/wm...
[15:27:01] I don't like to be a nag, but icinga's all yellow from certs warnings for *.wikipedia.org, is that safe to ack and to link to a phab task ?
[15:30:00] is for them to say, but if you ack, make sure it is only for a limited time, as otherwise it won't be noticed if it gets red (?)
[15:37:45] jynus: yeah that's one of the reasons I tend to prefer downtime as it expires
[15:38:29] ah, sure, sorry- I understood you as ack- but ack can also be configured to expire !
[15:39:19] 0:-)
[15:39:24] ah yeah you are right, yup expiring ack will do too
[15:50:38] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5012.eqsin.wmnet'] ` Of which those **FAILED**: ` ['cp5012.eqsin.wmnet'] `
[15:55:20] Traffic, Analytics, Analytics-Kanban, Operations, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Nuria) @BBlack: we can take a stab at modifying code on VCL if you can CR since that needs to happen before the varnishkafka changes
[16:05:33] Traffic, Analytics, Analytics-Kanban, Operations, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Ottomata) Hm, another idea... > I don't think sub-objects or arrays are supported by varnishkafka. We'll have to set each one as a...
[16:05:43] Traffic, netops, Operations: Network unreachable after network-online.target is brought up - https://phabricator.wikimedia.org/T237243 (MoritzMuehlenhoff) p:Triage→Normal
[16:30:21] Traffic, Analytics, Analytics-Kanban, Operations, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Ottomata) > - Changes the serialization code path we've been using to produce webrequest for years We discussed this in Analytics s...
[16:35:45] Traffic, Analytics, Analytics-Kanban, Operations, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (BBlack) Agreed, let's not go down that road right here (because we have a burning need for this data pronto), but side note to keep...
[17:59:06] so cp5012 (reimaged today) does not boot properly.
The installation seems to have gone fine, however (lost in between hundreds of uninteresting kernel messages):
[17:59:10] [ 2.917721] Unpacking initramfs...
[17:59:10] [ 2.922806] Initramfs unpacking failed: junk in compressed archive
[17:59:19] fun!
[17:59:57] I'm gonna try find out more tomorrow, but if anyone has ideas please give me a shout
[18:08:33] ema: I might retry the imaging just to see what happens
[18:08:41] (with console open)
[18:23:03] ema: anything needed from me on https://phabricator.wikimedia.org/T237243 ?
[18:54:58] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (BBlack) Patches above look sane? I went ahead and shortened the key names down to the minimum to prevent bloat at these layers. We can...
[19:16:19] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Ottomata) Hm, would be ok with me, but likely whatever we choose we'll be stuck with forever. I tend to prefer descriptive names in gene...
[19:25:46] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (BBlack) >>! In T233661#5633090, @Ottomata wrote: > Hm, would be ok with me, but likely whatever we choose we'll be stuck with forever. I...
[19:26:47] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (BBlack) Nevermind, I see it in the gerrit comments
[19:31:00] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Ottomata) > I'm not sure if there's limitations on overall length of the varnishkafka inputs/outputs. Shouldn't be from varnishkafka, but...
[19:51:04] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (JAllemandou) I checked for message length in one day of webrequest, and we top at 4916 bytes. I think Kafka will be fine as per message-s...
[20:49:21] Traffic, Commons, MediaWiki-File-management, Multimedia, and 8 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (Gilles)
[21:58:38] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Nuria) @BBlack: once we deploy the VCL/varnish-kafka changes we need to change our refine pipeline to read these values, when we deploy t...