[08:54:21] Traffic, Operations, Performance-Team, observability, Patch-For-Review: Ensure graphs used by Performance account for Varnish-to-ATS migration - https://phabricator.wikimedia.org/T233474 (ema) >>! In T233474#5761873, @Krinkle wrote: > It looks like the Apache Backend-Timing graphs dried up. >...
[09:06:22] netops, Operations: fastnetmon misreports attack type and protocol - https://phabricator.wikimedia.org/T241374 (ayounsi) a:ayounsi Opened https://github.com/pavel-odintsov/fastnetmon/issues/787
[09:09:17] Traffic, Operations, Performance-Team, SRE-swift-storage, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (fgiunchedi) Analytics now publishes media access stats, might be useful to drive some/all thumbnail cleanup: https://wikitech.wiki...
[09:25:52] Traffic, Operations, observability: cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593 (Volans) @ema maybe could be related to NUMA utilization? Having a quick look at `numastat` (both `-n` and `-m`) there is a general imbalance between the t...
[09:28:30] volans: NUMA hits or misses shouldn't cause an allocation failure though, right?
[09:29:20] slower performance, sure, but not SIGABRT
[09:31:57] ema: yes that was my first thought too hence the initial question ;) I was wondering if by any chance it could have tried to force-allocate on a specific node.
[09:32:08] ah!
[09:33:07] I didn't know you could do that (specifically ask to allocate on a given node, without fallback)
[09:34:21] numa_alloc_onnode should allow to do that, but you have to do it specifically
[09:34:44] and doesn't seems a great practice in general, but if you're doing low level memory management
[09:35:52] probably a far fetched theory
[10:02:27] by any chance anyone has few spare minutes to double check this one?
[10:02:27] https://gerrit.wikimedia.org/r/c/operations/dns/+/554080
[10:02:46] those are missing records that came out from the cross-check of data generated from netbox and current dns repo
[10:24:25] volans: never heard of wmf7[0-9]{3}, what are those?
[10:24:49] asset tags
[10:25:02] every device we have has one
[10:27:55] so I assume we'll never use 7 for a DC number? :)
[10:28:30] no, we'll never use wmf* for hostnames
[10:28:39] asset tags are only for mgmt
[10:28:43] each host has 2 mgmt records
[10:28:57] $hostname.mgmt.$dc.wmnet and $asset.mgmt.$dc.wmnet
[10:29:12] asset tags are assigned upon arrival in the DC
[10:58:46] Traffic, Operations, observability: cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593 (fgiunchedi) I checked all "memory free" metrics as reported by node-exporter for the varnish case and indeed the numbers match, i.e. the kernel was report...
[11:29:24] Traffic, DNS, Mail, Operations, Patch-For-Review: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Vgutierrez) Open→Resolved a:Vgutierrez `$ host -t mx wikimedia.community wikimedia.community mail is handled by 10 mx1001.wi...
[11:42:38] Acme-chief, Traffic, Operations: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (Vgutierrez) @Volans I think we could close this task already as everything seems healthy on acmechief1001
[11:43:09] Acme-chief, Traffic, Operations: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386 (Volans) Open→Resolved a:Volans Indeed, done :)
[13:29:19] netops, Operations: Routinator RSYNC errors - https://phabricator.wikimedia.org/T240817 (ayounsi) Resolved→Open Opened https://github.com/NLnetLabs/routinator/issues/267 upstream. As `rsync://localhost/repo/` has been alerting for 10 days now. And there is not much we can do.
[13:36:30] Traffic, Operations: Track TLS related ATS metrics in prometheus - https://phabricator.wikimedia.org/T231286 (Vgutierrez) Open→Resolved We currently have 5 SSL/TLS related panels in the [[ https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown | ATS instance drilldown ]]
[13:36:32] Traffic, Operations: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594 (Vgutierrez)
[13:44:20] Traffic, Elasticsearch, Operations, Discovery-Search (Current work), and 2 others: Sustained periods (2-4h) of bad latency on production-search eqiad - https://phabricator.wikimedia.org/T241421 (dcausse)
[14:14:00] volans: lgtm!
[14:22:47] jbond42: the ATS section of https://wikitech.wikimedia.org/wiki/User:Jbond/Encryption is not entirely correct, ATS servers don't use TLS to talk with each other (they do not talk among themselves at all!) but rather to encrypt HTTP requests to the origin servers (aka applayer)
[14:24:22] jbond42: I've updated the relevant section, thanks for doing all that!
[14:24:54] ema: great thanks was just about to have a stab at updating myself
[14:26:49] ema: thx
[14:46:43] Traffic, Operations, Patch-For-Review: two failing upload VTC tests - https://phabricator.wikimedia.org/T241653 (ema) p:Triage→Normal
[15:40:12] XioNoX: do you happen to know if there's any information in SNMP or elsewhere about the number of PAUSE frames sent/received by our switches?
[15:42:34] I'm not on my laptop anymore, but I'd guess so, at least via ssh/netconf as it's exposed on the cli. But probably snmp too
[15:48:52] cdanis: https://apps.juniper.net/mib-explorer/search.jsp#object=dot3InPauseFrames&product=Junos%20OS&release=19.4R1
[15:54:29] thanks!
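Note on the mgmt naming scheme described at 10:28 ("each host has 2 mgmt records"): a minimal sketch of how the two names could be derived, assuming the pattern `$hostname.mgmt.$dc.wmnet` and `$asset.mgmt.$dc.wmnet` exactly as quoted in the log. The function name `mgmt_records`, the asset tag `wmf7123`, and the four-digit asset tag check are hypothetical and only illustrate the idea; this is not the actual netbox or DNS repo tooling.

```python
import re

# Assumption: asset tags are "wmf" followed by four digits, inferred only from
# the wmf7[0-9]{3} pattern mentioned in the log. The real format may be broader.
ASSET_TAG_RE = re.compile(r"^wmf[0-9]{4}$")


def mgmt_records(hostname, asset_tag, dc):
    """Return the two mgmt names for a host: one keyed by hostname, one by asset tag.

    Hypothetical helper for illustration only.
    """
    tag = asset_tag.lower()
    if not ASSET_TAG_RE.match(tag):
        raise ValueError(f"unexpected asset tag format: {asset_tag!r}")
    return (f"{hostname}.mgmt.{dc}.wmnet", f"{tag}.mgmt.{dc}.wmnet")


if __name__ == "__main__":
    # cp1083 and eqiad appear in the log; the asset tag is made up.
    for name in mgmt_records("cp1083", "wmf7123", "eqiad"):
        print(name)
```

Since asset tags are assigned on arrival in the DC and hostnames never use the `wmf*` prefix, the two names should not collide under this scheme.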
[18:43:38] HTTPS, Traffic, Operations, Voice & Tone: sec-warning page is Wikipedia-specific and dubiously worded - https://phabricator.wikimedia.org/T241656 (Dzahn) This ticket seems to be a duplicate of T241309.
[18:44:39] Traffic, Operations: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (Dzahn) See also T241656 which might be a duplicate.
[19:58:39] Traffic, Operations: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (jcrespo)
[19:58:41] HTTPS, Traffic, Operations, Voice & Tone: sec-warning page is Wikipedia-specific and dubiously worded - https://phabricator.wikimedia.org/T241656 (jcrespo)
[19:59:29] Traffic, Operations: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (jcrespo) Feel free to edit the body with a complete list of changes, but I believe a single task would be enough to track all improvements requested.
[21:23:21] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Ottomata) PSA! I've noticed that usages of envoyproxy for service TLS termination uses unencrypted private key files, but the cergen certificate manifests for these are c...
[21:40:20] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Ottomata) Ah hm, I also just realized the public cert is manually committed to public puppet in files/ssl. Should we maybe just change sslcert::certificate to be smart(...