[07:41:40] Acme-chief, Traffic, Operations: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez) The issue is also happening on acmechief-test1001, so I decided to hack acme-chief code a little bit to add [[ https://mg.pov.lt/objgraph/objgraph.html | objgraph ]] reports, I've...
[08:31:35] Acme-chief, Traffic, Operations: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez) I've hacked an ACMEChief subclass that only performs OCSP response checks/updates and it still leaks memory: `lang=python class OCSPChecker(ACMEChief): def certificate_manageme...
[08:55:46] Acme-chief, Traffic, Operations: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez) Using objgraph to track leaking objects doesn't show anything wrong either: ` dict 723 set 248 CTypeDescr 101 tuple 39 list 7 SignalDict 6 weakref 1 meth...
[09:31:57] Acme-chief, Traffic, Operations: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez) Using valgrind against a simplified version that only runs _fetch_ocsp_response() against `unified / rsa-2048` shows a leak on: `==25358== 8,888 (1,536 direct, 7,352 indirect) byte...
[10:44:03] Traffic, Operations, Performance-Team, Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (alaa_wmde) hey @Gilles We encountered an incident related to this change {T234183#5533440}. I suspect either there's somethin...
[10:47:48] Acme-chief, Traffic, Operations: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez) Reported to debian package maintainer on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941413
[10:58:17] Traffic, Operations, Performance-Team, Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (Gilles) Indeed, it seems like Varnish claims that the content is gzipped but it actually isn't. I think this is a Varnish bug...
[11:09:13] Traffic, Operations, Performance-Team, Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (Vgutierrez) @Gilles, yeah, with varnishadm ban like it's documented on https://wikitech.wikimedia.org/wiki/Varnish#One-off_purg...
[11:12:53] Traffic, Operations, Performance-Team, Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (Gilles) Yep, exactly. In fact you can go as far as only banning content-length between 150 and 860 (inclusive).
[11:22:10] netops, Operations: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (jbond) [[ https://www.ripe.net/publications/docs/ripe-580#recommendations | RIPE routing-WG recommends a suppress-value of 6000 ]] if we go to that we may want to also increase the reuse but i coul...
[11:51:55] Traffic, Operations, Performance-Team, Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (Gilles) @Vgutierrez purged SVGs whose content-length is > 100 and <= 899 in both cache layers for text and upload.
[11:56:11] Traffic, Operations, Performance-Team, Patch-For-Review: Enable gzip compression for interface icon SVGs served by MediaWiki - https://phabricator.wikimedia.org/T232615 (Vgutierrez) Actually I'm missing the backend layer on the upload cluster.. it's powered by ATS and the procedure is different
[11:58:56] netops, Operations, Puppet: Investigate improvements to how puppet manages interfaces - https://phabricator.wikimedia.org/T234207 (jbond)
[12:28:11] Traffic, Discovery, Operations, WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (WMDE-leszek)
[15:05:47] netops, Operations, Puppet: Investigate improvements to how puppet manages interfaces - https://phabricator.wikimedia.org/T234207 (akosiaris) p:Triage→Lowest
[16:29:59] Traffic, Analytics, Operations, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Milimetric) p:Normal→High
[18:40:50] XioNoX: cr2-esams had a quick flap on BFD status; all it said on the icinga page for the duration was 'CRIT: Down: 2' with no other details
[18:41:49] cdanis: I think one of our tunnels had some brief connectivity issue
[18:42:05] ah okay, BFD is used to check for (GRE?) tunnel liveness?
[18:42:24] cdanis: for OSPF, going inside the GRE tunnel
[18:42:29] ahh
[18:42:31] thanks
[18:43:03] cdanis: was that icinga message the full message or the short one?
[18:43:04] Traffic, Operations: Broken puppet on traffic-upload-stretch.traffic.eqiad.wmflabs and traffic-text-stretch.traffic.eqiad.wmflabs - https://phabricator.wikimedia.org/T234256 (Andrew)
[18:43:19] it was the full one visible on the web UI
[18:43:42] can't find any longer version visible in Icinga logs either
[18:45:34] cdanis: https://github.com/wikimedia/puppet/blob/133f5accded6c0f47c2dead4b814e7c1ec9e9de1/modules/nagios_common/files/check_commands/check_bfd.py#L59 should say the IP of the down neighbor on the long version (new line)
[18:46:03] I wonder if there was some change to our icinga configuration or something that results in it not appearing anywhere
[18:46:41] there could be a bug of course, on my script or somewhere else
[18:47:10] or I could put everything on the one line
[18:49:20] I don't see how the script could not print it
[18:49:36] so it must be that icinga stopped caring about the next line for some reason
[18:49:41] https://usercontent.irccloud-cdn.com/file/UrxTTSvU/Screenshot_2019-09-30%20Extended%20Information.png
[18:49:48] it shows it when you click on it
[18:49:50] oh
[18:49:55] I didn't click on it 🙃
[18:49:58] okay carry on
[18:50:08] not very user friendly
[18:50:18] well you know, icinga
[18:50:27] :)
[18:50:42] it seems like that backup GRE tunnel is having some issues
[18:51:00] it's over the internet so best effort, but is not used by default
[18:51:07] still is strange
[18:53:40] I acked those alerts for 2h
[19:12:53] Traffic, MobileFrontend, Operations, Readers-Web-Backlog (Tracking): Sections on some mobile pages are not collabsable - https://phabricator.wikimedia.org/T233373 (Jdlrobson) Open→Resolved a:Jdlrobson I'm considering this resolved. The cache was flushed for all wikis as part of T233095.
[19:59:07] netops, Operations: configure BGP route damping on IX sessions - https://phabricator.wikimedia.org/T222424 (ayounsi) Great doc, thanks!
We can use 2000 for reuse, the following will happen:
Flaps up to 6000, then gets stable:
15 min -> 3000
30 min -> 1500 (unblocked as < 2000)
Accepting a prefix 30m...
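The decay figures quoted above (6000 -> 3000 -> 1500) are consistent with a penalty that halves every 15 minutes. A minimal sketch of that arithmetic follows, assuming a 15-minute half-life inferred from those numbers rather than taken from any router configuration; the function names `penalty_after` and `minutes_until_reuse` are hypothetical and only illustrate the calculation.

```python
import math

# Assumption: half-life of 15 minutes, inferred from "15 min -> 3000, 30 min -> 1500".
HALF_LIFE_MIN = 15.0
SUPPRESS = 6000.0  # suppress value mentioned in the task
REUSE = 2000.0     # proposed reuse value from the comment


def penalty_after(initial: float, minutes: float, half_life: float = HALF_LIFE_MIN) -> float:
    """Penalty remaining after `minutes` of stability, with exponential decay."""
    return initial * 0.5 ** (minutes / half_life)


def minutes_until_reuse(initial: float, reuse: float = REUSE,
                        half_life: float = HALF_LIFE_MIN) -> float:
    """Minutes of stability needed before the penalty decays to the reuse value."""
    if initial <= reuse:
        return 0.0
    return half_life * math.log2(initial / reuse)


if __name__ == "__main__":
    for t in (15, 30):
        print(f"{t} min -> {penalty_after(SUPPRESS, t):.0f}")
    print(f"reuse threshold reached after {minutes_until_reuse(SUPPRESS):.1f} min")
```

Under these assumptions the penalty crosses 2000 a little before the 30-minute mark, which matches the "unblocked as < 2000" note at 30 minutes; if the damping check is only evaluated periodically, the actual release time would depend on that interval.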