[00:12:54] Traffic, Operations, Performance-Team (Radar), Sustainability (Incident Prevention): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Krinkle) Not sure if this is merely a display issue, but I see fairly odd buckets on the dashboard: * 0ms - 438...
[00:59:50] Traffic, Android-app-Bugs, Operations, Wikipedia-Android-App-Backlog, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (JMinor) Open→Resolved
[01:01:25] Traffic, Operations, Performance-Team (Radar), Sustainability (Incident Prevention): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (dpifke) That's because I forgot to change query format to "heatmap" in the panel settings. :) Fixed.
[03:04:37] Traffic, Operations: Maxmind data update issues for DNS (and others?) - https://phabricator.wikimedia.org/T252577 (wkandek) Just FYI: my machine is being served from `esams` again.
[06:50:20] Traffic, Operations, Performance-Team (Radar), Sustainability (Incident Prevention): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles) Open→Resolved I think that I've fixed the display further, the format of the heatmap needed to...
[06:50:22] Traffic, Operations, Performance-Team (Radar): Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 (Gilles)
[07:18:52] Traffic, Operations, Performance-Team (Radar), Sustainability (Incident Prevention): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (ema) >>! In T238086#6139615, @Gilles wrote: > @ema @Vgutierrez you can now use [[ https://grafana.wikimedia.org...
[07:44:31] May 15 04:32:39 cp5006 logrotate[1262]: error: error running non-shared postrotate script for /var/cache/varnishkafka/webrequest.stats.json of '/var/cache/varnishkafka/webrequest.stats.json '
[07:44:35] wat?
[07:46:30] vgutierrez: did cp5006 run into issues today? I see you rebooted it earlier on
[07:47:50] yeah
[08:30:26] Traffic, Operations, Patch-For-Review: ATS: Add the ability to check if origin server responses can be cached and their lifetime to the Lua plugin - https://phabricator.wikimedia.org/T251537 (ema) Open→Resolved a:ema Done, 404 TTL capping now in place: ` root@cp3050:~# timeout 1 atslog-b...
[08:43:13] Traffic, Operations, Patch-For-Review: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 (ema) It might be worth experimenting with **enabling** request coalescing for large files. That could help reducing pressure on transient I think, worth giving it a...
[09:23:10] Traffic, Core Platform Team, MediaWiki-extensions-CentralAuth, Operations, and 5 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (jcrespo) ping @bblack to know if you prefer to make temporary workar...
[09:33:11] netops, Operations, ops-eqord: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (ayounsi) From Telia after asking them the light levels they're getting. > Looks like we are still at times seeing low light and errors in Chicago and transmitting those to San Francis...
[10:19:21] Varnish: upload.wikimedia.org should allow 'Range' via Access-Control-Allow-Headers on CORS preflight - https://phabricator.wikimedia.org/T57631 (Aklapper)
[11:10:10] netops, Operations, ops-eqiad: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (ayounsi) If they're dead: * Either we need them (eg. short on ports), and in that case we need to replace the switch. Which is a heavy operations. * Or we mark the ports...
[12:08:49] someone have a few minutes today to review https://gerrit.wikimedia.org/r/c/operations/puppet/+/549683 ?
[12:16:15] netops, Operations, ops-eqiad: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (faidon) If three ports are permanently failed, I'm not sure how we could ever trust that switch again. Perhaps it's better to do a painful but //planned// replacement ra...
[13:10:59] Acme-chief, Traffic, Operations: acme-chief crashes upon OCSP responder errors - https://phabricator.wikimedia.org/T252881 (Vgutierrez)
[13:45:28] Acme-chief, Traffic, Operations: acme-chief crashes upon OCSP responder errors - https://phabricator.wikimedia.org/T252881 (Vgutierrez) p:Triage→Medium
[13:53:14] cdanis: looking!
[14:00:50] cdanis: can you explain the part quoting firmware-version with sed? What's an example of ethtool output that needs quoting and why?
[14:01:08] ema: no example I know of, but it can be an arbitrary string
[14:01:19] just being paranoid :)
[14:04:01] fair enough!
[15:33:22] Acme-chief, Traffic, Operations: acme-chief crashes upon OCSP responder errors - https://phabricator.wikimedia.org/T252881 (Vgutierrez) OCSP responder issues reported to LE in https://community.letsencrypt.org/t/ocsp-responder-returning-503-errors/122846
[15:44:39] netops, Operations: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (CDanis)
[16:08:57] netops, Operations: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (CDanis) Most transit providers don't participate in RIPE Atlas. Here's the ones who do, in order of CAIDA AS rank: * NTT [[ https://atlas.ripe.net/probes/6066/ | us-atl-as291...
[17:26:36] netops, Operations, ops-eqord: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (ayounsi) From Telia: > Your service was affected by an outage along the transmission path, but the Loss of Signal we saw in Chicago happened after that outage had already started so i...
[17:57:33] Acme-chief, Traffic, Operations, Patch-For-Review: acme-chief crashes upon OCSP responder errors - https://phabricator.wikimedia.org/T252881 (Vgutierrez) Open→Resolved a:Vgutierrez
[18:03:20] Acme-chief, Traffic, Operations: Let's Encrypt OCSP responders are showing 503 errors - https://phabricator.wikimedia.org/T252901 (Vgutierrez)
[18:03:58] Acme-chief, Traffic, Operations: Let's Encrypt OCSP responders are showing 503 errors - https://phabricator.wikimedia.org/T252901 (Vgutierrez) p:Triage→Medium
[19:52:03] Acme-chief, Traffic, Operations: Let's Encrypt OCSP responders are showing 503 errors - https://phabricator.wikimedia.org/T252901 (Vgutierrez) Open→Resolved a:Vgutierrez `May 15 19:43:27 acmechief1001 acme-chief-backend[30417]: Refreshing live OCSP response for certificate non-canonical-r...