[08:16:31] it's quite remarkable that eqiad managed to survive with only 2 nodes pooled [08:17:56] we have become much more resilient to eqiad issues with the switch to ats-be, the previous architecture with varnish-be would have definitely melted horribly in a scenario like this :) [08:20:11] AFAICT from the graphs we only started having troubles when cp1089 alone was handling all of eqiad (~26k rps) [08:20:58] so yeah, very useful and interesting load test! [08:23:59] https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-layer=tls&var-cluster=text&from=1581438335158&to=1581476772954 [08:24:25] here's "the graphs" I mentioned above ^ [10:02:08] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3051.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:05:51] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3052.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:08:50] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2023.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:14:38] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2026.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121014_vgutie... 
[10:27:25] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2023.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2023.codfw.wmnet'] ` [10:34:06] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3051.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3051.esams.wmnet'] ` [10:35:46] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2026.codfw.wmnet'] ` and were **ALL** successful. [10:38:28] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3052.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3052.esams.wmnet'] ` [10:46:26] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2019.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121046_vgutie... [10:49:09] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2025.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121048_vgutie... 
[10:50:23] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [10:51:25] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3050.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121051_vgutie... [11:07:14] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2025.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2025.codfw.wmnet'] ` [11:10:25] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2019.codfw.wmnet'] ` and were **ALL** successful. [11:16:00] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2016.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121115_vgutie... [11:19:00] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2024.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121118_vgutie... 
[11:19:38] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [11:23:53] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3050.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3050.esams.wmnet'] ` [11:37:34] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2016.codfw.wmnet'] ` and were **ALL** successful. [11:40:23] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2024.codfw.wmnet'] ` and were **ALL** successful. [12:01:29] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [12:02:41] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2022.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121202_vgutie... [12:05:57] 10Traffic, 10Operations, 10Research, 10Patch-For-Review: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10bmansurov) @BBlack the site has been updated. Please turn on the DNS. 
[12:07:35] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [12:09:55] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2013.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121209_vgutie... [12:20:46] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2022.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2022.codfw.wmnet'] ` [12:32:25] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2013.codfw.wmnet'] ` and were **ALL** successful. [12:35:42] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [12:36:29] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2012.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121236_vgutie... [12:39:08] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2020.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121239_vgutie... 
[12:42:52] 10netops, 10Operations: eqsin (1) MX204 router - https://phabricator.wikimedia.org/T245000 (10ayounsi) p:05Triage→03Medium [13:27:58] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2010.codfw.wmnet'] ` and were **ALL** successful. [13:32:40] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2018.codfw.wmnet'] ` and were **ALL** successful. [13:40:56] 10netops, 10Operations: cr1-eqsin routing engine crashlooping after JunOS upgrade - https://phabricator.wikimedia.org/T244944 (10ayounsi) 05Open→03Resolved Looks solved. [13:48:20] ema: yeah, it does seem like just when the final text node was depooled that we saw issues, which as you say is really amazing [13:49:04] ema: also, if you wanted to take a post-merge review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/571596 i'd appreciate it -- first time trying to write VTC from scratch [13:57:03] cdanis: looking! [14:01:05] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [14:02:05] cdanis: good stuff, belated +1 [14:02:09] ty! [14:03:54] 10Traffic, 10Operations, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) 05Open→03Stalled ATS is having issues handling properly the connect and the TTFB timeout when KA is enabled and parent proxies are... [14:03:57] 10Traffic, 10Operations: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) [14:27:34] bblack: the router has been stable since the upgrade we're good to repool eqsin. 
Only issue is some data is not exposed over SNMP anymore [14:27:53] bblack: I also opened a procurement task to replace that router: https://phabricator.wikimedia.org/T245000 [14:28:05] bblack: let me know if it's ok to repool eqsin [14:28:15] XioNoX: ok - it finally synced bgp without crashing I guess? [14:28:16] XioNoX: i am mildly concerned about that, given we've had CPU usage problems with cr1-eqsin before, but also, seems tolerable [14:28:23] bblack: correct [14:28:30] +1 :) [14:28:34] bblack: yeah, JTAC looked at coredumps and said to use a newer firmware version [14:30:09] cdanis: yeah, I don't like it much either, hopefully it will hold until we replace it [14:30:15] +1 [14:30:24] also +10 on replacing it :) [14:30:47] hehe [14:35:45] 10netops, 10Operations: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) [15:03:06] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121502_vgutie... [15:12:10] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2017.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121512_vgutie... [15:14:39] <_joe_> hi! [15:15:00] <_joe_> any idea why curl -I 'https://en.wikipedia.org/w/api.php?format=json&formatversion=2&errorformat=plaintext&action=query&meta=siteinfo&maxage=86400' gets a "pass" from the caching layer? [15:16:45] <_joe_> I see the backend sets [15:16:48] <_joe_> < expires: Thu, 01 Jan 1970 00:00:01 GMT [15:16:52] <_joe_> is this the reason? 
[15:19:15] _joe_: cache-control: s-maxage=0, max-age=86400, public [15:19:45] <_joe_> oh I forgot to add smaxage as a parameter [15:20:00] <_joe_> ok that way it works [15:20:38] excellent [15:21:12] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2007.codfw.wmnet'] ` [15:29:56] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2017.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2017.codfw.wmnet'] ` [15:31:07] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Lea_Lacroix_WMDE) We just [[ https://phabricator.wikimedia.org/T244722 | increased the factor to 180 ]]... [15:41:07] 10Traffic, 10Operations, 10serviceops: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10Joe) [15:47:24] 10Traffic, 10Android-app-Bugs, 10Operations, 10Wikipedia-Android-App-Backlog, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10Joe) [15:49:48] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [15:50:41] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2006.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121550_vgutie... 
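Note on the 15:15-15:20 siteinfo exchange: the response carried `cache-control: s-maxage=0, max-age=86400, public`. Under standard HTTP caching rules a shared cache prefers `s-maxage` over `max-age`, so `s-maxage=0` leaves nothing to cache until the `smaxage` API parameter is also set. The sketch below only illustrates that precedence. The function name `shared_cache_ttl` is hypothetical, and the actual Varnish/ATS configuration in production may apply additional rules that are not modelled here.

```python
from typing import Optional


def shared_cache_ttl(cache_control: str) -> Optional[int]:
    """Return the freshness lifetime (seconds) a shared cache would pick.

    Hypothetical helper: models only the s-maxage / max-age precedence,
    not Expires, Vary, or any site-specific VCL/ATS logic.
    """
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip().strip('"')

    # Shared caches consult s-maxage first and fall back to max-age.
    for key in ("s-maxage", "max-age"):
        value = directives.get(key)
        if value is not None and value.isdigit():
            return int(value)
    return None


# Header seen at 15:19:15: s-maxage=0 wins, so the shared TTL is zero.
print(shared_cache_ttl("s-maxage=0, max-age=86400, public"))
# Assumed header once smaxage=86400 is also passed to the API.
print(shared_cache_ttl("s-maxage=86400, max-age=86400, public"))
```

The second header is an assumption about what the API returns once `smaxage` is set; the log only confirms that "that way it works".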
[15:50:45] 10Traffic, 10Android-app-Bugs, 10Operations, 10Wikipedia-Android-App-Backlog, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10Joe) [15:53:42] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2014.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121553_vgutie... [15:59:02] 10Traffic, 10Android-app-Bugs, 10Operations, 10Wikipedia-Android-App-Backlog, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10Dbrant) @Joe Thanks for that! We'll update our code asap. Since we use the `si... [15:59:48] bblack, ema: funny.. ats-tls with KA enabled and hitting just one varnish-fe port is faster than with KA enabled and hitting the 8 ports of varnish-fe [16:00:05] well yeah, you get more reuse [16:00:09] like a 50% faster [16:00:17] we may not even need 8 ports, if reuse doesn't cause problems [16:00:29] so I've hit another ATS bug [16:00:35] with KA + parent proxies [16:00:36] we did the 8 port thing to control TIME_WAIT states and socket numeric exhaustion, because of lack of KA and req-per-conn with nginx [16:00:45] so I just disabled parent proxies on cp4031 [16:02:28] vgutierrez: any buster reimaging ongoing today? [16:02:34] (for coordination with bios updates) [16:02:36] :_) [16:02:39] any? [16:02:50] I've reimaged ~20 hosts today I think [16:02:53] not on eqiad though [16:02:58] ok [16:03:13] good to go for bios-flashing downtimes on a couple of eqiads then? 
[16:03:20] right now I'm hitting cp2006 and cp2014 [16:03:22] ok [16:03:23] yeah go ahead [16:03:32] my plan for today is finish codfw [16:03:35] and reimage eqiad tomorrow [16:05:59] 10Traffic, 10DC-Ops, 10Operations, 10ops-eqiad, 10ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) Please note the outage caused the SAL of my adding back cp108[67] to service, (as the rest weren't really returned, but depooled.) [16:06:19] 10Traffic, 10DC-Ops, 10Operations, 10ops-eqiad, 10ops-esams: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [16:12:15] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2006.codfw.wmnet'] ` and were **ALL** successful. [16:14:52] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2014.codfw.wmnet'] ` and were **ALL** successful. [16:18:08] could i get a review on the following from someone in traffic please https://gerrit.wikimedia.org/r/c/operations/puppet/+/571491, thanks [16:22:19] jbond42: hey! [16:22:29] do we still need the varnish-side definitions? 
:) [16:22:37] hey :) [16:22:47] oh alternate-domains, right [16:22:49] we do [16:22:56] bblack: yeah [16:23:31] we may not need the director metadata, but I assume your big patch I haven't read yet, may touch on some of that [16:23:54] jbond42: excellent, I see that cas-graphite has been added to the TLS certificate [16:24:16] yep it's not in dns yet but i think everything else should be configured [16:24:24] pcc against a cache_text host looks good https://puppet-compiler.wmflabs.org/compiler1001/20774/ [16:24:30] it's not in dns yet though [16:24:35] cas-graphite I mean [16:25:24] I'll leave you to it and go bring up a dns server :) [16:25:38] jbond42: the ops/puppet change looks good! [16:25:46] great thanks [16:26:32] I'm off :) o/ [16:27:09] bye [16:27:57] bblack: dns change is here https://gerrit.wikimedia.org/r/#/c/operations/dns/+/571753 :) [16:28:08] jbond42: dns update may flake on you for now, let me finish bringing up the missing server (needs reimage) [16:28:26] yes no rush from my end can wait until tomorrow [16:32:41] 10Traffic, 10Varnish, 10Core Platform Team, 10MediaWiki-API, 10Operations: Evaluate the feasibility of cache invalidation for the action API - https://phabricator.wikimedia.org/T122867 (10Demian) [16:36:07] 10Traffic, 10Elasticsearch, 10Operations, 10Discovery-Search (Current work), and 2 others: Sustained periods (2-4h) of bad latency on production-search eqiad - https://phabricator.wikimedia.org/T241421 (10TJones) 05Open→03Resolved a:03TJones [16:52:47] 10Traffic, 10Operations: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [17:07:38] vgutierrez: is it the package alone that does it, or the KA changes? 
[17:07:49] we're testing KA on cp4031 [17:07:58] 8.0.6-rc0 in 4026 and 4032 [17:08:00] so unrelated issues [17:08:03] ok [17:08:24] I'm quite concerned with ATS KA behaviour right now [17:08:29] including ats-backend [17:10:30] so, as a test environment I have an ATS instance that's in front of a httpbin container [17:10:42] httpbin as in https://httpbin.org/ [17:11:06] I'm hitting the following URL: http://127.0.0.1:3128/delay/20 [17:11:14] so that endpoint returns a 200 after 20 secs [17:11:42] with KA enabled, I hit that using curl [17:11:52] the first time, it returns in 20 seconds (as expected) [17:12:02] the second time, it returns in 23 seconds [17:12:13] and httpbin access log shows the following [17:13:07] httpbin_1 | time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73 [17:13:07] httpbin_1 | time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91 [17:13:07] httpbin_1 | time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50 [17:13:20] first request is the first curl [17:13:28] second and third request are the second curl [17:13:39] a timeout is *wrongly* triggered after ~3 seconds [17:13:47] and it retries the request [17:18:46] with KA disabled (only that change) do things work as expected? 
[17:18:59] yes [17:19:26] * vgutierrez retesting as we speak [17:22:44] yes [17:22:47] both curls return in 20 seconds [17:22:48] and [17:22:51] httpbin_1 | time="2020-02-12T17:21:53.9234" status=200 method="GET" uri="/delay/20" size_bytes=502 duration_ms=20003.33 [17:22:51] httpbin_1 | time="2020-02-12T17:22:18.0782" status=200 method="GET" uri="/delay/20" size_bytes=502 duration_ms=20001.18 [17:23:02] only 2 requests on httpbin side [17:23:27] I'm afraid this issue has been introduced by backporting PR 4028 from upstream [17:24:02] that PR is the one splitting the connect and the TTFB timeout [17:24:31] does it introduce a new param that's default to 3s and needs changing? [17:24:52] nope [17:24:54] but also it shouldn't be status=200 if it's a timeout [17:24:57] and ero bytes heh [17:25:01] *zero [17:25:10] it changes the meaning of one existing parameter [17:25:18] and we took that one down from 180s to 3s [17:25:26] 120 -> 3s sorry [17:25:30] and 120 -> 10s for ats-be [17:25:44] well [17:25:54] it changes the 3s to be connect-only, and not ttfb? [17:26:04] yes [17:26:10] so what control ttfb now? [17:30:26] proxy.config.http.transaction_no_activity_timeout_out [17:30:38] so right now our TTFB is 120secs on ats-tls [17:30:43] and the connect timeout is 3secs [17:30:50] and that works as expected without KA [17:30:56] with KA we're screwed [17:31:05] ok [17:31:28] because it seems to use connect timeout as the ttfb timeout for the 2nd (and presumably rest) of the reqs on a reused conn? [17:31:37] yes [17:31:39] that's the issue [17:47:49] bblack: I've confirmed it locally.. without PR 4028, it behaves as expected [17:52:02] did we get the missing #4020 fix too? 
[17:52:23] https://github.com/apache/trafficserver/pull/4028#issuecomment-433905069 [17:52:29] I'm working on that right now :) [17:53:05] ok :) [17:53:13] note also upstream seems to have reverted it, oddly [17:53:17] yep [17:53:28] I had some discussion about it with them [18:07:46] 8.0.6-rc0 introduces quite a lot of new H2 errors that we weren't seeing before [18:07:51] sigh [18:21:25] PR 4020 doesn't fix the KA issue [18:21:54] and I'm out of here...after 12 hours I don't think I'm being productive anymore [19:37:31] 10Traffic, 10Varnish, 10MediaWiki-API, 10Operations: Evaluate the feasibility of cache invalidation for the action API - https://phabricator.wikimedia.org/T122867 (10WDoranWMF) [21:34:07] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Bugreporter) I think increase the factor will not make thing better, it only increase the oscillating p... [21:40:32] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Bugreporter) If the rate of edit Query Updater can handle is a constant, changing the factor will not a... [22:01:45] 10Traffic, 10Discovery, 10Operations, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10ArthurPSmith) @Bugreporter > I think increase the factor will not make thing better, it only increase... [23:54:00] 10netops, 10Operations: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) cr1-eqsin is back to normal, next step is to plan esams.
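Note on the 17:10-17:31 keep-alive test: the symptom was read off the httpbin access log, where a `/delay/20` request that ends after ~3.7 s with `size_bytes=0` marks the attempt ATS abandoned before retrying. A minimal sketch of that check follows, using the three log lines pasted at 17:13:07. The helper names and the 20000 ms threshold are illustrative assumptions. It only inspects log output and does not reproduce the ATS behaviour itself.

```python
import re

# key=value pairs as they appear in the pasted httpbin access log lines.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_line(line):
    """Split one access-log line into a dict of string fields."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def aborted_requests(lines, expected_delay_ms):
    """Return entries that finished sooner than the endpoint's delay.

    For /delay/20 a healthy request lasts at least 20000 ms, so anything
    shorter was cut off by the client side (here, the proxy).
    """
    flagged = []
    for line in lines:
        fields = parse_line(line)
        if "duration_ms" not in fields:
            continue
        if float(fields["duration_ms"]) < expected_delay_ms:
            flagged.append(fields)
    return flagged


LOG = [
    'httpbin_1 | time="2020-02-12T16:37:41.5787" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.73',
    'httpbin_1 | time="2020-02-12T16:37:52.2979" status=200 method="GET" uri="/delay/20" size_bytes=0 duration_ms=3717.91',
    'httpbin_1 | time="2020-02-12T16:38:12.2721" status=200 method="GET" uri="/delay/20" size_bytes=507 duration_ms=20002.50',
]

for entry in aborted_requests(LOG, expected_delay_ms=20000):
    print(entry["time"], entry["duration_ms"], entry["size_bytes"])
```

Run against the two lines pasted at 17:22:51 (KA disabled), the same check flags nothing, which matches "only 2 requests on httpbin side".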