[00:03:07] HTTPS, Traffic, DBA, Operations, and 4 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499 (JFishback_WMF)
[02:03:44] HTTPS, Traffic, Operations, Wikimedia-Blog: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511 (Varnent) Open→Declined This site has been closed and is no longer being actively developed.
[02:06:04] HTTPS, Traffic, Operations, Wikimedia-Blog: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728 (Varnent)
[10:38:36] Traffic, Operations, Patch-For-Review: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (akosiaris) I blocked a number of IPs manually on cr3 and cr4 for ulsfo. Command was `set policy-options prefix-list blackhole4 ` for 5 IPs. The prefix list w...
[11:20:43] Traffic, Operations: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (ema) Today we've been tackling the "FortiGate" angle (correlation described in T243634#5848297). The host in trouble this morning was cp4028, with 140k FDs at 10:30. In total, 5 different "...
[11:37:49] Traffic, Operations: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (akosiaris) I just reverted the cr3, cr4 ulsfo change.
[12:05:25] ema, vgutierrez, bblack: any objections against merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/566476/ ?
+1d by Filippo, but wanted to give the service owners a chance to object
[12:05:30] checking
[12:08:36] thx
[12:28:19] Traffic, Discovery, Operations, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (JeanFred)
[13:11:07] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp5006.eqsin.wmnet ` The log can be found in `/var/log/wmf-auto-reima...
[14:00:58] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5006.eqsin.wmnet'] ` and were **ALL** successful.
[14:37:29] Acme-chief, Traffic, Operations, Patch-For-Review: acme-chief is unable to renew certificates against LE staging environment - https://phabricator.wikimedia.org/T244236 (Vgutierrez) Open→Resolved Fixed by backporting https://github.com/certbot/certbot/commit/0b5468e992ab57fa028ddf33ca2351...
[14:52:26] Traffic, Operations, Performance-Team, Patch-For-Review: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 (ema) I have applied the change to cp1075 for some minutes, and the effect on network transfer is [[https://grafana.wikime...
[15:18:29] Traffic, Operations: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (ema) This just happened on cp1087: ` Feb 05 15:14:05 cp1087 systemd[1]: Reloaded Apache Traffic Server is a fast, scalable and extensible caching proxy server.. Fe...
[15:21:35] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp5012.eqsin.wmnet ` The log can be found in `/var/log/wmf-auto-reima...
[15:52:51] ema: vgutierrez: have either of you done normal development flow with a local clone of trafficserver github, etc?
[15:53:13] to hack ATS itself?
[15:53:18] I'm finding that out of the box a lot of the gold tests fail
[15:53:22] yeah
[15:53:49] I was trying to just make sure the vanilla stuff works before I start hacking (that I can build and it passes tests), but so far the test results are pretty dismal
[15:54:03] so my laptop is unable to compile ATS, I abuse boron for that
[15:54:05] maybe because I'm on master, and I need to switch to a release to get reliable tests or whatever
[15:54:16] I got it to compile, and even pass the "make check" unit tests
[15:54:19] nope nope, stay away from master
[15:54:32] but there's the separate "gold" tests from autest, lots of those seem to fail on master
[15:54:40] that's kinda normal sadly
[15:54:43] ok
[15:54:46] it promised only lies
[15:54:59] bblack: what are you trying to hack?
[15:55:22] I wanted to look at the difficulty/feasibility of hit4pass-like concepts
[15:55:35] I've started grokking some of the related bits of the source tree
[15:55:55] but I figure if I really wanted to stab at even a demo-level feasibility patch, I need to be able to write new tests, etc, etc
[15:56:07] so now I'm just working on getting a stable dev workflow I can test a patch against
[15:56:16] will try shifting to some release branch/tag I guess
[15:56:55] so....
[15:57:02] ATS expects PRs against master
[15:57:15] but I'd recommend you checking the current state of affairs in their CI
[15:57:18] well yeah
[15:57:50] but what I need is a testsuite that passes, so I know whatever I'm hacking isn't breaking other things :)
[15:57:56] it's not the first time I submit a PR that fails on the tests cause some other stuff broke it
[15:57:58] I can always forward-port to master later
[15:58:05] hmm that's not always easy
[15:58:16] but if you wanna go down that path
[15:58:20] easier than having no idea what I'm breaking
[15:58:21] I'd recommend you 8.0.5 :)
[15:58:36] it is what I usually do
[15:59:04] there's still no 9.0.0 release :/
[15:59:24] nope
[16:03:34] heh new build requirement stepping down from master to 8.0.5: TCL development headers/libs :)
[16:03:46] scary, I can't remember the last time I even heard the world TCL :)
[16:03:50] *word
[16:04:25] what I usually do to test my stuff is append it as a new debian patch on top of our debian release
[16:04:34] and run that in a set of containers
[16:04:39] with a few httpbin containers as origin
[16:05:34] the hit-for-pass stuff runs pretty deep, which is why I'm worried about breaking existing stuff. I'll almost certainly have to add novel testsuite stuff as well to exercise it.
[16:05:42] Traffic, Operations, Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5012.eqsin.wmnet'] ` Of which those **FAILED**: ` ['cp5012.eqsin.wmnet'] `
[16:05:45] but yeah
[16:06:15] this is basically just figuring out feasibility though
[16:06:37] dunno if it's worth talking to them before going that deep :)
[16:06:45] if I can reasonably-quickly make a MVP patch that "works" but has some rough edges to sort out
[16:06:52] or if it looks way harder than that
[16:07:06] then I'll have something intelligent to say or think about it either way
[16:19:37] 8.0.5 doesn't pass its own testsuite on my laptop either. it passes more of it at least :)
[16:22:15] I think I'll just have to adapt and give up on the whole "clean testsuite" mental model :P
[17:24:51] actually the more I dig, I think it might be possible to prototype hitforpass-like behavior from plugins/lua
[17:25:26] it would be "interface abuse" on some level, but yeah, maybe, most of it anyways
[17:29:51] basically the DOC_BUSY status (what a cache reader sees when another request is busy writing/updating the object) does a lot of the same things an hfp would (force a miss and don't try to make a new object from the response)
[17:30:18] but then that gets into some of the tunable config about read_while_writer issues, too
[17:30:45] but since the lookup results are plugin-settable...
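For reference, the hit-for-pass (hfp) idea discussed above can be sketched as a toy Python cache. This is an illustration of the general concept only, not ATS code and not the ATS plugin or Lua API: `ToyCache`, `Entry`, the `fetch` callback and its `(body, cacheable, ttl)` return shape are all hypothetical names chosen for the example. When the origin's response turns out to be uncacheable, the cache stores a marker instead of an object; while the marker is alive, lookups are forced to the origin and nothing new is stored from the response.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical origin callback: url -> (body, cacheable, ttl_seconds)
Fetch = Callable[[str], Tuple[str, bool, float]]


@dataclass
class Entry:
    body: Optional[str]          # None when the entry is only a marker
    expires: float               # absolute expiry time on the cache clock
    hit_for_pass: bool = False   # True: "do not cache, go to origin"


class ToyCache:
    """Toy cache with hit-for-pass markers (illustrative assumption only)."""

    def __init__(self, fetch: Fetch, hfp_ttl: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch
        self.hfp_ttl = hfp_ttl
        self.clock = clock
        self.store: Dict[str, Entry] = {}

    def get(self, url: str) -> Tuple[str, str]:
        """Return (body, outcome), outcome being 'hit', 'miss' or 'pass'."""
        now = self.clock()
        entry = self.store.get(url)
        if entry is not None and entry.expires <= now:
            del self.store[url]  # expired object or expired marker
            entry = None

        if entry is not None and entry.hit_for_pass:
            # Marker found: go to origin and do NOT create a new object.
            body, _cacheable, _ttl = self.fetch(url)
            return body, "pass"

        if entry is not None and entry.body is not None:
            return entry.body, "hit"

        # Plain miss: fetch, then store either the object or a marker.
        body, cacheable, ttl = self.fetch(url)
        if cacheable:
            self.store[url] = Entry(body, now + ttl)
        else:
            self.store[url] = Entry(None, now + self.hfp_ttl, hit_for_pass=True)
        return body, "miss"
```

In this sketch the first request for an uncacheable URL reports `miss` and leaves a marker behind; later requests report `pass` until the marker's `hfp_ttl` runs out. Request coalescing (the read_while_writer concern raised in the discussion) is deliberately left out.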
[17:31:12] inserting the magic hfp object is trickier, but maybe doable as well if we wanted a non-source-patching experiment
[17:32:41] the problem with naively using DOC_BUSY as an hfp-like, is it would only work with RWW coalesce turned off
[17:33:12] (which I guess kinda defeats the purpose, but it's still interesting how close the cases are)
[17:53:06] Traffic, Operations, ops-eqsin: rack/setup/install ps[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T242250 (RobH) Update: I've coordinated with Jin via Google Hangout Messages and he has reviewed the rack and ensured he has all the cables needed. I sent in this email to him, but since then...
[17:53:31] Traffic, Operations, ops-eqsin: rack/setup/install ps[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T242250 (RobH)
[18:00:01] Traffic, netops, Operations, observability, Patch-For-Review: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (CDanis) Open→Resolved
[18:21:19] Traffic, Operations, Performance-Team: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 (Krinkle) ###### Network | [Dashboard: Cluster overview (eqiad appservers)](https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&...
[18:48:21] netops, Operations, ops-codfw: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (jijiki) p:Triage→Normal
[19:23:38] Traffic, Operations, Inuka-Team (Kanban), MW-1.35-notes (1.35.0-wmf.16; 2020-01-21), and 2 others: Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (SBisson) This was enabled in production just now.
[23:40:18] Traffic, Discovery, Operations, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa)
[23:41:07] Traffic, Discovery, Operations, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa)
[23:49:41] Traffic, Operations, Inuka-Team (Kanban), MW-1.35-notes (1.35.0-wmf.16; 2020-01-21), Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (nshahquinn-wmf) Open→Resolved I'm seeing events flowing into the production database, so I...