[01:28:55] netops, Operations: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (Papaul) a:Papaul→None
[06:18:47] netops, Operations: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (ayounsi) After manually setting `check_fail = 2` overnight the service stopped being randomly depooled. Bird restarts didn't trigger damping either.
[08:57:54] Traffic, Operations: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (ema)
[08:58:05] Traffic, Analytics, Operations: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (ema)
[09:54:43] Traffic, Analytics, Operations: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (elukey) I also checked with `top` on cp5011 to better visualize the graph, and the usage is really too much from what I used to see. We are in the process of evaluating atskafka but it w...
[11:23:28] netops, Operations, Patch-For-Review: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (ayounsi) Open→Resolved Fixed.
[13:32:58] Traffic, Operations, Patch-For-Review: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (Vgutierrez)
[13:58:04] bblack: ema: is there a check somewhere(s) in the VCL that we were given a request for one of our domains?
[13:58:47] looks like we synth a 400?
[13:59:49] ahh! normalize_request in wikimedia-frontend
[14:04:37] yes, that!
[14:06:19] second time this week that I ask something here and immediately find it myself :)
[14:06:23] need a rubber duck at my desk
[14:09:52] I used to have one but it's been stolen by my daughter
[14:10:07] surely the duck is enjoying the conversations much more
[14:12:42] bblack: dns patch for migrating eqsin to netbox is ready if you want to have a look.
We could merge it later today or tomorrow
[14:12:45] https://gerrit.wikimedia.org/r/c/operations/dns/+/630644
[14:14:15] https://gerrit.wikimedia.org/r/c/operations/puppet/+/630860/ ema :)
[14:14:40] I also noticed we're sending some Set-Cookies in responses to un-owned domains, do you think I should write a followup patch to stop that?
[14:16:03] cdanis: yes please
[14:16:39] volans: +1
[14:17:16] cdanis: I assume they're going out with the 400? Maybe we could do a condition in the delivery side on 400 "invalid"
[14:17:21] (to stop cookies and such)
[14:18:21] do we actually want to send cookies, in general, along with error responses?
[14:20:15] hm
[14:20:24] bblack: yeah, they're going out with the 400
[14:20:34] I had conditioned on the host header, I'm not sure about in general on errors...
[14:20:55] on a 404 I think we still probably want to set cookies
[14:21:29] oh on the bereq.http.host == "invalid" sort of condition you mean?
[14:21:42] yeah, I had planned on only conditioning on that
[14:21:49] ok that sounds reasonable
[14:21:51] it seems like the simplest change
[14:22:22] you could probably make the argument for stripping cookies on all [45]xx, but that's something that needs thinking and confirmation
[14:23:26] I think that one depends on the cookie :)
[14:23:57] you could also argue we're abusing the generic 400 status for some of this and should use 421
[14:24:17] (400 kinda makes sense for a malformed host header, but maybe 421 would be better for a well-formed but misdirected one)
[14:24:26] yeah I think 421 makes slightly more sense
[14:25:30] browsers I think do have bugs related to this, because we do get odd reports with 3rd party domains in the host header
[14:25:42] (which is kinda scary from the user pov)
[14:31:11] DNS bit squatting kind of thing or even worse?
[14:31:44] no I think it's actual bugs, the names are completely unrelated
[14:31:46] https://www.youtube.com/watch?v=aT7mnSstKGs
[14:31:49] ack
[14:32:01] yeah..
that's pretty scary
[14:32:06] like, browsers just sending requests over the wrong one of several open persistent conns to different domains
[14:32:20] yeah, it does seem scary
[14:32:34] sigh I am baffled as to why this doesn't pass vtc
[14:32:35] I don't know of a better explanation though
[14:32:41] ---- c1 2.8 EXPECT resp.http.Set-Cookie (WMF-Last-Access=29-Sep-2020;Path=/;HttpOnly;secure;Expires=Sat, 31 Oct 2020 12:00:00 GMT) == "" failed
[14:32:50] and yet it is a 400 response
[14:33:07] cdanis: patch?
[14:33:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/630865bblack:
[14:33:40] argh
[14:33:42] bblack: https://gerrit.wikimedia.org/r/c/operations/puppet/+/630865
[14:35:12] your patch blocks the geoip cookie, and you're failing on the analytics cookie?
[14:35:19] (needs more patch?)
[14:35:57] cdanis: ^
[14:35:58] ... oops
[14:38:04] thanks :)
[14:41:58] Traffic, CheckUser, Operations: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (NickK) >>! In T181368#6488698, @BBlack wrote: > Is this still desirable for checkusers? Infrastructure has changed since then and is still-changing, bu...
[14:45:01] a confession: I don't understand how to return early from a VCL subroutine, so I'm sprinkling conditionals everywhere
[14:46:08] there is no way to return early from a VCL "subroutine" :)
[14:46:38] I was kind of suspicious of that, after looking at the implied semantics of 'return'~
[14:46:40] !
[14:46:56] if you call 10 "subroutines" deep and then put in a return statement, it returns straight back out the whole stack (including skipping any remaining implied default VCL code)
[14:47:05] return means "return to the core varnish C code"
[14:47:05] 'great'
[14:47:18] I'm so ambivalent about my intuitions being correct
[14:47:49] it has been the source of many bugs and frustrations
[14:48:00] but I imagine it makes the implementation of the VCL "compiler" much simpler
[14:48:22] Traffic, Analytics, Operations: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (ema) >>! In T264074#6501340, @elukey wrote: > I think it is better to know if the increase is brought by the new VUT/VSL api or if it is something else. Other units such as `varnishmta...
[14:49:22] it might've been less subtle and confusing if they had used the keyword "goto" instead of "return" heh
[14:50:13] but if you want a mental model that's closer to the reality of the situation...
[14:50:19] heh yeah
[14:50:32] what's really going on is every "call" statement in VCL simply inlines the code of whatever is called, recursively
[14:51:20] "call" is really more like a template directive to inject a block of code with a certain name
[14:56:25] 0 tests failed, 0 tests skipped, 22 tests passed
[15:03:23] Traffic, Analytics, Operations: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (elukey) https://github.com/wikimedia/varnishkafka/commit/b0675e80c2a059ba3a508d8ebfc16a79bee3e154 shows a big change in usage of VUT/VSL, that afaics should be easier (more responsibilit...
[16:05:52] the external monitoring emails about bad requests and recovery that we're getting since yesterday, are they something traffic should look into?
[16:06:10] or more serviceops? I can open a task if needed
[16:21:36] volans: the ones from "
[16:21:38] DX App Synthetic Monitoring App" ?
[16:22:09] I think that's the latest rename/rebrand/whatever of what used to be our deprecated status.wikimedia.org stuff
[16:23:31] yeah maybe open a task with us, we'll have to go find the login for it to see what's going on maybe
[16:23:37] it's probably still in the repo
[16:23:54] bblack: it's "watchmouse" in the repo
[16:24:03] yes those
[16:24:16] we got the first few yesterday but then today started to get more
[16:24:26] it's weird that it says bad request I guess
[16:24:51] I looked a bit at the logs and they're indicating that they are seeing "HTTP/1.1 400 Bad Request" from us
[16:26:23] shdubsh: do you want to open the task, given you've already looked at them?
[16:26:49] * volans about to jump in a meeting
[16:26:55] sure
[16:26:56] yeah maybe just an "investigate this" task to remind
[16:27:03] thanks!
[16:27:09] the outcome may be that the service is awful and we need to remember to shut it off completely :)
[16:28:09] bblack: FYI there are some preferences to depool eqsin before merging the dns patch, anything ongoing in the next hour or two that we should postpone?
[16:28:28] AFAICT all sites are pooled right now
[16:29:06] honestly I think the depool is more impactful than the dns patch risk, but I'm open to whatever
[16:29:07] cc XioNoX if anything is ongoing network wise
[16:29:17] chaomodus: ^^^
[16:29:23] the main thing to look at is any planned link outages or core sites hardware work, etc
[16:29:37] I'm ok without depooling, the diff gives me confidence :)
[16:29:45] okay cool
[16:29:49] i'm good with it
[16:30:10] fear is the mindkiller
[16:30:57] that's very wise.
[16:31:23] chaomodus: wanna do the honors?
[16:31:37] kay :)
[16:31:41] fear is the mindkiller; no haunted graveyards.
[16:32:02] for the non-Dune-readers:
[16:32:08] https://en.wikiquote.org/wiki/Dune#Book_1:_Dune ->
[16:32:16] "I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear.
I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain."
[16:33:31] :)
[16:33:33] - authdns-updating
[16:33:37] confidence boost
[16:33:57] bblack: I'm not sure if e.ma is still around, could I get a quick +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/630860 ?
[16:34:07] the others in each of my two stacks of VCL patches can wait
[16:35:09] oh right, I passed by that earlier and thought (but didn't say): how is the total rate looking? Do we need to reduce the failure_fraction further?
[16:35:31] I don't believe we do
[16:35:37] ok :)
[16:35:37] see also https://phabricator.wikimedia.org/T257527#6499757
[16:35:55] we might have to, but it still seems like it will just add single-digit percentage to logstash load
[16:36:15] and I was going to give o11y a headsup as I puppet-merge and keep an eye on things
[16:36:47] +1'd, I think the bots aren't reporting at present on IRC
[16:37:04] thanks!
[16:39:02] also, 3 quotes down from mine on that same wikiquote link above, a rather prescient one given the book was published in 1965:
[16:39:06] "Once men turned their thinking over to machines in the hope that this would set them free. But that only permitted other men with machines to enslave them"
[16:40:34] * bblack adds "Re-read Dune" to the list he keeps of things that sound like a good idea, for which there will not be free time anywhere in the foreseeable future.
[16:41:05] just don't re-read too many of the books ;)
[17:00:33] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.09.29/network-error/?id=AXTazYKrLNRtRo5XjgSw
[17:00:48] confirmed live for enwiki :)
[17:05:33] Traffic, Operations: External Monitoring alerting on 400 Bad Request errors - https://phabricator.wikimedia.org/T264111 (colewhite) I cannot find any indication that the 400s are originating from our servers either in webrequest log or turnilo. I have temporarily disabled the alerts until this can be lo...
[20:49:18] Traffic, Operations, Puppet, User-herron: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239 (BBlack)
[21:03:03] Traffic, Operations: Servers freezing across the caching cluster - https://phabricator.wikimedia.org/T238305 (Dzahn) Just a few minutes ago: db2125 crashed - mgmt iface also not available T260670
[21:06:02] Traffic, Operations, Performance-Team (Radar): Consider allowing H2 coalesce for upload.wikimedia.org for images used in wiki articles - https://phabricator.wikimedia.org/T116132 (BBlack) All the perf tradeoffs and relatively-trivial work aside, the major blocker we still face here is the likely pro...
[21:12:37] Traffic, Operations, Puppet, Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (BBlack)
[21:12:54] Traffic, Operations, Puppet, Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (BBlack) p:Triage→Low
[21:19:26] Traffic, Varnish, Analytics, Operations, User-Elukey: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374 (BBlack) Open→Declined This is too-stale now and a lot of these bits have been replaced over time and are known to have th...
[21:27:03] Traffic, Commons, Operations, SRE-swift-storage, and 2 others: upload-lb.ulsfo.wikimedia.org still allow access to some deleted files - https://phabricator.wikimedia.org/T133819 (BBlack)
[21:27:09] Traffic, Commons, Operations, SRE-swift-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331 (BBlack)
[21:27:16] Traffic, Commons, MediaWiki-File-management, Operations, Patch-For-Review: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038 (BBlack)
[21:28:09] Traffic, Operations, serviceops, Performance-Team (Radar), Sustainability: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (BBlack) Stalled→Resolved a:ema This should've been closed back when T250781 closed - all purge traffic now goes via kafka queues and mul...