[06:22:28] netops, Operations, fundraising-tech-ops, observability: update nagios_nsca configuration in frack for new nsca servers - https://phabricator.wikimedia.org/T262291 (ayounsi)
[07:00:13] netops, Operations: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (ayounsi) p:Triage→Medium
[07:20:32] ema, elukey: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=cache_text%3A+Varnishkafka+webrequest+Delivery+Errors+per+second+-eqsin- this looks a bit bad, no?
[07:23:46] it does yes, I am not sure why it was not showing up in icinga but only in unhandled problems
[07:24:00] I didn't ack anything yesterday that I remember
[07:24:08] anyway, checking vk on those nodes
[07:28:55] ah ok it was there, but under "icinga1001" since we have aggregate alarms, and went unnoticed
[07:28:59] * elukey cries in a corner
[07:29:03] thanks XioNoX
[07:29:14] sadly we have been dropping data from those nodes
[07:29:32] elukey: I didn't want to make you cry :(
[07:31:48] my team will not be happy as well I think
[07:32:52] so basically for some reason librdkafka kept trying to use an old socket to kafka-jumbo1006 (that is in d3 IIRC) until I restarted vk
[07:33:03] only from those two nodes in eqsin
[10:32:34] vgutierrez: acme-chief-certs-sync.service failed on acmechief-test1001.eqiad.wmnet is expected? (I just ran a systemctl --failed on eqiad for other reasons)
[10:33:09] hmmm only if keyholder isn't happy
[10:33:14] * vgutierrez checking
[10:33:59] vgutierrez@acmechief-test1001:~$ sudo -i keyholder status
[10:33:59] keyholder-agent: active
[10:33:59] - The agent has no identities.
[10:33:59] keyholder-proxy: active
[10:33:59] - The agent has no identities.
[10:34:03] yup... expected
[10:34:06] * vgutierrez fixing
[10:34:27] we have an icinga check for not-armed keyholder
[10:34:32] has the host notification disabled?
[10:34:37] indeed
[10:34:56] it isn't a big deal on -test TBH :)
[10:36:23] ack
[10:36:43] keyholder armed, thanks for pinging 🍺
[10:36:49] de nada
[13:39:39] vgutierrez: the warns in icinga for the wildcard ecdsa/rsa certs are known? can be ack'd?
[13:40:13] the unified cert from digicert should be under the renewal process AFAIK
[13:40:14] godog: known, yes, we should ack them
[13:40:18] there's a ticket open about the renewal
[13:40:46] T261419
[13:41:06] ok! will ack with that task
[13:41:11] thx :D
[13:42:13] np
[13:43:33] I'm comparing icinga with https://alerts.wikimedia.org to see how the grouping is working out FWIW
[14:12:57] netops, Operations, fundraising-tech-ops, observability: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (Jgreen)
[14:13:01] netops, Operations, fundraising-tech-ops, observability: update nagios_nsca configuration in frack for new nsca servers - https://phabricator.wikimedia.org/T262291 (Jgreen) Open→Resolved p:Triage→Medium a:Jgreen Config change is deployed to puppet and appears to be working fro...
[15:12:09] Traffic, Operations: Cache Accept-language optimisation - https://phabricator.wikimedia.org/T262428 (jbond) p:Triage→Medium
[15:22:42] https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/
[15:22:58] cdanis: ^ if you didn't notice it already
[15:32:46] (not that you're here, but you'll see it eventually)
[15:33:32] bblack: thanks!
[15:56:01] Traffic, DNS, Operations: Verify diff.wikimedia.org ownership for Facebook - https://phabricator.wikimedia.org/T259807 (CKoerner_WMF) Hello friends. Is there anything I need to do to help move this along?
[18:41:40] Traffic, Operations, Product-Infrastructure-Team-Backlog, serviceops, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) Resolved→...
[18:45:16] Traffic, Operations, Product-Infrastructure-Team-Backlog, serviceops, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (bearND) Is there a se...
[18:45:40] Traffic, Operations, Product-Infrastructure-Team-Backlog, serviceops, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) Sadly, we still...
[18:48:00] Traffic, Operations, Product-Infrastructure-Team-Backlog, serviceops, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) To clarify furth...
[18:55:45] bblack vgutierrez: either of you around to help with a Varnish and ATS cache ban for a UBN? hoping to avoid needing to page you
[19:11:23] I just read this rzl
[19:11:35] Do you still need help?
[19:12:21] vgutierrez: yes thanks! T262437 is context
[19:12:22] T262437: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437
[19:12:40] I'm trying to ban /api/rest_v1/page/mobile-html/* and /api/rest_v1/page/mobile-html-offline-resources/* from cache
[19:13:26] I'll be online in 2 minutes
[19:13:28] I see the instructions on Wikitech, so I'm starting with ATS, but I'm not sure what file to put the Lua in, frankly I need an adult :)
[19:13:33] ack, much appreciated
[19:18:36] you can take 89b540ba3e as an example
[19:18:54] that one filters by Host Header
[19:19:05] oh it just goes in default.lua? okay cool
[19:19:08] indeed
[19:20:47] and you can retrieve the URL with ts.client_request.get_url()
[19:21:02] https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/plugins/lua.en.html#ts-client-request-get-url
[19:21:35] I assume it's all handled now?
[19:22:49] oh probably not yet
[19:23:00] rzl: if you're already working up patches, I'll review
[19:23:08] thanks, almost ready
[19:23:11] :)
[19:23:56] we'll want to ban the backends first, then the lua bit for the frontend. backend is relatively straightforward
[19:24:16] err sorry I said all that backwards
[19:24:26] :)
[19:24:28] backends first, which is the lua bit, then frontends, which is relatively straightforward :)
[19:25:31] posted https://gerrit.wikimedia.org/r/626210
[19:25:36] any word on how to test?
[19:26:14] rzl: hmm s/url:match/url.match
[19:26:40] even string.match(url, ...)
[19:26:43] oh really? I got that from https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Choosing_origin_server but I don't know lua :)
[19:27:09] heh
[19:27:19] uh...
[19:27:31] jenkins is clearly unhappy about *something* though, still looking
[19:27:51] perfectly happy to try the s/:/./, I just don't get it
[19:28:29] we don't have any copypasta examples quite like either variant
[19:28:45] https://www.irccloud.com/pastebin/hGtYVKGb/
[19:28:53] so : also works apparently
[19:29:23] is "url" a thing? I see other examples around our code with:
[19:29:25] local uri = ts.client_request.get_uri()
[19:29:49] oh that clicks with the error from jenkins
[19:29:58] modules/profile/files/trafficserver/default.lua:73: attempt to call field 'get_url' (a nil value)
[19:30:01] I'll make it uri, thanks
[19:30:30] different stuff
[19:30:42] and probably get_uri() is what you need, cause it doesn't return the host part
[19:31:05] the error BTW is because in default.lua get_url() isn't being used, so the test suite is crashing because some mocking is required
[19:31:18] ahhh
[19:31:29] it's never simple is it? :)
[19:31:36] not with ATS /o\
[19:32:33] okay, that satisfied jenkins -- is there any quick way to add a test case?
[19:32:58] default_test.lua
[19:33:46] we can override the current mock for get_uri() and test if ts.http.config_int_set(TS_LUA_CONFIG_HTTP_CACHE_GENERATION is being called as expected
[19:34:20] okay I think I see how these work
[19:37:14] can't see any examples right here, can I just pass two params to was.called_with?
[19:37:39] as in, assert.stub(ts.http.config_int_set).was.called_with(TS_LUA_CONFIG_HTTP_CACHE_GENERATION, 1599679366)
[19:38:41] yup IIRC
[19:38:45] cool
[19:40:09] posted PS3, take a look while CI runs?
[19:40:10] yup: assert.spy(s).was_called_with(match.is_ref(t), 2)
[19:40:15] that's an example from busted doc
[19:40:18] * vgutierrez checking
[19:40:38] wait no, do_global_read_request
[19:40:39] hang on
[19:41:40] don't even know a request from a response, you can tell I'm not from around here
[19:42:03] :)
[19:42:14] okay, PS4
[19:43:42] all three tests failed, do I have the logic backwards somehow?
[19:45:32] hmmm
[19:45:34] oh, I have _G.ts.client_request.uri = instead of .get_uri =
[19:46:09] tests need tests!
[19:46:34] nice catch
[19:46:36] thank you both for being very patient while I drive into walls
[19:48:55] hmm, making it get_uri didn't help
[19:51:45] hmmm
[19:51:51] prior code is messing with your test
[19:52:11] config_int_set() is being called several times
[19:52:49] and that's why the stub doesn't match your asserts
[19:53:38] hm
[19:53:54] you need to ensure that no cookie is there (local cookie = ts.client_request.header['Cookie'] evaluates to false)
[19:54:06] is it ok that get_uri is redefined multiple times too?
[19:54:14] and also get rid from the Authorization header
[19:54:21] s/from/of/
[19:54:32] for this test case yes, it's ok
[19:55:15] so basically you're missing _G.ts.client_request.header['Cookie'] = nil
[19:55:27] nod
[19:55:35] and
[19:55:39] in all three tests, or just the top one?
[19:55:57] _G.ts.client_request.header['Authorization'] = nil
[19:56:09] technically the three of them, but adding it to the top one should suffice
[19:57:08] I splurged and went for all three
[19:58:44] sigh
[19:59:27] we improved a little bit?
[19:59:34] heh yeah, 1/3 is better than 0/3
[20:00:00] I was assuming :match is a partial match, let me actually make sure that's true
[20:00:26] hmm it's sure supposed to be
[20:00:58] hmm
[20:01:01] > a = "/api/rest_v1/page/mobile-html/Elephant"
[20:01:01] > a:match('/api/rest_v1/page/mobile-html/')
[20:01:01] nil
[20:01:17] match isn't working as expected
[20:01:40] you can use ^ to left anchor like regex
[20:03:41] it's the -
[20:03:45] https://www.irccloud.com/pastebin/ReUS9s4H/
[20:03:47] heh yep
[20:03:57] minus is a metachar in lua patterns
[20:04:37] ^/api/rest_v1/page/mobile%-html/
[20:04:56] percent??
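Note on the pattern fix above: in Lua patterns `-` is a lazy repetition operator, so the `e-` inside `mobile-html` means "zero or more e, as few as possible", and a literal hyphen has to be written `%-`. A minimal sketch, runnable in a plain Lua interpreter; `is_banned` is a hypothetical helper name used only for illustration and is not taken from the actual patch:

```lua
-- Hypothetical helper (illustration only): true when the URI falls under one
-- of the two path prefixes named in the conversation.
local function is_banned(uri)
  -- "%-" is a literal hyphen; "^" anchors the pattern at the start of the URI.
  return uri:match("^/api/rest_v1/page/mobile%-html/") ~= nil
      or uri:match("^/api/rest_v1/page/mobile%-html%-offline%-resources/") ~= nil
end

local uri = "/api/rest_v1/page/mobile-html/Elephant"

-- Unescaped: "e-" is read as a lazy repetition of "e", so nothing matches.
print(uri:match("/api/rest_v1/page/mobile-html/"))        -- nil

-- Escaped: the hyphen is matched literally.
print(uri:match("^/api/rest_v1/page/mobile%-html/"))      -- /api/rest_v1/page/mobile-html/

-- Alternative: string.find with plain=true disables pattern syntax entirely.
print(uri:find("/api/rest_v1/page/mobile-html/", 1, true))

print(is_banned(uri))                                     -- true
print(is_banned("/wiki/Elephant"))                        -- false
```

The plain-text `find` variant avoids escaping altogether, at the cost of losing the `^` anchor; checking that the returned start index is 1 gives the same prefix behaviour.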
[20:05:02] okay I tried a few things but I never would have got there
[20:05:11] yeah lua patterns are not my fave
[20:05:25] yet another method of torture
[20:05:54] the worst part is that some simple lua patterns look just enough like regexes, to fool the unwary into thinking lua string functions use regexes
[20:06:02] I was indeed unwary
[20:07:03] \o/
[20:07:14] nice :D
[20:07:47] quick, ship it before the test falls over and becomes a -1 again
[20:07:56] lol
[20:08:10] don't forget to summon Cthulhu while you do it
[20:08:14] or it's gonna fail
[20:08:27] should I disable-puppet on A:cp before I merge this or are we just gonna throw it out there?
[20:08:52] "Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn" should do it
[20:09:06] rzl: hmmm yeah, just try in one cp4X node first if you like
[20:09:13] 👍
[20:10:01] 4032 or any other text one
[20:10:07] aka >=4027
[20:13:27] done on 4032
[20:14:00] lovely
[20:15:16] atslog-backend looks right to the untrained eye
[20:18:24] okay to roll out to the rest of the cps?
[20:19:38] yeah
[20:19:50] let's keep an eye on icinga.wm.org/alerts
[20:20:10] sometimes ats-be has hiccups when lua scripts are reloaded :)
[20:20:30] haha sure
[20:20:36] should I cumin it in batches then?
[20:20:44] yeah
[20:20:47] that would be wise
[20:21:00] but if you want your tshirt go for it ;P
[20:21:28] jokes aside, we don't see more than one or two crashes on the whole fleet
[20:21:33] (last famous words)
[20:21:54] sudo cumin -b3 A:cp 'run-puppet-agent -e T262437'
[20:21:55] T262437: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437
[20:21:55] seem good?
[20:22:19] yeah
[20:26:00] rzl: I'll use your pain/commit to update the wikitech page tomorrow EU morning :)
[20:26:20] cause clearly it could use some love
[20:26:20] ha, sounds good
[20:26:30] really appreciate the late-night help
[20:26:39] no problem :D
[20:48:14] okay, that's done and I'm going to rerun j.oe's command from https://phabricator.wikimedia.org/T262437#6448216 to purge varnish
[20:48:23] and then again with s/-offline-resources//
[20:50:54] ack
[20:51:48] done, and https://en.wikipedia.org/api/rest_v1/page/mobile-html-offline-resources/Melanocortin looks correct \o/
[20:54:06] awesome :D
[20:57:17] ZzZ time here :D
[20:57:31] for sure, thanks again <3
[21:29:04] Traffic, Operations, Product-Infrastructure-Team-Backlog, serviceops, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (RLazarus) Open→...
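For reference, the general shape of the ban discussed in this log can be sketched outside ATS with a hand-rolled stand-in for the `ts` global. This is an illustration under stated assumptions, not the merged patch: the mock `ts` table, the placeholder value given to `TS_LUA_CONFIG_HTTP_CACHE_GENERATION`, and the `calls` recorder are invented here so the snippet runs in a plain Lua interpreter. Only the names `get_uri`, `config_int_set`, `do_global_read_request` and the generation value 1599679366 come from the conversation, and the real default.lua may well be structured differently.

```lua
-- Stand-in for the constant that ATS provides; the value is a placeholder.
TS_LUA_CONFIG_HTTP_CACHE_GENERATION = "cache_generation"

-- Minimal mock of the `ts` global (assumption: just enough for this sketch).
ts = {
  client_request = {
    get_uri = function() return "/api/rest_v1/page/mobile-html/Elephant" end,
  },
  http = { calls = {} },
}

-- Record every call instead of touching a real cache.
function ts.http.config_int_set(key, value)
  table.insert(ts.http.calls, { key = key, value = value })
end

-- Value quoted in the conversation; presumably a Unix timestamp.
local BAN_GENERATION = 1599679366

-- Sketch of the hook: bump the cache generation for the affected paths only,
-- so that objects cached under the old generation are no longer served.
function do_global_read_request()
  local uri = ts.client_request.get_uri()
  if uri:match("^/api/rest_v1/page/mobile%-html/")
      or uri:match("^/api/rest_v1/page/mobile%-html%-offline%-resources/") then
    ts.http.config_int_set(TS_LUA_CONFIG_HTTP_CACHE_GENERATION, BAN_GENERATION)
  end
  return 0
end

do_global_read_request()
print(#ts.http.calls)  -- 1
```

Recording calls in a table plays the role that the busted stub plays in the conversation: a test swaps `get_uri` for a function returning the URI under test, runs the hook, and checks which arguments `config_int_set` received.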