[07:28:34] Hi! I'd like to deploy a follow-up to yesterday's deployment today. I already put it into the calendar for 10:00 UTC.
[07:28:41] Will any of you be around?
[08:17:19] I will be around
[09:32:10] duesen: btw, is it ok for redioscope to run from one DC only? context is that we are upgrading the k8s cluster (cc elukey)
[09:59:11] effie: redioscope should actually only be running in one DC... there should only be one instance of it.
[09:59:55] a second instance isn't a problem as such, we'd just have to make sure the grafana dashboards aren't adding up the numbers from prometheus, otherwise we'll double-count everything
[10:01:53] it looks like it is running on both DCs atm
[10:03:58] it sounds like it is ok for luca to proceed with the codfw upgrade, and potentially not redeploy redioscope on codfw once the k8s cluster is up?
[10:05:39] effie: hold on, let me double-check something... maybe I'm not remembering the setup correctly...
[10:07:58] Ok, I was indeed wrong. We should have one instance per DC. Redioscope is generating additional stats for the REST gateway rate limits by talking directly to redis. So on a DC where there's no traffic on the gateway, it doesn't matter if redioscope is running. And in any case, if redioscope is down for a while, nothing breaks. All it does is generate metrics.
[10:08:31] So, go ahead with the upgrade, but please re-deploy on both DCs
[10:09:04] effie: I was going to hit +2 on my first patch and start deployment, is that ok with you?
[10:10:23] duesen: go ahead
[10:18:53] applying to staging and running tests now
[10:26:47] tests are a bit flaky, running again...
[10:30:50] ok, looks good. pushing to eqiad
[10:32:15] duesen: we have some elevated error rate on eqiad, we are looking into it
[10:32:22] but it is not related to your deployment
[10:35:31] ok, thanks for letting me know
[10:35:57] I was about to apply to codfw and merge the second patch. is that ok?
[10:37:31] oh right, I see the jump in 500 errors at 10:00 UTC... yeah, I didn't apply my change until 30 minutes later.
[10:50:43] testing the second patch on staging didn't work... I was expecting issues there; it's likely a problem with the test setup, since it requires a different host header. I'm still pretty confident. I'll fiddle with the tests for a bit
[11:11:01] something is off...
[11:12:25] Raine, claime: could it be that www.wikifunctions.org isn't routed through the gateway (but abstract.wikipedia.org is)? I'm not seeing wikifunctions in the hosts list... But I did get ratelimits for calls from abstractwiki to wikifunctions... I think?
[11:12:42] When I set host:www.wikifunctions.org for a request to staging, I get a 404
[11:13:38] It's not a *huge* deal, but it does prevent proper testing. I'd still like to get the ratelimit policy for the wikifunctions and abstractwiki endpoints out. As far as I can tell, they exist on both domains.
[11:25:57] effie: about 15 minutes ago, request times went up and error rates as well (again). I also see a substantial increase in api requests classified as "anon browser". Could be related?
[11:26:10] yes it could
[11:28:50] ok... I'll leave the latest patch merged but undeployed for now, until I hear from Raine or Clement. I'll revert later today if I don't hear back from them.
[12:14:38] FYI both Raine and Clement are OOO until Monday
[14:23:21] matthieulec: uh... darn... perhaps hnowlan can help, once the current emergency is over. Otherwise I'll have to revert.
[17:12:59] I reverted the patch and applied the revert to staging.
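The Host-header test described at 11:12 can be reproduced with a few lines of Python. This is a minimal sketch only: the staging URL and request path below are illustrative assumptions, not the actual endpoints used in the conversation.

```python
# Minimal sketch of the Host-header test from 11:12 UTC.
# The staging URL and path are hypothetical placeholders.
import requests

STAGING_URL = "https://staging.example.wikimedia.org/wiki/Main_Page"  # assumed endpoint

# Send the request with an overridden Host header so it is routed as if it
# had been made to www.wikifunctions.org. If that domain is not in the
# gateway's host list, a 404 is the expected result (as reported above).
resp = requests.get(
    STAGING_URL,
    headers={"Host": "www.wikifunctions.org"},
    timeout=10,
)
print(resp.status_code)  # 404 here matches the behaviour seen on staging
```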
[17:42:12] duesen: apologies, missed this while troubleshooting something else: yes, you're correct that API requests to wikifunctions.org are not routed through the gateway.
[17:42:42] https://gerrit.wikimedia.org/g/operations/puppet/+/0f92e9b968d397ba2980d818e5906b5f51258eb2/hieradata/common/profile/trafficserver/backend.yaml#458 <- no gateway-check.lua in the plugin stack
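A quick way to spot-check from the outside whether a domain's API traffic passes through the gateway, without reading the Traffic Server config, is to look for rate-limit headers that a gateway typically adds to responses. This is only a sketch under that assumption; the exact header names and the probed URLs below are guesses, and an absence of such headers is suggestive rather than conclusive.

```python
# Hedged sketch: probe two domains and report any rate-limit-style headers.
# Header names and URLs are assumptions, not confirmed gateway behaviour.
import requests

for host in ("www.wikifunctions.org", "api.wikimedia.org"):
    resp = requests.get(f"https://{host}/", timeout=10)
    rl = {k: v for k, v in resp.headers.items()
          if k.lower().startswith(("ratelimit", "x-ratelimit"))}
    status = ("has ratelimit headers" if rl
              else "no ratelimit headers (possibly not behind the gateway)")
    print(f"{host}: HTTP {resp.status_code}, {status}")
```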