[08:14:41] I'm seeking kind souls for a quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117488
[08:27:15] thank you elukey!
[09:29:53] andrewbogott: heads up, wikitech-static's root is full and mysql is down
[10:02:20] claime: many GB of https://wikitech.wikimedia.org/wiki/File:Pontoon_demo_graphite_buster.ogv wouldn't be helping
[10:02:35] * claime looks at godog
[10:02:51] Is that being updated every few days?
[10:03:03] I have no idea
[10:03:16] probably given it's being revamped for the summit
[10:03:19] https://phabricator.wikimedia.org/P73206
[10:05:07] I think it can be cleaned up tho x)
[10:14:34] definitely that file doesn't get updated every few days
[10:14:47] as far as I'm concerned it can be nuked
[10:16:16] hmmm
[10:16:21] Numerous files seem to be duplicated
[10:16:27] But they're all a lot smaller
[10:17:16] let me know if you need media or database backups
[10:54:57] I've cleared up about 1GB free (after setting reserved back to 5%)
[10:55:01] Filed a task about those uploads...
[10:55:11] I'm just getting ready to leave for the airport... So will have another look later
[10:57:01] but mysql is back up
[10:57:11] Reedy: tyvm <3
[10:57:15] have a safe flight
[11:03:49] !incidents
[11:03:49] 5654 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[11:03:59] !ack 5654
[11:03:59] 5654 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[11:04:03] ^^ elukey, Amir1
[11:04:18] thanks
[11:05:09] checking
[11:06:53] did we get increased api traffic due to https://news.ycombinator.com/item?id=42936723 ?
[11:07:21] It shouldn't impact thanos
[11:07:49] Emperor godog do you have any maint going on on thanos in eqiad? It just paged
[11:08:04] sorry, I didn't know there was an incident ongoing
[11:08:42] Amir1: Emperor jynus I'm taking a look as well
[11:08:55] my question was unrelated
[11:09:32] Amir1: checking
[11:10:38] thanks
[11:12:40] answering myself, I don't see higher api traffic, but I see an increase in 304s, which is probably unrelated
[11:23:05] vgutierrez: sorry, just seen the ping, I was doing a presentation for the summit :(
[11:23:49] usual query of death, or something similar? Just saw the recovery
[11:26:56] I saw "!log bounce thanos-query on titan1002"
[11:27:18] godog: anything we oncallers should do to help?
[11:27:39] Amir1: maint> not me
[11:27:51] Thanks
[11:27:57] Amir1: ah yes, my bad, I forgot to update; we're good and I'm tracking followup in https://phabricator.wikimedia.org/T385693
[11:29:20] if the heavy queries are related to temp account dashboards, I can ping the team to make adjustments
[11:30:03] Amir1: might be, I'm fixing problematic dashboards with edit count, I hope not but the service might page again
[11:30:21] noted
[11:30:36] if it does, I'll use my ambassador hat
[11:31:58] lol ok thank you
[13:27:58] thank you claime, I started to work on that last night but discovered that I've lost access to pwstore.
[13:28:22] Should we do something about those alerts so that more/all SREs know when wikitech-static is down? Or are you already getting alerts?
[13:29:43] I saw it on karma
[13:35:55] claime: since you touched it last, would you be willing to do a software update there? I'm always happy when someone other than me has hands-on practice there...
https://wikitech.wikimedia.org/wiki/Wikitech-static#Manual_updates
[13:40:34] Ah, the fun of sysadmin jenga ;)
[14:04:03] oh, on second thought, it might not need an update; the alert I'm seeing about versions is firing because the site was down entirely.
[14:23:19] I will be roll-restarting the eventgate-main pods in eqiad and codfw to pick up https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1112451 (schema change)
[14:23:36] (I'll be doing this in a few min so people have a chance to stop me :))
[14:24:10] +1 from my side, if you could wait 10 mins I'll assist :)
[14:24:33] ack, will not start before :35
[14:28:11] vgutierrez: My skills are too rusty to debug this https://phabricator.wikimedia.org/T357603#10525515, but I think I've nailed it down to ATS doing something weird and apparently racy.
[14:29:15] guh
[14:29:38] it is, isn't it?
[14:29:55] I can't help but wonder whether I've made a mistake and am on the wrong track
[14:30:11] but log in to a random cp host and run it and ...
[14:30:21] akosiaris: ack, I'll take a look
[14:30:46] thanks!
[14:32:21] let me caffeinate myself first :D
[14:32:34] klausman: go ahead
[14:32:59] vgutierrez: go ahead, you'll need it. I know I do
[14:35:14] the best part is that if you also try to run some statistics on the Server header, that one will vary wildly in the output
[14:35:22] which is baffling to say the least
[14:38:57] elukey: starting with eqiad
[14:40:56] Ok, restart done (some pods are still terminating)
[14:42:08] klausman: please also check https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?forceLogin&from=now-1h&orgId=1&refresh=30s&to=now&var-dc=thanos&var-kafka_broker=All&var-kafka_producer_type=All&var-service=eventgate-main&var-site=eqiad&var-stream=All
[14:42:09] One of the new pods had a restart/bounce, but now has been stable for 2.5m
[14:43:08] (so not only the pod state: that is important, but it doesn't give the full picture)
[14:43:19] yep, will keep an eye on error rates etc
[14:47:34] elukey: I have a meeting in ~15m, will do codfw after that (and ping again beforehand)
[14:47:53] okok
[14:47:58] So far, the metrics/graphs look fine to me
[15:02:21] hi folks, just a quick heads-up that I'll be deploying a new conftool release in a couple of minutes. no action required / issues expected, but I wanted to flag it :)
[15:03:03] ack thanks!
[15:27:11] akosiaris: so.. requests triggering a TCP_HIT, TCP_MEM_HIT or TCP_MISS get an ETag; if a TCP_REFRESH_HIT is triggered, then the ETag is gone
[15:29:59] ah, that's interesting. So max-age=5 does play a part here
[15:30:19] yes
[15:30:24] it happens after 5s
[15:30:28] should have done my math I guess
[15:30:50] cache-control: no-cache comes from there BTW
[15:30:59] 304 responses are getting a no-cache
[15:31:09] time for me to dig into RFCs and ATS code I guess
[15:31:16] 🍿
[15:31:44] thanks
[15:43:11] akosiaris: I've replied in the task but my current understanding is that the applayer isn't complying with RFC 9110
[15:52:22] all done with conftool deployment (4.2.0-1 -> 5.0.1-1)
[15:55:22] elukey: EG dashboard for eqiad looks good over the last 1h+. Ok to proceed with codfw?
[15:56:53] vgutierrez: oh that's an interesting find. Thanks. I'll ping them on the task as well. Thanks for digging into this!
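(For context on the TCP_REFRESH_HIT finding above: RFC 9110 expects a 304 to carry the same ETag that the equivalent 200 would have had, which is why a revalidated response that drops it stands out. Below is a rough probe sketch of the kind of check being discussed, not the actual reproduction from T357603: the URL is a placeholder, the cache-status header name may differ per cache layer, and it assumes Python with the requests package is available on the test host.)

```python
#!/usr/bin/env python3
"""Probe whether the ETag survives cache revalidation (illustrative sketch only).

The URL below is a placeholder and the cache-status header name is an
assumption; adjust both for the endpoint from T357603. Requires `requests`.
"""
import time
import requests

URL = "https://example.wikimedia.org/some/cacheable/path"  # hypothetical endpoint
HEADERS = {"User-Agent": "etag-revalidation-probe/0.1 (debugging sketch)"}

def probe(samples=10, delay=6.0):
    """Fetch URL repeatedly, sleeping past the 5s max-age so the edge has to
    revalidate against the origin, and record whether an ETag came back."""
    for _ in range(samples):
        r = requests.get(URL, headers=HEADERS, timeout=10)
        cache_status = r.headers.get("X-Cache-Status", "?")  # header name may differ
        print(f"{r.status_code}  cache={cache_status:<12}  "
              f"etag={r.headers.get('ETag', '<missing>')}  "
              f"server={r.headers.get('Server', '?')}")
        time.sleep(delay)  # > max-age=5, so the next fetch should force a refresh

if __name__ == "__main__":
    probe()
```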
[15:57:58] klausman: +!
[15:57:59] +1
[15:58:38] akosiaris: as an example swift sends ETags on 304s and ATS forwards them as expected to the user
[15:58:48] by swift I mean swift.discovery.wmnet
[15:59:01] makes sense
[15:59:03] ok, proceeding to roll-restart eventgate pods in codfw.
[16:02:39] all old pods gone, no strange things happening yet
[16:05:26] klausman: I see the article-country model server getting 20X from eventgate now (from its logs)
[16:07:40] yep, thanks klausman and elukey. the article-country model-server in prod is now sending prediction change events to EventGate
[16:07:49] \o/
[16:08:10] No errors/oddities on Grafana. Will keep an eye out for a bit
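(The "keep an eye on error rates" step above is done by watching the linked Grafana dashboard. As a rough illustration of what that amounts to, here is a minimal sketch that polls a Prometheus-compatible query API, such as the Thanos query endpoint, for a 5xx rate during the roll-restart. The endpoint URL and the metric/label names below are assumptions for illustration, not the real eventgate-main metrics; it also assumes the requests package.)

```python
#!/usr/bin/env python3
"""Poll a Prometheus-compatible API for a 5xx rate during a rollout (sketch).

PROM_URL and the PromQL expression are placeholders, not the actual
eventgate-main metric names. Requires the `requests` package.
"""
import time
import requests

PROM_URL = "https://thanos-query.example.wmnet/api/v1/query"  # hypothetical endpoint
QUERY = ('sum(rate(http_requests_total{service="eventgate-main",'
         'status=~"5..", site="codfw"}[5m]))')                 # hypothetical metric/labels

def current_error_rate():
    """Run an instant query and return the summed 5xx rate, or None if empty."""
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

if __name__ == "__main__":
    # Check every 30s for ~10 minutes while the old pods are being replaced.
    for _ in range(20):
        rate = current_error_rate()
        print(f"eventgate-main 5xx rate: {rate if rate is not None else 'no data'}")
        time.sleep(30)
```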