[06:39:41] I think https://www.mediawiki.org/wiki/Git/Reviewers is broken? I'm not being added automatically to reviews anymore (nor are other people)
[06:40:18] maybe hashar could have some pointers ? ^
[06:58:00] XioNoX: I submitted https://gerrit.wikimedia.org/r/c/operations/puppet/+/1165767 10 minutes ago and Simon got added just fine as reviewer
[07:02:17] hmmm, yeah :) nevermind then, not sure why it didn't work once
[07:08:33] it can happen when a merge is really fast, it happened to me a few times. the bot doesn't backfill reviewers once a patch is merged
[07:27:35] ^ that sounds like a reasonable explanation :]
[07:29:20] there are alternatives such as watching a project ( https://www.mediawiki.org/wiki/Gerrit/watched_projects )
[07:30:42] that would get you an email notification whenever a filter matches (examples: project:operations/puppet, branch:^wmf.*)
[07:32:31] Gerrit also has a reviewer plugin which would make Gerrit add you as a reviewer. It is similar to our Git reviewers bot with the caveat that you need to be able to edit the repo configuration
[07:33:40] ex: https://gerrit.wikimedia.org/r/admin/repos/operations/homer/public,commands , at the bottom is `Edit reviewers config` which opens a modal window. A filter can be set (ex: `*` to be added to any change made to that project), add a reviewer and fill in your email. Done
[07:33:54] that is imho less convenient than https://www.mediawiki.org/wiki/Git/Reviewers
[07:34:13] XioNoX: for the long explanations ^ :-]
[07:41:56] hashar, moritzm the one that made me suspicious was https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1154319
[07:44:23] in the infobox of https://www.mediawiki.org/wiki/Git/Reviewers there is a link to the tool web page https://gerrit-reviewer-bot.toolforge.org/
[07:44:29] which has the last 50 lines of its log
[07:44:42] but that is filled by actions made in reaction to l10n-bot :]
[12:10:35] Amir1: how critical is db2146 ? (cf. https://phabricator.wikimedia.org/T398433#10967385)
[12:11:07] effie: how critical are wikikube-worker2046 and 2042 ? ^
[12:11:30] they cannot be, I can drain them if it is needed
[12:12:04] XioNoX: from orchestrator/dbctl a replica in codfw with no special role
[12:12:45] thx
[12:12:59] XioNoX: shall I go for it?
[12:13:00] this raised my curiosity https://usercontent.irccloud-cdn.com/file/qmr8hPWP/Screenshot%20From%202025-07-02%2014-12-07.png
[12:14:07] effie: not for now, but thx for your reactivity
[12:19:27] who can depool or give the green light for db2146 ?
[12:22:28] with amir or manuel out, that should be federico, I guess
[12:24:04] there is no rush, but better not leave it like that too long
[13:39:33] <_joe_> !log repooling cp7006, testing logging improvements
[13:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:39] mysql 2006 errors spike in both DCs, mostly involving commons
[13:50:02] as I said, there are 350 million queries per second on commons
[13:50:24] that load cannot be handled, 6K new connections per second
[13:50:31] can someone check traffic patterns?
[13:50:39] looking
[13:50:43] it's only on outside-facing deployments
[13:52:06] query killer kicked in
[13:52:20] but it will take time to process the backlog
[13:52:49] ack, anything we can do?
[13:53:18] nothing immediately obvious in superset
[13:53:47] it is slowing down
[13:54:45] it is not getting fixed because new queries are arriving that are too slow
[13:54:57] zabe: could that be related to your patch ?
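The "Creating sort index" state mentioned here is what SHOW PROCESSLIST reports while these sorts run. A minimal sketch of how such queries can be listed on a single replica, assuming the db-mysql wrapper used later in this log; the host name and the 60-second threshold are illustrative, not taken from the incident:

    # list long-running queries stuck building a sort index (illustrative host and threshold)
    sudo db-mysql db1199 -e "
        SELECT id, user, time, LEFT(info, 80) AS query
        FROM information_schema.processlist
        WHERE state = 'Creating sort index' AND time > 60
        ORDER BY time DESC;"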
[13:54:58] MediaWiki\Api\ApiQueryCategoryMembers::run
[13:55:07] this is a change in code, not a server issue
[13:55:12] problem is it backpressures into worker sat
[13:55:15] <_joe_> when in doubt, rollback
[13:55:18] (might this be worth a status incident?)
[13:55:20] it's rolling back
[13:55:29] <_joe_> yes I think it is
[13:55:31] queries are too slow: "Creating sort index"
[13:55:35] yep
[13:55:50] alright, it was mentioned already, just echoing here
[13:55:51] it is a select that creates a temporary index that is too slow
[13:56:13] claime: the rollback is stuck
[13:56:22] that saturates the db, and slows down the api
[13:56:46] zabe: wdym?
[13:57:13] has it stopped deploying, or are you seeing an error message or something?
[13:57:18] not
[13:57:23] it is just very slow
[13:57:28] 2 nodes within like 3 min
[13:57:31] out of 2000
[13:57:32] probably too many old pods are unhealthy
[13:57:36] yeah
[13:58:00] one option would be to do something like delete the old replicaset
[13:58:04] I can go nuclear and delete the old rs
[13:58:07] heh
[13:58:37] it is doing a full table scan on categorylinks, each query reading 735 million rows
[13:58:46] there is no db that can handle that
[13:58:59] claime: is there anything I can do?
[13:59:15] zabe: no
[13:59:21] zabe: stay cool and carry on for now, we will figure something out
[13:59:34] opinions on deleting the old rs?
[13:59:37] https://phabricator.wikimedia.org/P78739
[13:59:43] claime: I have not logged in, but we could do a for loop and delete pod
[13:59:55] creating a statuspage update
[14:00:03] claime: imo do it
[14:00:14] you could also scale it down in chunks
[14:00:29] jynus: I will fix the query
[14:00:43] i don't think chunk scaledown would work
[14:00:52] you might be right
[14:00:56] claime: is it only api-ext? I have not figured this out yet
[14:01:06] no it's everything
[14:01:44] claime: then we need to parallelise?
[14:01:57] let me try it on mw-api-ext eqiad first
[14:02:15] I am running db-kill, but if more are sent, it will only delay issues, not fix them
[14:02:43] <_joe_> can I ban all requests to the action api to commons
[14:02:47] <_joe_> to ease the pressure?
[14:03:14] <_joe_> but this was request-related
[14:03:24] <_joe_> there was a huge spike of requests since 12:50
[14:03:36] I killed stuff on db1199, any other host to kill?
[14:03:42] mw-api-ext.eqiad.main 139/260 260 139 2y39d
[14:03:45] <_joe_> it just stopped
[14:04:30] running it on db1243 too
[14:04:39] I'm getting consistent 503's on https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&uselang=en&list=linterrors&lntcategories=fostered&lntlimit=1&lnttitle=Wikipedia_talk%3AWikiProject_Cemeteries
[14:04:45] Is this a known issue?
[14:04:55] yes
[14:05:00] sukhe: could you please (or anyone else from traffic) give us an estimation of the impact ?
[14:05:16] hnowlan: can you update the https://www.wikimediastatus.net/ to say 'all wikis'? a commons DB outage breaks everything, usually
[14:05:19] <_joe_> effie: edits are non-existent, we're serving tons of 5xxs
[14:05:23] cdanis: ack, doing
[14:05:43] <_joe_> any opposition to banning action api requests for commons?
[14:05:45] effie: sorry, I am in a meeting and not really following along.
[14:05:50] _joe_: please do
[14:06:01] pods are not coming back up due to readiness probe down
[14:06:23] another suggestion in my back pocket claime is to temporarily disable the readiness probe
[14:06:25] <_joe_> what is the readiness probe?
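The readiness probe being discussed is the per-pod health check that gates a rollout; while the previous revision's pods stay unhealthy, the "go nuclear" option is to remove that old ReplicaSet. A hedged kubectl sketch of both steps; the mw-api-ext namespace and the placeholders are assumptions, not the exact commands run here:

    # inspect why pods are not becoming Ready
    kubectl -n mw-api-ext get pods -o wide
    kubectl -n mw-api-ext describe pod <pod-name> | grep -A3 -i readiness
    # delete the wedged ReplicaSet from the previous revision so the rollout can proceed
    kubectl -n mw-api-ext get rs
    kubectl -n mw-api-ext delete rs <old-replicaset-name>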
[14:06:26] dbs I think are now ok-ish, but there is very low traffic
[14:06:27] <_joe_> yes
[14:06:29] <_joe_> cdanis: +1
[14:07:18] starting a doc https://docs.google.com/document/d/1_UfyOH8jfRNFJhtdrLR2S8HLieJvH3Q_D_6nL9PvPzk/edit?tab=t.0
[14:07:24] it's the httpd probe
[14:08:07] I will take a look at k8s, I wonder if we have overwhelmed it
[14:08:23] <_joe_> enabling the rule banning api traffic to commons
[14:08:36] so dbs should be ok now, but they are not receiving back the traffic from app servers
[14:09:02] yeah because 90% of them are down
[14:09:20] <_joe_> claime: now I've stopped requests for commons
[14:09:23] <_joe_> for the api
[14:09:23] ack
[14:09:28] jynus: you should be getting traffic from the web, mw-web is up
[14:09:32] <_joe_> so that should allow pods to restart?
[14:09:38] saturation going down
[14:09:41] that should help
[14:09:43] <_joe_> ok
[14:09:48] sure, I mean traffic is low in general, I cannot differentiate at db level
[14:09:53] yeah it's turning green
[14:09:57] <_joe_> that should allow the readiness probe to be ok again?
[14:10:03] yeah
[14:10:11] _joe_: yes, most pods are healthy now
[14:10:18] <_joe_> ok once the rollback is complete, I will disable the rule
[14:10:18] I was just firing up `kubectl edit` too
[14:10:30] mw-web is ok
[14:10:39] still, read rows is 50 times higher than usual
[14:10:40] <_joe_> please let me know as we're currently banning all api requests to commons
[14:10:47] <_joe_> jynus: that's just WMCS
[14:10:49] _joe_: sure
[14:11:00] <_joe_> which is excluded from requestctl rules completely (for now)
[14:11:12] <_joe_> this is the typical rule we'd put in moat mode and apply to everyone
[14:11:36] _joe_: just a few more pods to be fully ready
[14:11:46] we are at 237/260
[14:11:51] i'll delete them so they restart
[14:11:55] <_joe_> I don't need to know if pods are ready, I need to know the rollback of the code is complete
[14:12:20] The rollback did crash around 85% completion
[14:12:27] should I retry syncing it?
[14:12:39] that sounds like we need to run one more scap
[14:12:57] docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-07-02-130612-publish-81
[14:12:58] <_joe_> yes
[14:13:00] that's the correct image, yes?
[14:13:04] zabe: is there a revert ready?
[14:13:06] it's timestamped an hour and 6 minutes ago
[14:13:14] <_joe_> please let's re-run scap sync with the revert
[14:13:21] +1
[14:13:22] agree
[14:13:26] effie: the revert is the rollback that crashed
[14:13:32] I think
[14:14:03] so just running scap sync-world should be enough
[14:14:09] claime: I am not sure, zabe will tell us, to my understanding scap started its auto-rollback
[14:14:10] running
[14:14:21] The revert is merged and fetched to deploy1003
[14:14:22] zabe: running what?
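The retry being discussed amounts to re-running the sync from the deployment host once the revert commit is in place. A rough sketch; the /srv/mediawiki-staging checkout path is an assumption, not something quoted in the log:

    ssh deploy1003
    cd /srv/mediawiki-staging && git log --oneline -1   # confirm the revert is the checked-out tip
    scap sync-world 'retry revert'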
[14:14:23] <_joe_> claime: only if the code has been updated
[14:14:30] <_joe_> zabe: ack
[14:14:32] <_joe_> phew
[14:14:45] claime: another try to sync the revert with scap
[14:14:49] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 1.911% idle
[14:14:53] ack
[14:15:46] db metrics looking good but row reads still worry me (it means the code is still issuing a low number of slow queries)
[14:16:04] as of 14:15
[14:16:06] <_joe_> I don't want to put too much pressure but we're denying all api requests to commons atm
[14:16:21] not much we can do other than wait for scap
[14:16:27] I don't know if it is the backlog
[14:16:33] <_joe_> zabe: as soon as the deploy is done, please lmk
[14:16:36] The revert was a manual revert of the patch by myself which I tried to scap, but scap failed at around 85% as discussed above
[14:16:39] or just that deploy hasn't finished
[14:16:43] _joe_: will do
[14:17:05] <_joe_> for everyone's sake, https://requestctl.wikimedia.org/action/cache-text/temp_ban_api_commons
[14:17:32] dbs in codfw recovered on their own
[14:17:37] didn't get as stuck
[14:18:02] mediawiki-multiversion:2025-07-02-130612-publish-81 is the deployed image, and it's possible it's the one with the bad code, so we need to wait for the scap run to be sure
[14:18:15] yeah I see a different image in codfw claime
[14:18:27] 2025-07-02-134646-publish-81
[14:18:30] cdanis: probably it got deployed ok there
[14:18:34] aye
[14:18:35] <_joe_> let's wait for scap sync-world now shall we?
[14:18:52] <_joe_> 6:18:28 <+logmsgbot> !log zabe@deploy1003 Finished scap sync-world: retry revert (duration: 04m 27s)
[14:18:53] and eqiad got stuck because of the not ready pods, and rolled back via helmfile rollback
[14:18:53] scap technically finished, but it only touched 257 nodes for some reason
[14:19:13] wtf
[14:19:26] <_joe_> 257 nodes?
[14:19:28] they all look updated
[14:19:29] zabe: is that about 15% of the total number?
[14:19:31] <_joe_> what does it mean?
[14:19:43] btw eqiad mw-api-ext is now at the same as codfw, docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-07-02-134646-publish-81
[14:19:46] https://phabricator.wikimedia.org/P78740
[14:19:56] so it might have been literally just the eqiad mw-api-ext rs that needed fixing
[14:19:58] which is 260 btw
[14:19:59] 11%
[14:20:26] <_joe_> look, I think that's the 11% who had the non-rolled-back image?
[14:20:35] is there an incident doc?
[14:20:35] <_joe_> anyways, jynus how does the db look now?
[14:20:36] yes that's what I'm saying
[14:20:43] jynus: https://docs.google.com/document/d/1_UfyOH8jfRNFJhtdrLR2S8HLieJvH3Q_D_6nL9PvPzk/edit?tab=t.0
[14:20:45] jynus: 10:07:18 starting a doc https://docs.google.com/document/d/1_UfyOH8jfRNFJhtdrLR2S8HLieJvH3Q_D_6nL9PvPzk/edit?tab=t.0
[14:20:58] _joe_: about to paste the answer
[14:21:09] ok _joe_ you can unban now i think
[14:21:10] i'm +1 for disabling the rule _joe_
[14:21:16] shall we open the api _joe_ ?
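One way to settle the "which image is live where" question above is to read the image straight off the Deployment objects. A hedged sketch; the namespace and deployment names are assumptions based on the release names mentioned in this log:

    for ns in mw-api-ext mw-web mw-api-int; do
      echo "== $ns =="
      kubectl -n "$ns" get deploy -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
    done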
[14:21:21] see doc: holding healthy, but metrics look wrong
[14:21:42] I am ok with that
[14:21:55] I just cannot guarantee they will overload again
[14:22:02] *wont
[14:22:10] <_joe_> done
[14:22:12] <_joe_> unbanned
[14:22:17] monitoring
[14:22:21] mw-web pods are like 30m old and have the right image fwiw, they must have rolled back fine first time
[14:22:22] <_joe_> within 10-15 seconds it should be back
[14:22:32] hnowlan: yeah they did
[14:22:44] hnowlan: it makes sense it'd just be mw-api-ext getting wedged, if that ran the codepath with the truly pathological behavior
[14:22:49] hnowlan: they were behaving well most of the time
[14:23:09] I'll give those pods some biscuits
[14:23:23] <_joe_> hnowlan: the mw-web pods are potty-trained, basically
[14:23:31] sat going up
[14:23:37] yeah, just musing about the scap numbers zabe saw
[14:23:43] we'll see if it levels out
[14:23:46] one question I have for afterwards is why Amir1's db section circuit breaking didn't kick in here
[14:23:52] so there could be 2 explanations for the bad metrics: either there are some dbs with stuck stuff due to backlog, or the bad queries are still being produced
[14:24:00] we want the 1st scenario
[14:24:01] it's going to spike
[14:24:24] how are the dbs doing jynus
[14:24:35] traffic going up in a good way, so far so good
[14:24:45] read rows not spiking
[14:24:52] we're at 90% sat I don't like this
[14:24:59] what about adding some pods to manage the aftermath
[14:25:05] actually, open connections going down
[14:25:07] if our theory is that it will cool down
[14:25:09] which is very good
[14:25:36] monitoring https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-6h&to=now&timezone=utc&var-site=eqiad&var-group=core&var-shard=s4&var-role=$__all
[14:25:40] especially read rows
[14:25:50] api-ext is at about 100%
[14:25:50] mw-api-ext is not happy
[14:25:59] latency and sat are tapped out
[14:26:03] should we try banning the API requests again now that the rollback is out of the picture?
[14:26:06] briefly
[14:26:12] claime: it's worse than that, we're losing mw-api-ext pods
[14:26:15] they are going unhealthy
[14:26:17] yeah
[14:26:25] idle workers is at 0 while active workers goes down (because the scrape fails)
[14:26:31] rebanning
[14:26:52] reban in place
[14:26:56] My connection is really crappy here
[14:27:16] joe mentioned a spike in requests, has it been classified what that was?
[14:27:35] no
[14:27:38] open connections spiked again
[14:27:40] that's bad
[14:27:40] I trust it was commons api, unless
[14:27:53] I banned the api requests to commons again
[14:28:31] saturation going back down
[14:28:32] <_joe_> I doubt the rollback worked as intended
[14:28:58] ok I have another idea, what if we rollback the helmrelease
[14:28:58] <_joe_> zabe: what is your revert?
[14:29:00] can someone try to rerun a scap backport of the revert?
[14:29:04] <_joe_> effie: no.
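For reference, the scap backport route asked about here is roughly a one-liner pointed at the revert change linked just below; the invocation is a hedged sketch, not something run during the incident:

    scap backport https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1165897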
[14:29:08] I have to go to a meeting
[14:29:14] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1165897
[14:29:28] <_joe_> claime: I just skipped the kickoff of the priority 0 project for SRE :)
[14:29:41] I can try scapping it again
[14:29:42] <_joe_> the problem is not you skipping a meeting, but people who don't
[14:29:56] categorylinks is the issue I mentioned at https://phabricator.wikimedia.org/P78739
[14:29:59] as the source of issues
[14:30:00] <_joe_> zabe: no please hold on, I'd like an SRE to do it
[14:30:05] sure
[14:30:17] <_joe_> anyways, I'll do the backport
[14:30:42] if the patch is merged, don't we need a sync world only ?
[14:31:20] Activeuserpager is unrelated. I already created a ticket for that
[14:31:21] that's what was done and we're not sure if the synced code was right or not
[14:31:55] <_joe_> effie: no, because I suppose the patch wasn't correctly applied to the codebase
[14:32:20] should we look directly at a file in a running pod?
[14:32:35] is it just InitialiseSettings.php
[14:32:43] let me check on mw-experimental
[14:33:20] <_joe_> it is being rebuilt now
[14:33:37] <_joe_> so now you'll see the correct code effie
[14:34:39] <_joe_> ok
[14:34:55] <_joe_> the check phases are so slow, sigh
[14:35:03] on 2025-07-02-134646-publish-81, I only see the change for group 0
[14:35:05] grep wgCategoryLinksSchemaMigrationStage -A10 $(find -name InitialiseSettings.php)
[14:35:06] grafana (or prometheus) is starting to fail for me, losing metrics
[14:35:07] 'wgCategoryLinksSchemaMigrationStage' => [
[14:35:09] 'default' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD,
[14:35:11] 'group0' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW,
[14:35:13] ],
[14:35:18] effie: yeah, so that means it was rolled back correctly
[14:35:35] jynus: which DC?
[14:35:42] eqiad
[14:35:44] so we may have another issue on our hands ?
[14:35:51] or you mean my dc?
[14:35:55] <_joe_> no
[14:36:04] <_joe_> no effie I doubt it
[14:36:13] <_joe_> I'm 99% sure it was a sync issue
[14:36:17] I am updating the status page to clarify that the commons API is down
[14:36:21] I cannot load any metrics, 503 service unavailable
[14:36:24] <_joe_> let's see after I've done the deployment
[14:36:30] moritzm: could ganeti work on eqiad impact prometheus as well in there?
[14:37:07] no, it's thanos
[14:37:10] probe failed http_thanos-query_ip4
[14:38:03] <_joe_> slyngs: can you take a look at thanos_query, please?
[14:38:29] I have some metrics back, read rate is still 40x higher than usual on commons
[14:38:45] nothing really changed in ganeti/eqiad, I'm live-migrating a few VMs in the background to rebalance the cluster following the ITS reboots, but this doesn't impact running VMs
[14:38:48] <_joe_> jynus: that might be related to the wmcs requests
[14:39:08] <_joe_> but if that's still the case after the complete rollback, then we have some other issue
[14:39:48] open connections however, is ok
[14:39:48] I'm on thanos btw
[14:39:50] ok another theory
[14:39:53] could there be some lingering cron jobs that got launched with the bad config?
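One way to answer the cron-job question is to look at the provenance of the open connections, grouping the processlist by user and client address, which is roughly what gets done by sampling a few lines below. A sketch assuming the db-mysql wrapper and an illustrative replica:

    sudo db-mysql db1242 -e "
        SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
        FROM information_schema.processlist
        GROUP BY user, client
        ORDER BY conns DESC;"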
[14:39:59] <_joe_> cdanis: yes
[14:40:10] <_joe_> but that would be clear to jynus in terms of provenance of queries
[14:40:21] <_joe_> instead, what spikes is specifically api requests
[14:40:35] jobs would use -api-int though right
[14:40:39] <_joe_> my fear is that the issue is with mysql stats being "messed up"
[14:40:41] <_joe_> effie: yes
[14:40:47] <_joe_> if they even did
[14:40:50] <_joe_> or mw-jobrunner
[14:40:52] yeah, as far as I understand regarding different users, this is interactive
[14:40:53] let me check connections from the various deployments
[14:40:56] <_joe_> ok deploy complete
[14:41:04] to mysql, maybe that would give something useful
[14:41:13] <_joe_> I'll re-disable the requestctl rule
[14:41:32] actually, I am wrong, it is mediawiki-main-tls-service.mw-jobrunner.svc.cluster.local.
[14:41:40] docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-07-02-143219-publish-81
[14:41:53] nah, still mw-api-int.svc.cluster.local.
[14:42:08] <_joe_> api-int?
[14:42:40] <_joe_> disabling the rule again
[14:42:40] but those are not overloading
[14:42:56] they are probably just the ones running the most because of the banning
[14:43:11] I am just sampling at random
[14:43:16] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s4&var-role=$__all&from=now-6h&to=now&timezone=utc&viewPanel=panel-9
[14:43:22] <_joe_> rule disabled again
[14:43:33] The connections didn't pile up that much
[14:43:49] Amir1: look at https://phabricator.wikimedia.org/P78739
[14:43:57] type: ALL on categorylinks
[14:44:06] yeah that's an issue
[14:44:16] the active users one though is unrelated (as I said above)
[14:44:18] those were the ones piling up
[14:44:25] https://phabricator.wikimedia.org/T397992
[14:45:06] that's just bad, the one I sent you is cluster-killing
[14:45:39] saturation going back up again
[14:45:59] <_joe_> yeah can we depool mw-api-ext-ro from eqiad please? I have a suspicion
[14:46:18] I can do it
[14:46:35] MediaWiki\Category\CategoryViewer::doCategoryQuery
[14:46:37] <_joe_> what I want to test is if the codfw databases will have the same issue
[14:46:48] worth trying
[14:46:50] ^I think it is this one, categorylinks table
[14:47:00] <_joe_> zabe: I don't think the above is related to your change, was it?
[14:47:15] _joe_: yeah it should be
[14:47:16] <_joe_> this looks more and more to me like a problem with the dbs, rather than any code
[14:47:21] it's related to the normalization work
[14:47:25] yep
[14:47:31] _joe_: potentially, if the new DB format somehow makes the old queries a lot slower
[14:47:39] that's what is happening
[14:47:43] https://phabricator.wikimedia.org/P78739#316147
[14:47:45] <_joe_> ah wait, we didn't roll back the db format?
[14:47:46] we started joining linktarget
[14:47:48] fone
[14:47:49] do you need any more hands here?
[14:47:50] done
[14:47:51] The bug is this, the force index is incorrect
[14:47:59] <_joe_> ok
[14:48:27] <_joe_> so the problem is as I thought, the query stats changing and some indexing not working
[14:48:41] <_joe_> Amir1 / zabe how hard is it to get a patch?
[14:49:06] not too hard but revert of the config change should fix it
[14:49:16] 'wgCategoryLinksSchemaMigrationStage' => [
[14:49:17] Isn't that possible right now?
[14:49:18] 'default' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD,
[14:49:18] <_joe_> btw
[14:49:20] 'group0' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW,
[14:49:22] ],
[14:49:24] that's the live config for a while now Amir1
[14:49:26] with group1 rolled back from read_new
[14:49:33] <_joe_> Amir1: we reverted the config
[14:49:51] <_joe_> btw in codfw for now I don't see the saturation spiking as much
[14:49:55] so things should be fine, slow queries in group0 won't make a difference
[14:50:15] yes, but twice now they aren't fine, once we allow action api against commons in eqiad
[14:50:18] <_joe_> Amir1: yeah but they're not
[14:50:27] <_joe_> so the status
[14:50:29] <_joe_> is
[14:50:35] <_joe_> all read queries for api to codfw
[14:50:44] <_joe_> commons api is open atm
[14:50:54] then either the config is not deployed properly or it's something else ("traffic patterns")
[14:50:57] <_joe_> codfw api seems to be doing ok
[14:51:09] group0 was deployed for days now
[14:51:17] and categorylinks there is tiny
[14:51:20] Amir1: ok well, codfw is now taking the full eqiad+codfw load just fine for mw-api-ext-ro
[14:51:25] <_joe_> Amir1: neither, as I said, some stats in the commons db in eqiad got messed up
[14:51:27] so something else is wrong
[14:51:56] It was hundreds of queries doing "971372684 | wikiuser2023 | 10.67.135.217:52760 | commonswiki | Query | 1265 | Creating sort index | SELECT /* MediaWiki\Api\ApiQueryCategoryMembers::run */ "
[14:52:07] <_joe_> jynus: can you take a look at codfw?
[14:52:11] 1265 is the number of seconds executing
[14:52:12] <_joe_> the same queries should happen there
[14:52:18] ^
[14:52:28] <_joe_> anyways, I need to step afk for a sec
[14:52:33] codfw overloaded
[14:52:51] but load is so low that it went back to normal
[14:53:03] jynus:
[14:53:06] https://logstash.wikimedia.org/goto/ee70b7c385b0de4eec5607bf7c20e4e7
[14:53:11] https://grafana.wikimedia.org/goto/8ZtvxNsHR?orgId=1
[14:53:33] please folks, add a title saying what the graph you are pasting is about, it would help a lot
[14:53:57] mine is codfw effects, it spiked on row reads, but the revert seemed to have an effect
[14:54:08] I think it is the patch, just the state didn't fully recover
[14:54:11] three replicas in eqiad are particularly hit hard
[14:54:25] db1242, db1247, db1199
[14:54:29] (the job queue or whatever persists the bad queries or something)?
[14:54:45] jynus: no
[14:55:00] all are API group
[14:55:05] the queries themselves shouldn't be in the jobqueue, they just trigger code that the mw-jobrunner pods run
[14:55:12] ok
[14:55:14] which should in theory be running whatever has been scap'd out
[14:55:22] then the bad state is still there
[14:55:37] https://usercontent.irccloud-cdn.com/file/KTriNFpv/grafik.png
[14:55:41] cwhite: jhathaway: are you up to date on the current status?
[14:56:04] saturation on codfw has levelled out at a pretty reasonable rate
[14:56:07] I think so, I've read the back scroll at least
[14:56:18] I think I have a grasp. Been keeping up with this channel.
[14:56:21] <_joe_> hnowlan: proving my hunch the issue is possibly with the eqiad dbs
[14:56:44] so 1) I think we should update the status page, perhaps to 'monitoring' state
[14:56:57] ack
[14:56:58] 1) on it
[14:57:01] and 2) as a High but not an UBN! we have to diagnose eqiad and then repool mw-api-ext-ro in eqiad
[14:57:17] but the current state is livable temporarily
[14:57:28] <_joe_> cdanis: I think it's UBN! tbh
[14:57:40] <_joe_> we're surviving but in a heavily degraded state
[14:57:56] <_joe_> it's not incident-response territory but UBN! nonetheless
[14:57:57] I know why they are slow
[14:58:04] effie: already done, I am IC
[14:58:05] sure, degraded redundancy
[14:58:13] the queries from the previous revert are still running
[14:58:15] hnowlan: oh sorry
[14:58:20] if we kill them it'll come back to normal
[14:58:31] that's what I tried to say
[14:58:33] <_joe_> Amir1: uh I was under the impression they were killed
[14:58:33] the time in show processlist is in the hours now
[14:58:47] <_joe_> jynus: sorry that wasn't clear
[14:58:54] :)
[14:59:11] I guess the query killer is not running or broken or whatever
[14:59:14] give me a second, I'm about to do something terrible
[14:59:24] <_joe_> killing queries manually?
[14:59:31] I did that
[14:59:38] but I guess they came back?
[14:59:40] <_joe_> so there are no more long-running queries?
[15:00:08] where are you seeing hour-long queries?
[15:00:24] jynus: probably on those three replicas that Amir1 mentioned
[15:00:29] sudo db-mysql db1242 -e "show processlist" | grep -i "Creating sort index" | cut -f 1 | xargs -I{} bash -c 'sudo db-mysql db1242 -e "kill {};"'
[15:00:40] gods of bash forgive me please
[15:00:48] that's not that bad
[15:00:56] also you can just `sudo -s` :P
[15:01:01] sure, those are the ones I didn't run kill
[15:01:12] on the others, I did it myself
[15:01:28] running on this, then I will run on db1247
[15:02:00] fixed now https://logstash.wikimedia.org/goto/625668938f1794b366460ae086232b53
[15:02:15] First thing I did was /usr/bin/pt-kill --kill --print --victims all --interval 5 --busy-time 10 --idle-time 10 --match-command 'Query|Execute|Sleep' --match-user $user --log /var/log/db-kill.log h=localhost
[15:02:28] rows read on dbs has plunged in eqiad
[15:02:30] it made things better until the revert
[15:02:32] looks like it worked
[15:02:42] \o/
[15:03:08] so repool mw-api-int-ro in eqiad?
[15:03:20] I also know why circuit breaker didn't kick in, I bumped the threshold during the kafka broker stuff
[15:03:30] I didn't revert it
[15:03:30] claime: give it a minute to fully level out maybe
[15:03:59] queries still around ~7M
[15:04:04] this is yet another instance where I wish we had weights or fractional load on DNS discovery
[15:04:04] aka read rows :)
[15:04:19] there is an automation that handles that: https://wikitech.wikimedia.org/wiki/Db-kill
[15:04:50] <_joe_> cdanis: yeah, my goal was to get to it with the control plane :)
[15:04:54] down to 1M on s4
[15:04:55] <_joe_> bypassing dns discovery
[15:04:58] we don't need to rush it, codfw is handling the -api-ext just fine: mw stats on codfw https://grafana.wikimedia.org/goto/EcfqxNsHR?orgId=1
[15:05:23] but why didn't the query killer do it in the first place?
[15:05:43] jynus: the threshold was higher than usual due to another past incident
[15:05:44] vgutierrez: very likely some rolling average somewhere, the slow queries got to zero in logstash
[15:06:00] jynus: did we ever fix the issues where the query killer stops working when the mariadb daemon is really badly overloaded?
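A quick way to cross-check the query killer against reality is to run (roughly) its own condition by hand and see whether anything matches. A hedged sketch against information_schema rather than whatever table the killer actually reads; the host is illustrative and the thresholds mirror the "processlist_time between 60 and 1000000" snippet quoted just below:

    sudo db-mysql db1242 -e "
        SELECT id, user, time, state, LEFT(info, 60) AS query
        FROM information_schema.processlist
        WHERE command = 'Query' AND time BETWEEN 60 AND 1000000
        ORDER BY time DESC;"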
[15:06:01] Amir1: graph back to normal now
[15:06:05] anyway +1 from me to repool mw-api-ext-ro@eqiad
[15:06:18] cdanis: sorry, not a DBA, haven't touched a db in ages
[15:06:21] yeah +1, queries have levelled out
[15:06:25] jynus: LOL
[15:06:28] ok I am pooling back
[15:06:34] cdanis: no but I also don't think it ever kicked in here, since the db was responsive
[15:06:37] but query killer says "ps.processlist_time between 60 and 1000000"
[15:06:46] it should have killed long selects
[15:06:52] pooling api-ext-ro on eqiad
[15:07:07] the dbs were not overloaded
[15:07:24] but apparently the query killer didn't work
[15:07:29] that's an actionable
[15:07:47] e.g. maybe some condition has changed
[15:08:04] jynus: the slow db queries dashboard shows 'USE `commonswiki`' as one of the very-long-running queries
[15:08:24] so it seems possible that even select processlist was never completing, or other queries the query killer does?
[15:08:25] jynus: noted
[15:08:46] it completed because I ran it, it must be some other logic that changed
[15:08:52] e.g. username or other condition
[15:09:03] I go back to the conference, I'll file a gazillion follow ups later
[15:09:06] not a worry now, but something to check later
[15:09:18] anything needed of me?
[15:09:19] because even if it cannot prevent an issue
[15:09:29] it usually helps with recovering after an overload
[15:09:32] like this case
[15:09:41] (well, unlike this case :-D)
[15:09:50] Amir1: you're probably safe to go :)
[15:09:52] also zabe: commonswiki has the largest categorylinks by a wide margin, let's not do group0 -> group1 -> group2
[15:10:12] Amir1: <3 thank you and please log off
[15:10:16] let's do small -> medium -> large
[15:10:23] sounds good
[15:10:37] <_joe_> ok, crisis averted completely I'd say?
[15:11:01] mw-api-ext eqiad showing a sensible increase in saturation rather than the exponential spikes we saw
[15:11:12] I think we're out of the incident anyway
[15:11:17] I'll update docs/pages
[15:11:20] <_joe_> hnowlan: which is expected given it's getting back so much traffic
[15:11:32] <_joe_> ok, I'll join my meeting
[15:12:12] huh TIL you can't delete updates on statuspage, only edit them
[15:12:43] hnowlan: you can update entire incidents, though
[15:12:46] *delete
[15:13:33] let's pretend it never happened :D
[15:13:41] I did some quick sanity check and I couldn't see any obvious bug (e.g. due to queries or server changing)
[15:13:41] less paperwork for me
[15:14:47] https://downdetector.com/status/wikipedia/ vs https://downdetector.com/status/twitter/
[15:16:04] claime: for when you're back from your meeting - did you take any action as regards the k8s replicasets etc when it was being considered?
[15:16:58] hnowlan: yes I deleted the old replicaset that was wedged on mw-api-int eqiad
[15:18:44] yeah, and I think that would have resolved the issue, had the pathological db queries been killed on all the db hosts instead of just some
[15:20:04] claime: cool, thanks
[15:21:58] What's the correct syntax for a cumin query targeting a role + a row? I'm striking out with `sudo cumin 'P{O:cirrus::opensearch} and P{netbox::host%location ~ "A.*codfw"}'`
[15:24:52] try P{P:netbox*}
[15:25:08] sudo cumin 'P{O:cirrus::opensearch} and P{P:netbox::host%location ~ "A.*codfw"}'
[15:25:35] excellent, that works. Thanks sukhe !
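The fix sukhe gives is wrapping the PuppetDB fact query in P{P:...}; the same shape works for any role-plus-location pairing. A purely illustrative variant (the role and rack row here are made up, not hosts anyone targeted in this log):

    sudo cumin 'P{O:cache::text} and P{P:netbox::host%location ~ "B.*eqiad"}'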
[15:26:19] * inflatador should really stop relying on bash history for stuff like that
[15:28:31] might be worth adding to the cumin wikitech page
[15:43:11] it is already
[15:43:31] last 2 bolded examples in https://wikitech.wikimedia.org/wiki/Cumin#PuppetDB_host_selection
[15:56:25] I am banging my head against the query killer and I cannot find the bug, but the logs say it stops at some point
[22:38:24] inflatador: have you folks been doing any work on wdqs2009 in codfw? it looks like there was a page for elevated 5xx errors there, which appear to be elevated since ~ 21:50 UTC
[22:38:57] swfrench-wmf: see -ops
[22:39:06] * swfrench-wmf nods
[22:39:08] ryankemper is looking into it
[22:50:55] swfrench-wmf ACK, just saw the p-age but looks like ryankemper is on it
[23:06:54] thanks, i.nflatador!