[07:46:59] inflatador: when you get a chance https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126486?usp=dashboard
[08:26:00] arturo: m5 master has been switched over
[08:32:48] marostegui: thanks!
[10:55:42] this is parsoid going crazy overloading content dbs
[10:55:48] marostegui: Amir1^
[10:55:49] yeah it seems es hosts are overloaded
[10:55:57] es?
[10:56:02] scrapers?
[10:56:08] yep, cluster 3, cluster26
[10:56:13] cluster27
[10:56:15] etc.
[10:56:30] but there is also a deploy in progress FYI
[10:57:27] page recovered
[10:57:35] metrics seem to be recovering, at least memcached and parsoid
[10:57:35] http latency increased since 10:47
[10:57:52] scap stopped
[10:57:55] effie: deploy failed, right?
[10:58:17] yes it did
[10:58:29] backend response time still high but might be starting the recovery
[10:58:40] was everything rolled back or do we need to run some cleanup, effie?
[10:58:42] we are recovering
[10:58:47] yeah, I don't see full recovery but it could be metrics latency
[10:59:29] do we have evidence that it was deploy-related or should I dig into incoming traffic?
[10:59:30] elukey: most likely no
[10:59:41] volans: I will first go through the deployment
[10:59:46] and get back to you
[10:59:58] ack, standing by
[11:00:02] <_joe_> volans: what did you see?
[11:00:11] volans: I checked a sample and didn't see anything obvious, but who knows
[11:00:31] so far nothing, hence my asking; if we don't have evidence of anything deploy-related I will dig more to see if I can find anything
[11:00:35] <_joe_> volans: the deployment was to move requests for parsoid all to php 8.1
[11:00:37] but if we think it was the deploy no point :)
[11:01:10] parsoid, api-int, jobrunner
[11:01:14] <_joe_> volans: no I mean what were the things you saw... if the issue was saturation of php workers on 8.1, then it's probably that
[11:01:19] <_joe_> otherwise I doubt it
[11:01:25] what was the previous pct?
[11:01:27] <_joe_> (that == the deployment)
[11:01:37] It was 50%
[11:01:37] <_joe_> 50%
[11:01:52] <_joe_> if you saw a spike of load in ES, that shouldn't be this deployment
[11:02:03] <_joe_> it would look like something caused a spike of parses
[11:02:05] I wonder if it's some maxUnavailable shenanigans
[11:02:13] failed probes for parsoid, db-related errors
[11:02:14] overload started 10:47
[11:02:15] <_joe_> claime: causing db load?
[11:02:17] a ton of errors in logstash are like "Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server"
[11:02:21] we did increase maxUnavailable recently, right?
[11:02:25] that's the es
[11:02:28] https://logstash.wikimedia.org/goto/ded4cc29c3349c6dd0ecc7fe1bf748d5
[11:02:30] <_joe_> people, it's not the deployment
[11:02:37] ok
[11:02:41] ack
[11:02:49] <_joe_> if anything, increasing maxUnavailable would reduce the db load
[11:03:06] kamila_: only for -migration, main remained at default
[11:03:16] oh, ack, thanks claime
[11:03:32] ok so far, here is what the deployment was supposed to do
[11:03:58] example of external storage load https://grafana.wikimedia.org/goto/kl3BUShNR?orgId=1
[11:04:00] switch all main deployments for api-int, jobrunners and parsoid to php81.
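(Aside: as a rough illustration of the maxUnavailable question above, the rollout strategy actually in effect can be read off the live Deployment objects. The namespace name is taken from the conversation; the kubeconfig/context and deployment names are assumptions and may differ.)

    # Sketch: list maxUnavailable for the deployments in a given namespace
    # (run with whatever kubeconfig/context points at the cluster in question)
    kubectl -n mw-api-int get deploy \
      -o custom-columns='NAME:.metadata.name,MAXUNAVAILABLE:.spec.strategy.rollingUpdate.maxUnavailable'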
[11:04:23] connections went x10
[11:04:24] <_joe_> the slow queries dashboard is also saying there are a lot of slow queries in s4
[11:04:42] <_joe_> so I guess something scanning commons *AGAIN*
[11:04:43] and from the mediawiki side, there were db errors
[11:05:04] [{reqId}] {exception_url} Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server
[11:05:14] <_joe_> volans: look for uncached traffic to commonswiki
[11:05:18] k
[11:05:22] <_joe_> maybe api.php even
[11:05:39] volans: are you looking for the traffic part?
[11:05:48] commons.wikimedia.org not, checking api
[11:05:49] yes
[11:09:09] jynus: can you please check the DB's state? I would like to understand if I can run scap again or not
[11:09:39] checking
[11:10:09] effie: things are still weird
[11:10:22] alright
[11:10:34] https://grafana.wikimedia.org/goto/yLDkwShNR?orgId=1
[11:10:40] the incident is ongoing
[11:10:58] ok, I can do IC
[11:11:07] we have 3x to 4x the number of normal connections
[11:11:13] anyone who is working on this, can you please share what you are doing atm
[11:12:17] * volans checking traffic
[11:12:28] <_joe_> I'm looking at the slow queries dashboard now
[11:12:50] How can I help?
[11:13:23] marostegui: if jynus is on the DB side and does not need any more assistance, we are alright I believe
[11:13:38] ok, I will stand by, jynus if you need more eyes let me know
[11:14:00] https://docs.google.com/document/d/1dO3TQVl7vSUV-4YPkEQhAwTUa4mZnDEypOTkTdEqZl0/edit?tab=t.0
[11:14:10] marostegui: one question, is es2024 down?
[11:14:30] shouldn't be
[11:14:31] checking
[11:15:02] jynus: that host doesn't exist
[11:15:18] ok, then unrelated, it still appeared as uncontactable on the board
[11:15:19] volans, jynus, we are moving this to _private for now
[11:15:23] k
[11:15:24] not important now
[11:15:50] jynus: it wasn't removed from zarcillo, doing it now
[11:16:15] no worries, I only asked because I saw it while investigating the issue
[11:16:15] done
[11:36:54] I just got an error on some wdqs instances on puppet executions: Error: Could not send report: Error 500 on SERVER: Server Error: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null
[11:36:58] is this a known issue?
[11:37:53] I think filippo had it like yesterday, I don't know more
[11:42:03] thx
[12:29:51] vgutierrez: yes I filed T388629
[12:29:52] T388629: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629
[13:00:14] gehel: brouberol, gmodena, Amir1, volans let's continue the chat here
[13:00:29] SUP is paused for now, and that seems to fix the overload, but we're not ingesting any new changes into the search indices. Not something we want to keep.
[13:01:34] It seems that we've recently increased the parallelism in the search update pipeline, which could have increased the load on the action API. We can decrease the parallelism again and re-enable the SUP. Objections?
[13:01:35] can we tweak the rate of it? what's the minimum rate acceptable for the indexes?
[13:01:58] (to have the indexes done in a reasonable amount of time)
[13:03:11] The general SLO is 10 minutes (we want edits to be available as search results in < 10 minutes). But the issue here is likely more related to the saneitizer, which is the component that goes over all pages every 8 weeks to catch up on updates that were missed.
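(Aside: a rough sketch of the "3x to 4x the number of normal connections" check mentioned above, assuming direct access to the mysql client against the overloaded external storage host; the FQDN is an assumption based on the host name in the log.)

    # Sketch: current vs historical-max connection counts on an ES host
    mysql -h es2040.codfw.wmnet -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected','Max_used_connections');"
    # Sketch: where the connections are coming from, grouped by user/client/command
    mysql -h es2040.codfw.wmnet -e "
      SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, command, COUNT(*) AS n
      FROM information_schema.processlist
      GROUP BY user, client, command
      ORDER BY n DESC LIMIT 20;"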
[13:03:15] it would be great to have commons and wikidata have their own "low prio" lane
[13:03:23] gehel: can someone write on the status doc, under the What Happened section, where we are now and what happened?
[13:03:29] they are each 100M pages
[13:03:43] effie ack
[13:03:49] gmodena: cheers thanks
[13:04:07] gmodena: no finger pointing, it is for the report afterwards
[13:04:08] This patch reduces parallelism for the SUP consumer-search job https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126988
[13:04:11] Amir1: that sounds like a good idea. Let's talk with dcausse when he is back.
[13:04:15] effie no worries :)
[13:04:59] Amir1: I have a pending deployment that is blocking the train, i.e., unless I go on with it, the train can't run
[13:05:02] cloudelastic parallelism was also recently increased, I'm trying to figure out whether that could have added pressure
[13:05:13] and if it's safe to scale down
[13:05:21] Amir1: shall I go on with it, DB wise?
[13:05:50] I can't say if 2 is low enough but surely looks better than 3 :D
[13:05:57] effie: sure!
[13:06:41] mw-api-int canaries are whining again
[13:07:39] although it may just be because there are so few of them they get easily saturated
[13:08:12] claime: can you tell me which cluster "10.194.169.43:47532" belongs to?
[13:08:28] (or tell me how to extract it)
[13:08:33] mw-jobrunner
[13:08:36] Amir1: dig -x
[13:09:14] oh nice, last time I tried dig -x, it gave some general wikikube address
[13:09:28] that's been fixed for a while iirc
[13:09:45] I think c.danis worked on making the pod reverse DNS more informative a while back
[13:09:46] *pods
[13:09:52] That's awesome <3
[13:10:12] topranks as well :) T376291
[13:10:13] T376291: Authdns: automate reverse DNS zone delegation for k8s pod IP ranges - https://phabricator.wikimedia.org/T376291
[13:10:16] Of course it would be Chris, who else is DNS crazy
[13:10:52] beside the point, the spot checks show that the current load on ES is from jobrunners
[13:10:57] now time to dig jobs
[13:11:17] I did https://trace.wikimedia.org
[13:11:23] with "server.address="es2040" db.namespace="commonswiki""
[13:11:26] but nothing of use
[13:11:27] Amir1: again?
[13:11:55] effie: maybe it's unrelated but es2040 is quite under duress https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&var-job=All&var-server=es2040&var-port=9104&from=now-3h&to=now
[13:12:59] Amir1: fwiw es2037 too, it's the one I was looking at before
[13:13:02] are we ok enough to merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126988 and re-enable SUP? Should we wait for something?
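(Aside: an illustration of the dig -x trick mentioned above for mapping a wikikube pod IP back to something meaningful; the exact shape of the PTR record is an assumption.)

    # Sketch: reverse-resolve a pod IP seen in a processlist or trace
    dig -x 10.194.169.43 +short
    # With the delegation from T376291 in place, the answer should identify the
    # pod's namespace (e.g. something containing mw-jobrunner) rather than a
    # generic wikikube name.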
[13:13:12] those didn't recover, and we were wondering with jaime if it could be a monitoring artifact
[13:13:16] okay on my side
[13:13:18] I updated https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126988 to reduce parallelism also for consumer-cloudelastic
[13:13:23] he said checking the host via ssh looked ok
[13:13:34] volans: and same via mysql prompt
[13:13:41] volans: "show processlist" is still quite large
[13:14:03] https://trace.wikimedia.org/search?end=1741785228174000&limit=20&lookback=1h&maxDuration&minDuration&service=mediawiki&start=1741781628174000&tags=%7B%22db.namespace%22%3A%22commonswiki%22%2C%22k8s.namespace.name%22%3A%22mw-jobrunner%22%2C%22server.address%22%3A%22es2040%22%7D
[13:14:05] I get 50 sleeping connections on es2037
[13:14:05] 3 hits here
[13:14:14] volans: yeah, there are no long-running queries
[13:14:32] and there have been no queries being killed
[13:14:52] volans AFAIK parallelism=2 was the default for a while, we bumped it to 3 a couple of weeks ago to keep up with backfills/added write pressure on ES.
[13:15:02] gmodena: ack, sgtm
[13:15:08] if Amir1 is ok with reducing parallelism and re-enabling SUP, that sounds good enough to me. brouberol can you make that happen?
[13:15:20] marostegui: can you explain https://grafana.wikimedia.org/goto/r5k7gN2NR?orgId=1 ?
[13:15:24] Amir1: I will roll back my changes and run scap to bring us at least to the previous state
[13:15:30] in terms of whether they match the host status or not
[13:15:35] volans: not really
[13:15:55] unless it is a very fast query that we don't get to see
[13:16:02] let me enable the general log for now
[13:16:06] so much query very trace
[13:16:27] I was afraid to ask :)
[13:16:39] general log won't help much, they all go through the sql storage abstraction
[13:16:43] if we could enable something for a short amount of time to verify, either performance schema or the log
[13:16:47] I want to know what job is doing the calls
[13:16:57] gehel: on it. I've +1/+2ed the patch and will deploy
[13:17:06] brouberol are you deploying the patch?
[13:17:10] gmodena: any order I should follow? Aka deploy eqiad or codfw first?
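(Aside: a sketch of the short-lived general log capture discussed above, assuming direct access to the mysql prompt on the host; the log file path is an assumption. The general log is expensive, so the capture window is kept to a few seconds.)

    # Sketch: capture ~10 seconds of incoming queries on es2037, then switch it back off
    mysql -h es2037.codfw.wmnet -e "SET GLOBAL log_output='FILE'; SET GLOBAL general_log_file='/srv/tmp/general-sample.log'; SET GLOBAL general_log=ON;"
    sleep 10
    mysql -h es2037.codfw.wmnet -e "SET GLOBAL general_log=OFF;"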
[13:17:10] oops
[13:17:12] brouberol ack
[13:17:23] if you enable it, you'll get just something like "this row was selected"
[13:17:35] usually I do the secondary DC first, and the primary second
[13:17:42] ^ brouberol
[13:18:25] marostegui: at least the network graph from the generic host-overview seems to back the theory that something is still ongoing: https://grafana.wikimedia.org/goto/nbUdRN2NR?orgId=1
[13:18:36] gmodena ack
[13:19:24] deploying to eqiad
[13:19:28] Amir1: What I want to see is if there are many rows being selected by a single query
[13:19:28] not cpu though
[13:19:52] marostegui: I don't think that'll be given, it'll be just one row
[13:20:06] in mw, they all go through the blob store abstraction
[13:20:12] Amir1: Slow query logs tell you that
[13:20:33] onto codfw
[13:20:35] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=codfw%20prometheus%2Fk8s&var-job=categoryMembershipChange&from=now-24h&to=now
[13:20:40] this might be related
[13:21:07] gmodena gehel: SUP has been deployed with replicas: 2 in both DCs
[13:21:39] volans: yeah, the network graph and even the operations one say whatever is happening is still ongoing
[13:22:42] The Handler_read_rnd_next is insane, so there is something doing some sort of table scan
[13:22:58] https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2037&var-port=9104&from=1741763693093&to=1741785293093&viewPanel=3
[13:23:00] marostegui: the thing is that in mw nothing can access ES hosts unless they use blob store, and that class takes away anything important we could extract from the query itself, and at least should do one query at a time
[13:23:04] brouberol gehel the jobs are starting to catch up
[13:23:38] I'm going to reduce the concurrency of the category membership change job
[13:23:53] also read_key went up 20x
[13:24:04] Amir1: Yeah, but why would something start doing table scans
[13:24:07] If the query hasn't changed
[13:24:39] I think (not sure) it is that they query a lot
[13:25:26] to clarify/exclude any doubt I would enable the logging even just for a few seconds to see what's in there
[13:25:33] or the performance schema stuff
[13:25:33] volans: I did already
[13:25:36] ah sorry
[13:25:38] which host?
[13:25:48] volans: es2037
[13:26:06] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127000
[13:28:23] did that change today?
[13:28:54] Amir1: +1'd, fire away
[13:29:17] let's see if it helps, if not, we revert
[13:29:31] marostegui: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&from=now-2d&to=now&var-dc=codfw%20prometheus%2Fk8s&var-job=categoryMembershipChange&viewPanel=74
[13:29:36] something is going on with this job
[13:29:51] let's go for that patch
[13:31:16] Amir1: would that job also run against RO hosts?
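(Aside: a sketch of how the Handler_read_* counters discussed above can be sampled directly on the host, outside of Grafana; taking two samples a minute apart shows which counter is actually growing. Hostname as above.)

    # Sketch: two samples of the read handler counters on es2037
    mysql -h es2037.codfw.wmnet -e "SHOW GLOBAL STATUS LIKE 'Handler_read%';"
    sleep 60
    mysql -h es2037.codfw.wmnet -e "SHOW GLOBAL STATUS LIKE 'Handler_read%';"
    # A Handler_read_rnd_next delta that dwarfs Handler_read_key points at full
    # table scans; the reverse points at index lookups.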
[13:31:26] yeah, it's a REALLY messed up job
[13:31:30] Because there's also a big spike on RO hosts but it went sort of back to normal
[13:31:34] like: https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es1029&var-port=9104&from=1741763693093&to=1741785293093
[13:31:42] it parses every revision after a certain time
[13:34:58] guys, I really need to run scap to make sure that the next scap won't be a flop
[13:35:12] in theory, it will leave things as they are
[13:35:40] I'm almost done with my deploy but it's also not scap so it shouldn't affect you
[13:35:48] as I have reverted the changes I intended to make a few fires ago
[13:38:17] marostegui: it might be too soon but https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2040&var-port=9104&from=now-1h&to=now&viewPanel=37
[13:38:56] Amir1: What worries me, if that fixes the issue, is what has changed to have this effect
[13:39:05] Amir1: can I go ahead?
[13:39:12] effie: sure
[13:39:19] marostegui: template change, etc.
[13:39:35] this can trigger a flood of reparses
[13:39:53] Amir1: But then every time the solution would be to decrease the job?
[13:39:55] actually, these all get recorded in the rc table if something actually changes
[13:40:06] marostegui: did you find any smoking gun in the query log? I had a look and didn't find much so far
[13:40:09] oh no, the proper solution is to rewrite the job from the ground up
[13:40:34] volans: nothing
[13:40:45] Amir1: if we are in a stable state, even if not idea, please take a break now
[13:40:49] ideal*
[13:41:18] I'll eat soon
[13:42:08] k
[13:42:19] Amir1: The effect was momentary, it all came back to the previous value
[13:42:58] yeah the job concurrency hasn't budged either https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&from=now-3h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-job=categoryMembershipChange&viewPanel=74
[13:43:20] that should go down
[13:44:16] also these jobs have been running for a long time, it'll take some time
[13:46:04] I am stepping out for now, volans will be IC till he takes a break, thank you all very much!
[13:46:18] Amir1: it seems to be going down now
[13:46:26] take some rest effie
[13:46:39] Yes, take some rest effie <3
[13:46:51] marostegui: awesome \o/ I'll wait a bit
[13:47:55] graph for es2037 improving a lot
[13:48:13] yes, es2040 too
[13:48:33] just the concurrency change?
[13:48:36] of the job
[13:48:40] looks so
[13:48:44] but that's scary
[13:48:51] cause it went up without any other change related to it
[13:48:58] indeed and also unclear why it triggers the read_rnd_next
[13:49:09] so what if we need to keep decreasing until it goes to 0?
[13:49:49] My biggest suspicion is that a template in commons has changed and triggered a lot of reparses adding or removing a lot of categories
[13:50:07] but pizza first
[13:50:30] ok to resolve the status page?
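(Aside: a sketch of how the "template change triggering a flood of category updates" suspicion above could be checked against the recentchanges table, since categorization events are recorded there when a page's categories actually change. rc_type = 6 corresponds to RC_CATEGORIZE in current MediaWiki core, and the replica host is a placeholder, not a real hostname.)

    # Sketch: hourly count of categorization rows on commonswiki over the last day
    mysql -h <commonswiki-replica> commonswiki -e "
      SELECT LEFT(rc_timestamp, 10) AS hour, COUNT(*) AS categorize_rows
      FROM recentchanges
      WHERE rc_type = 6
        AND rc_timestamp >= DATE_FORMAT(NOW() - INTERVAL 1 DAY, '%Y%m%d%H%i%s')
      GROUP BY hour ORDER BY hour;"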
[13:50:30] especially the type with ketchup, corn, sausage
[13:50:35] volans: I think so
[13:50:41] +1
[13:50:54] Amir1: forgot the pineapple
[13:51:43] <_joe_> claime: I'll film myself eating a Greek crepe if you collaborate
[13:51:57] This is a hostlie work environment
[13:52:00] hostile*
[13:52:45] claime: I'm talking about Iranian pizza https://tb-static.uber.com/prod/image-proc/processed_images/65e1d72877f54ce5b54cd0abc5e09fa7/b4facf495c22df52f3ca635379ebe613.jpeg
[13:53:09] Although I'd actually pay to see _joe_ eat Greek fusion food
[13:53:21] <_joe_> Amir1: it's just too much, tbh
[13:53:22] Because that may be hurting him more than me
[13:53:48] <_joe_> claime: a white crepe with hot dog, fries, mayo and ketchup ?
[13:53:58] lol
[13:54:28] Now I'm craving Pide and there's none around
[13:54:29] gg
[13:54:38] status page updated
[13:57:56] I was going to propose it could be some infinite replication, but that host doesn't replicate from anything
[13:58:49] oh, it got fixed, what was it?
[13:58:58] can we rule out that cirrus was the cause?
[13:59:01] categoryMembershipChange job
[13:59:05] reduced concurrency
[13:59:15] volans: es2037 or the big one?
[13:59:19] dcausse: not really, I think it was concurrency
[13:59:36] <_joe_> https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=dbstore&var-group=core&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-shard=es5&var-shard=es6&var-shard=es7&var-role=All&from=now-6h&to=now&viewPanel=8
[13:59:40] <_joe_> it was both
[13:59:42] concurrency of the cirrus job?
[13:59:43] jynus: the es overload: the job; the general issue: the index for search
[13:59:52] *concurrency of
[14:00:06] volans: thanks
[14:00:14] that was eqiad only?
[14:00:17] <_joe_> it's possible this was just a perfect storm
[14:00:58] <_joe_> dcausse: no, codfw only, and in codfw stopping cirrus didn't solve the issue
[14:01:22] <_joe_> https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=codfw&var-group=core&var-group=dbstore&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-shard=es5&var-shard=es6&var-shard=es7&var-role=All&from=now-6h&to=now&viewPanel=8
[14:01:40] <_joe_> so I would say probably both things were putting pressure on the dbs
[14:01:47] dcausse: both but with different graphs
[14:02:06] *patterns
[14:02:47] hi! are we good to start with the mediawiki deploy window?
[14:02:56] in eqiad we've seen a pause of the issue between 12:42 and 13:18 while in codfw it was continuous
[14:03:17] volans: do I "resolve" de gdoc, too?
[14:03:20] *the
[14:03:30] the gdoc needs a lot of work
[14:03:39] I care mostly about the ongoing
[14:03:39] but yes the status sure
[14:04:04] the rest can be done later
[14:04:33] tgr_: I think we're out of the woods, at least that seems the case from the latest graphs, what do the others think?
[14:04:35] but I made the mistake of leaving it ongoing for a long time, potentially alarming some people (I think it is a bit redundant now with the status page)
[14:04:57] Unrelated to the outage: https://linuxiac.com/postfix-3-10-mta-arrives-with-openssl-3-5-support/
[14:04:59] fyi the parallelism decrease from 3 to 2 should have had a very minimal impact on mw-api-int queries; it's controlled by a separate pool (the increase was mainly done to address a bottleneck writing to elastic)
[14:05:41] cirrus query rate to mw-api-int@codfw: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-kubernetes_namespace=cirrus-streaming-updater&var-app=All&var-destination=mw-api-int-async-ro&from=now-6h&to=now
[14:06:10] dcausse: mmmh how would you explain the resolution once we reduced the concurrency?
[14:06:33] volans: first it was paused I think?
[14:06:40] or you think that resuming at 3 would have been the same?
[14:06:48] yes, it was paused
[14:07:32] volans: I think so, but I mean if we want to reduce the rate of mw-api-int requests from this job we need to tweak something else
[14:07:58] ack, got it
[14:08:02] thx for the clarification