[07:46:59] inflatador: when you get a chance https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126486?usp=dashboard
[08:26:00] arturo: m5 master has been switched over
[08:32:48] marostegui: thanks!
[10:55:42] this is parsoid going crazy overloading content dbs
[10:55:48] marostegui: Amir1^
[10:55:49] yeah it seems es hosts are overloaded
[10:55:57] es?
[10:56:02] scrapers?
[10:56:08] yep, cluster 3, cluster26
[10:56:13] cluster27
[10:56:15] etc.
[10:56:30] but there is also a deploy in progress FYI
[10:57:27] page recovered
[10:57:35] metrics seem to be recovering, at least memcached and parsoid
[10:57:35] http latency increased since 10:47
[10:57:52] scap stopped
[10:57:55] effie: deploy failed, right?
[10:58:17] yes it did
[10:58:29] backend response time still high but might be starting the recovery
[10:58:40] was everything rolled back or do we need to run some cleanup, effie?
[10:58:42] we are recovering
[10:58:47] yeah, I don't see full recovery but it could be metrics latency
[10:59:29] do we have evidence that it was deploy-related or should I dig into incoming traffic?
[10:59:30] elukey: most likely no
[10:59:41] volans: I will first go through the deployment
[10:59:46] and get back to you
[10:59:58] ack, standing by
[11:00:02] <_joe_> volans: what did you see?
[11:00:11] volans: I checked a sample and didn't see anything obvious, but who knows
[11:00:31] so far nothing, hence my asking; if we don't have evidence of anything deploy-related I will dig more to see if I can find anything
[11:00:35] <_joe_> volans: the deployment was to move requests for parsoid all to php 8.1
[11:00:37] but if we think it was the deploy no point :)
[11:01:10] parsoid, api-int, jobrunner
[11:01:14] <_joe_> volans: no I mean what were the things you saw... if the issue was saturation of php workers on 8.1, then it's probably that
[11:01:19] <_joe_> otherwise I doubt it
[11:01:25] what was the previous pct?
[11:01:27] <_joe_> (that == the deployment)
[11:01:37] It was 50%
[11:01:37] <_joe_> 50%
[11:01:52] <_joe_> if you saw a spike of load in ES, that shouldn't be this deployment
[11:02:03] <_joe_> it would look like something caused a spike of parses
[11:02:05] I wonder if it's some maxUnavailable shenanigans
[11:02:13] failed probes for parsoid, db-related errors
[11:02:14] overload started 10:47
[11:02:15] <_joe_> claime: causing db load?
[11:02:17] a ton of errors in logstash are like "Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server"
[11:02:21] we did increase maxUnavailable recently, right?
[11:02:25] that's the es
[11:02:28] https://logstash.wikimedia.org/goto/ded4cc29c3349c6dd0ecc7fe1bf748d5
[11:02:30] <_joe_> people, it's not the deployment
[11:02:37] ok
[11:02:41] ack
[11:02:49] <_joe_> if anything, increasing maxUnavailable would reduce the db load
[11:03:06] kamila_: only for -migration, main remained at default
[11:03:16] oh, ack, thanks claime
[11:03:32] ok so far, here is what the deployment was supposed to do
[11:03:58] example of external storage load https://grafana.wikimedia.org/goto/kl3BUShNR?orgId=1
[11:04:00] switch all main deployments for api-int, jobrunners and parsoid to php81.
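(Aside: as a rough illustration of the maxUnavailable question above, the rollout strategy actually in effect can be read off the live Deployment objects. The namespace name is taken from the conversation; the kubeconfig/context and deployment names are assumptions and may differ.)

    # Sketch: list maxUnavailable for the deployments in a given namespace
    # (run with whatever kubeconfig/context points at the cluster in question)
    kubectl -n mw-api-int get deploy \
      -o custom-columns='NAME:.metadata.name,MAXUNAVAILABLE:.spec.strategy.rollingUpdate.maxUnavailable'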
[11:04:23] connections went x10
[11:04:24] <_joe_> the slow queries dashboard is also saying there are a lot of slow queries in s4
[11:04:42] <_joe_> so I guess something scanning commons *AGAIN*
[11:04:43] and from the mediawiki side, there were db errors
[11:05:04] [{reqId}] {exception_url} Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server
[11:05:14] <_joe_> volans: look for uncached traffic to commonswiki
[11:05:18] k
[11:05:22] <_joe_> maybe api.php even
[11:05:39] volans: are you looking for the traffic part?
[11:05:48] commons.wikimedia.org not, checking api
[11:05:49] yes
[11:09:09] jynus: can you please check the DB's state? I would like to understand if I can run scap again or not
[11:09:39] checking
[11:10:09] effie: things are still weird
[11:10:22] alright
[11:10:34] https://grafana.wikimedia.org/goto/yLDkwShNR?orgId=1
[11:10:40] the incident is ongoing
[11:10:58] ok, I can do IC
[11:11:07] we have 3x to 4x the number of normal connections
[11:11:13] anyone who is working on this, can you please share what you are doing atm
[11:12:17] * volans checking traffic
[11:12:28] <_joe_> I'm looking at the slow queries dashboard now
[11:12:50] How can I help?
[11:13:23] marostegui: if jynus is on the DB side and does not need any more assistance, we are alright I believe
[11:13:38] ok, I will stand by, jynus if you need more eyes let me know
[11:14:00] https://docs.google.com/document/d/1dO3TQVl7vSUV-4YPkEQhAwTUa4mZnDEypOTkTdEqZl0/edit?tab=t.0
[11:14:10] marostegui: one question, is es2024 down?
[11:14:30] shouldn't be
[11:14:31] checking
[11:15:02] jynus: that host doesn't exist
[11:15:18] ok, then unrelated, it still appeared as uncontactable on the board
[11:15:19] volans, jynus, we are moving this to _private for now
[11:15:23] k
[11:15:24] not important now
[11:15:50] jynus: it wasn't removed from zarcillo, doing it now
[11:16:15] no worries, I only asked because I saw it while investigating the issue
[11:16:15] done
[11:36:54] I just got an error on some wdqs instances on puppet executions: Error: Could not send report: Error 500 on SERVER: Server Error: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null
[11:36:58] is this a known issue?
[11:37:53] I think filippo had it like yesterday, I don't know more
[11:42:03] thx
[12:29:51] vgutierrez: yes I filed T388629
[12:29:52] T388629: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629
[13:00:14] gehel: brouberol, gmodena, Amir1, volans let's continue the chat here
[13:00:29] SUP is paused for now, and that seems to fix the overload, but we're not ingesting any new changes into the search indices. Not something we want to keep.
[13:01:34] It seems that we've recently increased the parallelism in the search update pipeline, which could have increased the load on the action API. We can decrease the parallelism again and re-enable the SUP. Objections?
[13:01:35] can we tweak the rate of it? what's the minimum rate acceptable for the indexes?
[13:01:58] (to have the indexes done in a reasonable amount of time)
[13:03:11] The general SLO is 10 minutes (we want edits to be available as search results in < 10 minutes). But the issue here is likely more related to the saneitizer, which is the component that goes over all pages every 8 weeks to catch up on updates that were missed.
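(Aside: a rough sketch of the "3x to 4x the number of normal connections" check mentioned above, assuming direct access to the mysql client against the overloaded external storage host; the FQDN is an assumption based on the host name in the log.)

    # Sketch: current vs historical-max connection counts on an ES host
    mysql -h es2040.codfw.wmnet -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected','Max_used_connections');"
    # Sketch: where the connections are coming from, grouped by user/client/command
    mysql -h es2040.codfw.wmnet -e "
      SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, command, COUNT(*) AS n
      FROM information_schema.processlist
      GROUP BY user, client, command
      ORDER BY n DESC LIMIT 20;"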
[13:03:15] it would be great to have commons and wikidata have their own "low prio" lane
[13:03:23] gehel: can someone write on the status doc, under the What Happened section, where we are now and what happened?
[13:03:29] they are each 100M pages
[13:03:43] effie ack
[13:03:49] gmodena: cheers thanks
[13:04:07] gmodena: no finger pointing, it is for the report afterwards
[13:04:08] This patch reduces parallelism for the SUP consumer-search job https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126988
[13:04:11] Amir1: that sounds like a good idea. Let's talk with dcausse when he is back.
[13:04:15] effie no worries :)
[13:04:59] Amir1: I have a pending deployment that is blocking the train, i.e., unless I go on with it, the train can't run
[13:05:02] cloudelastic parallelism was also recently increased, I'm trying to figure out whether that could have added pressure
[13:05:13] and if it's safe to scale down
[13:05:21] Amir1: shall I go on with it, DB wise?
[13:05:50] I can't say if 2 is low enough but surely looks better than 3 :D
[13:05:57] effie: sure!
[13:06:41] mw-api-int canaries are whining again
[13:07:39] although it may just be because there are so few of them they get easily saturated
[13:08:12] claime: can you tell me which cluster "10.194.169.43:47532" belongs to?
[13:08:28] (or tell me how to extract it)
[13:08:33] mw-jobrunner
[13:08:36] Amir1: dig -x
[13:09:14] oh nice, last time I tried dig -x, it gave some general wikikube address
[13:09:28] that's been fixed for a while iirc
[13:09:45] I think c.danis worked on making the pod reverse DNS more informative a while back
[13:09:46] *pods
[13:09:52] That's awesome <3
[13:10:12] topranks as well :) T376291
[13:10:13] T376291: Authdns: automate reverse DNS zone delegation for k8s pod IP ranges - https://phabricator.wikimedia.org/T376291
[13:10:16] Of course it would be Chris, who else is DNS crazy
[13:10:52] beside the point, the spot checks show that the current load on ES is from jobrunners
[13:10:57] now time to dig jobs
[13:11:17] I did https://trace.wikimedia.org
[13:11:23] with "server.address="es2040" db.namespace="commonswiki""
[13:11:26] but nothing of use
[13:11:27] Amir1: again?
[13:11:55] effie: maybe it's unrelated but es2040 is quite under duress https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&var-job=All&var-server=es2040&var-port=9104&from=now-3h&to=now
[13:12:59] Amir1: fwiw es2037 too, it's the one I was looking at before
[13:13:02] are we ok enough to merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126988 and re-enable SUP? Should we wait for something?
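(Aside: an illustration of the dig -x trick mentioned above for mapping a wikikube pod IP back to something meaningful; the exact shape of the PTR record is an assumption.)

    # Sketch: reverse-resolve a pod IP seen in a processlist or trace
    dig -x 10.194.169.43 +short
    # With the delegation from T376291 in place, the answer should identify the
    # pod's namespace (e.g. something containing mw-jobrunner) rather than a
    # generic wikikube name.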
[13:13:12] those didn't recover, and we were wondering with jaime if it could be a monitoring artifact
[13:13:16] okay on my side
[13:13:18] I updated https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126988 to reduce parallelism also for consumer-cloudelastic
[13:13:23] he said checking the host via ssh looked ok
[13:13:34] volans: and same via mysql prompt
[13:13:41] volans: "show processlist" is still quite large
[13:14:03] https://trace.wikimedia.org/search?end=1741785228174000&limit=20&lookback=1h&maxDuration&minDuration&service=mediawiki&start=1741781628174000&tags=%7B%22db.namespace%22%3A%22commonswiki%22%2C%22k8s.namespace.name%22%3A%22mw-jobrunner%22%2C%22server.address%22%3A%22es2040%22%7D
[13:14:05] I get 50 sleeping connections on es2037
[13:14:05] 3 hits here
[13:14:14] volans: yeah, there are no long-running queries
[13:14:32] and there have been no queries being killed
[13:14:52] volans AFAIK parallelism=2 was the default for a while, we bumped it to 3 a couple of weeks ago to keep up with backfills/added write pressure on ES.
[13:15:02] gmodena: ack, sgtm
[13:15:08] if Amir1 is ok with reducing parallelism and re-enabling SUP, that sounds good enough to me. brouberol can you make that happen?
[13:15:20] marostegui: can you explain https://grafana.wikimedia.org/goto/r5k7gN2NR?orgId=1 ?
[13:15:24] Amir1: I will roll back my changes and run scap to bring us at least to the previous state
[13:15:30] in terms of whether they match the host status or not
[13:15:35] volans: not really
[13:15:55] unless it is a very fast query that we don't get to see
[13:16:02] let me enable the general log for now
[13:16:06] so much query very trace
[13:16:27] I was afraid to ask :)
[13:16:39] general log won't help much, they all go through the sql storage abstraction
[13:16:43] if we could enable something for a short amount of time to verify, either performance schema or the log
[13:16:47] I want to know what job is doing the calls
[13:16:57] gehel: on it. I've +1/+2ed the patch and will deploy
[13:17:06] brouberol are you deploying the patch?
[13:17:10] gmodena: any order I should follow? Aka deploy eqiad or codfw first?
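(Aside: a sketch of the short-lived general log capture discussed above, assuming direct access to the mysql prompt on the host; the log file path is an assumption. The general log is expensive, so the capture window is kept to a few seconds.)

    # Sketch: capture ~10 seconds of incoming queries on es2037, then switch it back off
    mysql -h es2037.codfw.wmnet -e "SET GLOBAL log_output='FILE'; SET GLOBAL general_log_file='/srv/tmp/general-sample.log'; SET GLOBAL general_log=ON;"
    sleep 10
    mysql -h es2037.codfw.wmnet -e "SET GLOBAL general_log=OFF;"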
[13:17:10] oops
[13:17:12] brouberol ack
[13:17:23] if you enable it, you'll get just something like "this row was selected"
[13:17:35] usually I do the secondary DC first, and the primary second
[13:17:42] ^ brouberol
[13:18:25] marostegui: at least the network graph from the generic host-overview seems to back the theory that something is still ongoing: https://grafana.wikimedia.org/goto/nbUdRN2NR?orgId=1
[13:18:36] gmodena ack
[13:19:24] deploying to eqiad
[13:19:28] Amir1: What I want to see is if there are many rows being selected by a single query
[13:19:28] not cpu though
[13:19:52] marostegui: I don't think that'll be given, it'll be just one row
[13:20:06] in mw, they all go through the blob store abstraction
[13:20:12] Amir1: Slow query logs tell you that
[13:20:33] onto codfw
[13:20:35] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&var-dc=codfw%20prometheus%2Fk8s&var-job=categoryMembershipChange&from=now-24h&to=now
[13:20:40] this might be related
[13:21:07] gmodena gehel: SUP has been deployed with replicas: 2 in both DCs
[13:21:39] volans: yeah, the network graph and even the operations one say whatever is happening is still ongoing
[13:22:42] The Handler_read_rnd_next is insane, so there is something doing some sort of table scan
[13:22:58] https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2037&var-port=9104&from=1741763693093&to=1741785293093&viewPanel=3
[13:23:00] marostegui: the thing is that in mw nothing can access ES hosts unless they use blob store, and that class takes away anything important we could extract from the query itself, and at least should do one query at a time
[13:23:04] brouberol gehel the jobs are starting to catch up
[13:23:38] I'm going to reduce the concurrency of the category membership change job
[13:23:53] also read_key went up 20x
[13:24:04] Amir1: Yeah, but why would something start doing table scans
[13:24:07] If the query hasn't changed
[13:24:39] I think (not sure) it is that they query a lot
[13:25:26] to clarify/exclude any doubt I would enable the logging even just for a few seconds to see what's in there
[13:25:33] or the performance schema stuff
[13:25:33] volans: I did already
[13:25:36] ah sorry
[13:25:38] which host?
[13:25:48] volans: es2037
[13:26:06] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127000
[13:28:23] did that change today?
[13:28:54] Amir1: +1'd, fire away
[13:29:17] let's see if it helps, if not, we revert
[13:29:31] marostegui: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&from=now-2d&to=now&var-dc=codfw%20prometheus%2Fk8s&var-job=categoryMembershipChange&viewPanel=74
[13:29:36] something is going on with this job
[13:29:51] let's go for that patch
[13:31:16] Amir1: would that job also run against RO hosts?
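(Aside: a sketch of how the Handler_read_* counters discussed above can be sampled directly on the host, outside of Grafana; taking two samples a minute apart shows which counter is actually growing. Hostname as above.)

    # Sketch: two samples of the read handler counters on es2037
    mysql -h es2037.codfw.wmnet -e "SHOW GLOBAL STATUS LIKE 'Handler_read%';"
    sleep 60
    mysql -h es2037.codfw.wmnet -e "SHOW GLOBAL STATUS LIKE 'Handler_read%';"
    # A Handler_read_rnd_next delta that dwarfs Handler_read_key points at full
    # table scans; the reverse points at index lookups.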
[13:31:26] yeah, it's a REALLY messed up job
[13:31:30] Because there's also a big spike on RO hosts but it went sort of back to normal
[13:31:34] like: https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es1029&var-port=9104&from=1741763693093&to=1741785293093
[13:31:42] it parses every revision after a certain time
[13:34:58] guys, I really need to run scap to make sure that the next scap won't be a flop
[13:35:12] in theory, it will leave things as they are
[13:35:40] I'm almost done with my deploy but it's also not scap so it shouldn't affect you
[13:35:48] as I have reverted the changes I intended to make a few fires ago
[13:38:17] marostegui: it might be too soon but https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2040&var-port=9104&from=now-1h&to=now&viewPanel=37
[13:38:56] Amir1: What worries me, if that fixes the issue, is what has changed to have this effect
[13:39:05] Amir1: can I go ahead?
[13:39:12] effie: sure
[13:39:19] marostegui: template change, etc.
[13:39:35] this can trigger a flood of reparses
[13:39:53] Amir1: But then every time the solution would be to decrease the job?
[13:39:55] actually, these all get recorded in the rc table if something actually changes
[13:40:06] marostegui: did you find any smoking gun in the query log? I had a look and didn't find much so far
[13:40:09] oh no, the proper solution is to rewrite the job from the ground up
[13:40:34] volans: nothing
[13:40:45] Amir1: if we are in a stable state, even if not idea, please take a break now
[13:40:49] ideal*
[13:41:18] I'll eat soon
[13:42:08] k
[13:42:19] Amir1: The effect was momentary, it all came back to the previous value
[13:42:58] yeah the job concurrency hasn't budged either https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&refresh=5m&from=now-3h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-job=categoryMembershipChange&viewPanel=74
[13:43:20] that should go down
[13:44:16] also these jobs have been running for a long time, it'll take some time
[13:46:04] I am stepping out for now, volans will be IC till he takes a break, thank you all very much!
[13:46:18] Amir1: it seems to be going down now
[13:46:26] take some rest effie
[13:46:39] Yes, take some rest effie <3
[13:46:51] marostegui: awesome \o/ I'll wait a bit
[13:47:55] graph for es2037 improving a lot
[13:48:13] yes, es2040 too
[13:48:33] just the concurrency change?
[13:48:36] of the job
[13:48:40] looks so
[13:48:44] but that's scary
[13:48:51] cause it went up without any other change related to it
[13:48:58] indeed and also unclear why it triggers the read_rnd_next
[13:49:09] so what if we need to keep decreasing until it goes to 0?
[13:49:49] My biggest suspicion is that a template in commons has changed and triggered a lot of reparses adding or removing a lot of categories
[13:50:07] but pizza first
[13:50:30] ok to resolve the status page?
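(Aside: a sketch of how the "template change triggering a flood of category updates" suspicion above could be checked against the recentchanges table, since categorization events are recorded there when a page's categories actually change. rc_type = 6 corresponds to RC_CATEGORIZE in current MediaWiki core, and the replica host is a placeholder, not a real hostname.)

    # Sketch: hourly count of categorization rows on commonswiki over the last day
    mysql -h <commonswiki-replica> commonswiki -e "
      SELECT LEFT(rc_timestamp, 10) AS hour, COUNT(*) AS categorize_rows
      FROM recentchanges
      WHERE rc_type = 6
        AND rc_timestamp >= DATE_FORMAT(NOW() - INTERVAL 1 DAY, '%Y%m%d%H%i%s')
      GROUP BY hour ORDER BY hour;"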
[13:50:30] especially the type with ketchup, corn, sausage
[13:50:35] volans: I think so
[13:50:41] +1
[13:50:54] Amir1: forgot the pineapple
[13:51:43] <_joe_> claime: I'll film myself eating a Greek crepe if you collaborate
[13:51:57] This is a hostlie work environment
[13:52:00] hostile*
[13:52:45] claime: I'm talking about Iranian pizza https://tb-static.uber.com/prod/image-proc/processed_images/65e1d72877f54ce5b54cd0abc5e09fa7/b4facf495c22df52f3ca635379ebe613.jpeg
[13:53:09] Although I'd actually pay to see _joe_ eat Greek fusion food
[13:53:21] <_joe_> Amir1: it's just too much, tbh
[13:53:22] Because that may be hurting him more than me
[13:53:48] <_joe_> claime: a white crepe with hot dog, fries, mayo and ketchup ?
[13:53:58] lol
[13:54:28] Now I'm craving Pide and there's none around
[13:54:29] gg
[13:54:38] status page updated
[13:57:56] I was going to propose it could be some infinite replication, but that host doesn't replicate from anything
[13:58:49] oh, it got fixed, what was it?
[13:58:58] can we rule out that cirrus was the cause?
[13:59:01] categoryMembershipChange job
[13:59:05] reduced concurrency
[13:59:15] volans: es2037 or the big one?
[13:59:19] dcausse: not really, I think it was concurrency
[13:59:36] <_joe_> https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=dbstore&var-group=core&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-shard=es5&var-shard=es6&var-shard=es7&var-role=All&from=now-6h&to=now&viewPanel=8
[13:59:40] <_joe_> it was both
[13:59:42] concurrency of the cirrus job?
[13:59:43] jynus: the es overload: the job; the general issue: the index for search
[13:59:52] *concurrency of
[14:00:06] volans: thanks
[14:00:14] that was eqiad only?
[14:00:17] <_joe_> it's possible this was just a perfect storm
[14:00:58] <_joe_> dcausse: no, codfw only, and in codfw stopping cirrus didn't solve the issue
[14:01:22] <_joe_> https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=codfw&var-group=core&var-group=dbstore&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-shard=es5&var-shard=es6&var-shard=es7&var-role=All&from=now-6h&to=now&viewPanel=8
[14:01:40] <_joe_> so I would say probably both things were putting pressure on the dbs
[14:01:47] dcausse: both but with different graphs
[14:02:06] *patterns
[14:02:47] hi! are we good to start with the mediawiki deploy window?
[14:02:56] in eqiad we've seen a pause of the issue between 12:42 and 13:18 while in codfw it was continuous
[14:03:17] volans: do I "resolve" de gdoc, too?
[14:03:20] *the
[14:03:30] the gdoc needs a lot of work
[14:03:39] I care mostly about the ongoing
[14:03:39] but yes the status sure
[14:04:04] the rest can be done later
[14:04:33] tgr_: I think we're out of the woods, at least that seems the case from the latest graphs, what do the others think?
[14:04:35] but I made the mistake of leaving it ongoing for a long time, potentially alarming some people (I think it is a bit redundant now with the status page)
[14:04:57] Unrelated to the outage: https://linuxiac.com/postfix-3-10-mta-arrives-with-openssl-3-5-support/
[14:04:59] fyi the parallelism decrease from 3 to 2 should have had a very minimal impact on mw-api-int queries; it's controlled by a separate pool (the increase was mainly done to address a bottleneck writing to elastic)
[14:05:41] cirrus query rate to mw-api-int@codfw: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-kubernetes_namespace=cirrus-streaming-updater&var-app=All&var-destination=mw-api-int-async-ro&from=now-6h&to=now
[14:06:10] dcausse: mmmh how would you explain the resolution once we reduced the concurrency?
[14:06:33] volans: first it was paused I think?
[14:06:40] or you think that resuming at 3 would have been the same?
[14:06:48] yes, it was paused
[14:07:32] volans: I think so, but I mean if we want to reduce the rate of mw-api-int requests from this job we need to tweak something else
[14:07:58] ack, got it
[14:08:02] thx for the clarification