[08:13:11] hello all, just FYI, i'm going to apply a change to the changeprop service on eqiad and codfw clusters to support new changes from revertrisk-multilingual service (ml side) [09:06:19] thanks [10:17:59] I am looking at the ats thing [10:19:57] so it looks like there are DB overload exceptions since ~10 UTC [10:19:57] https://logstash.wikimedia.org/app/dashboards#/view/22708f70-bfc5-11f0-a09e-770caf2ecc5d?… [10:20:00] sorry [10:20:19] mediawiki exceptions in https://logstash.wikimedia.org/goto/85410d0d0dbd5aaa29f17d46e0d8eba8 [10:21:15] headsup: 
I'll reboot cumin2002 in 15 minutes for a kernel update [10:21:32] * volans starts a very long cookbook there on purpose [10:25:43] cont on _security [12:20:51] I have a highly visible UBN that needs backporting (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1266984) but it looks like there is an ongoing DB incident [12:21:36] moritzm, topranks, effie: ^ [12:22:00] edsanders: I am afraid it will have to wait [12:23:34] For context - it's showing an experimental feature to all editors in production - can you let me know as soon as deployments are possible? [12:23:55] edsanders: yes will do [12:24:05] Thanks [12:54:28] hi folks, just a reminder that we will be repooling codfw at 14:00 utc today [14:22:46] elukey: are you around? we're investigating an issue with the docker registry that may rhyme with the "small" blob issue you ran into early in the apus migration [14:26:37] swfrench-wmf: o/ [14:27:31] elukey: https://phabricator.wikimedia.org/T422166 [14:27:42] thank you :) [14:27:49] pulling together logs now [14:30:58] swfrench-wmf: in my case the small files issue was causing the docker client to completely hang [14:31:38] no progress during the push of some layers, and my understanding was that the rados gw in ceph was stalling due to a race condition with async handling [14:32:00] in this case it seems more like a read-your-write problem, like we had with swift sigh [14:32:14] ah, got it - thanks. for some reason I thought it resulted in the "blob upload unknown" problem we see here [14:32:15] swfrench-wmf: the docker-registry in eqiad is getting a lot of traffic considering that dnsdisc says it's depooled [14:32:21] maybe we need to re-add some time to let eventual consistency catch up? [14:32:25] cdanis: I was just about to ask ... [14:33:23] jasmine_: did this start *before* the repool or after? [14:33:27] cdanis: do you have a link? 
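[Editor's note] The "read-your-write problem" and "let eventual consistency catch up" exchange above can be sketched as a toy model: a client writes to one replica of an eventually-consistent store, then reads from the other replica before replication has caught up. The `LaggedReplica` class, the tick-based lag model, and all names here are illustrative assumptions, not how apus/radosgw actually replicates.

```python
class LaggedReplica:
    """Two key-value stores where writes replicate after `lag` ticks.

    Toy model only: replica `a` is written directly, replica `b`
    receives the change `lag` ticks later.
    """

    def __init__(self, lag):
        self.lag = lag
        self.a = {}          # replica the client wrote to
        self.b = {}          # replica that receives changes late
        self.pending = []    # (due_tick, key, value)
        self.tick = 0

    def write(self, key, value):
        self.a[key] = value
        self.pending.append((self.tick + self.lag, key, value))

    def advance(self, ticks=1):
        self.tick += ticks
        still = []
        for due, key, value in self.pending:
            if due <= self.tick:
                self.b[key] = value   # replication finally lands
            else:
                still.append((due, key, value))
        self.pending = still


store = LaggedReplica(lag=2)
store.write("blob:sha256:abc", b"layer-bytes")
# reading the other replica immediately: the write is not visible yet
missing_before = "blob:sha256:abc" not in store.b
store.advance(2)
present_after = "blob:sha256:abc" in store.b
```

Repointing a client from one replica to the other (as a repool does) creates exactly this window until replication catches up.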
[14:33:38] also no mutating operations against codfw since 12:30 [14:33:50] elukey: just comparing the two datasources of https://grafana.wikimedia.org/d/StcefURWz/docker-registry?orgId=1&from=now-3h&to=now&timezone=utc&var-datasource=000000005&var-instance=$__all [14:34:28] so, I definitely see these failing ops in codfw's docker-registricted journal (e.g., on registry2004) [14:34:33] hmmm ok [14:35:03] cdanis: yeah this is why I was asking, I see some peaks but they are not new (at least, first panel) [14:35:06] swfrench-wmf: seems after - repool concluded at 13:44, error logged at 13:53 [14:35:20] did we touch apus? [14:35:25] and replication is slow? [14:35:39] it is weird since this is the first time that an issue like that occurs, and we had a ton of deployments since the switch [14:35:44] swfrench-wmf: checking on apus [14:36:09] apus is dnsdisc pooled in both DCs [14:36:14] * swfrench-wmf nods [14:36:27] eqiad looks good from moss-be1001's perspective, replication is in sync [14:36:34] (confirming that we only tried deploying after the repool) [14:36:47] so, registry in codfw (active) will be hitting apus in codfw [14:37:01] is there a way to see how behind apus in codfw is? [14:37:27] ah wait you mean that up to now the registry in codfw was pushing to apus in eqiad [14:37:31] okok lemme check [14:38:00] elukey: so, we don't switch the registry during the switchover. 
what I think just happened is that it was hitting eqiad, until the repool [14:38:07] and is now hitting codfw once again [14:38:10] yeah [14:38:48] to keep archives happy: `sudo cephadm shell -- radosgw-admin bucket sync status --bucket=registry-restricted` on moss-be1001 and moss-be2001 [14:38:54] it seems in sync [14:39:34] https://grafana.wikimedia.org/d/f1e5bb5b-7fff-40ec-9cb3-39ed7510e81f/ceph-pools?orgId=1&from=now-6h&to=now&timezone=utc&var-datasource=000000026&var-cluster=apus&var-site=codfw&var-topk=15&refresh=5m [14:39:49] we have bidirectional sync between apus dcs, but we have always assumed one way [14:39:50] not sure why codfw seems to be 200GiB (?) larger than eqiad apus [14:40:57] we also store gitlab's buckets in there, not sure what is their replication scheme [14:41:13] so, the two brute-force things I could imagine trying are: (1) depooling apus codfw again and trying to deploy, (2) doing a full image build in the current configuration [14:42:55] swfrench-wmf: I am reasonably sure that a new full image build will work fine, it has worked fine so far and we got eventual-consistency troubles when we repooled codfw, so I think you are right in saying that there may have been some delay in replication [14:43:03] sadly I cannot prove it now, we don't have history [14:43:13] hopefully in the future we'll have some metrics [14:43:55] 2) is probably ok, 1) may or may not work since I think it is a matter of timings [14:44:01] if you are lucky or unlucky [14:44:09] we never really tested this use case [14:44:13] what do you think? [14:44:35] if someone wants to try a deploy, the two unmerged changes are (I think) 1267062 and 1266985 [14:44:45] (just for SAL / message purposes, would be good to mention both) [14:45:09] elukey: yeah, I think #2 should "fix" the problem and #1 is more just an experiment [14:45:32] Lucas_WMDE: so, those changes _did_ merge and get pulled into /srv/mediawiki-staging, right? 
(i.e., it failed at image build) [14:45:37] I believe so, yes [14:45:45] i.e., if I sync-world, I'll pick them up [14:45:51] yeah I can see at least the config change on deploy1003 [14:46:41] (and the other one in VE too) [14:46:55] Lucas_WMDE: Amir1: if I get a full-rebuild going now, set up to pause after testservers, similar to backport, would you be available to test your changes? [14:47:08] edsanders: ^ [14:47:26] my patch is not really testable [14:47:27] I’m not sure how testable Amir1’s change is at the moment, to be honest, but I might be misunderstanding it [14:47:28] ok [14:47:38] Amir1: ah, yeah - I just read it :) [14:47:43] (hadn't done that yet) [14:47:49] * Lucas_WMDE looks at the task for the other change [14:47:56] interestingly, codfw claims to be caught up [14:48:30] should I just try a normal spiderpig for both? or do you want to do a full-rebuild deploy? [14:48:50] https://phabricator.wikimedia.org/P90238 [14:49:15] I need a break [14:49:27] Amir1: ack, thanks! [14:50:44] swfrench-wmf: I would say try going ahead with it even if nobody can verify the VE backport, tbh [14:51:02] it looks pretty small, it’s JS-only (can’t break anything serverside), it was reviewed on master [14:52:33] Lucas_WMDE: sounds good. if it's alright, I might run this directly myself so I have the option of the full rebuild (not available in spiderpig). [14:52:51] ok, sure [14:52:54] I'll ping edsanders in parallel [14:53:17] cdanis: elukey: anything you want to check before I do so? [14:53:53] Kemayo is taking over from me [14:54:21] ah, thanks! [14:54:22] o/ [14:54:40] great, Kemayo: I'll let you know when it's on testservers [14:54:46] thanks swfrench-wmf! [14:56:50] off we go ... [14:57:10] I'll attempt without a full image build first [14:57:11] swfrench-wmf: green light from me [14:57:16] elukey: thanks! [14:58:06] still fails - full build it is [15:01:25] would it help if we perhaps set loglevel to debug and restarted the docker registry? 
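[Editor's note] The "error resolving upload ... _uploads/.../data" failures discussed in this part of the log fit a split-routing picture: the registry keeps resumable blob-upload session state at an `_uploads/<uuid>/data` path in the backing store, so if one request in the upload lands on a backend that has never seen that path, the upload errors out. The dict "backends" and round-robin router below are assumptions for illustration, not the real s3 storage driver.

```python
import itertools
import uuid


def start_upload(backend, repo):
    """Create an empty resumable-upload session in one backend."""
    uid = str(uuid.uuid4())
    backend[f"{repo}/_uploads/{uid}/data"] = b""
    return uid


def append_chunk(backend, repo, uid, chunk):
    """Append a chunk; fails if this backend never saw the session."""
    path = f"{repo}/_uploads/{uid}/data"
    if path not in backend:
        raise KeyError(f"Path not found: {path}")  # cf. the s3aws error
    backend[path] += chunk


eqiad, codfw = {}, {}
# connections pinned to either DC, alternating per request (toy router)
router = itertools.cycle([codfw, eqiad])

uid = start_upload(next(router), "restricted/mediawiki")   # lands in codfw
try:
    append_chunk(next(router), "restricted/mediawiki", uid, b"ok")  # eqiad!
    failed = False
except KeyError:
    failed = True

# same backend throughout: the upload proceeds normally
append_chunk(codfw, "restricted/mediawiki", uid, b"ok")
```

With consistent routing (or fully caught-up replication) the second call never throws, which is why restarting the registries for a clean set of connections was a reasonable fix.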
[15:02:59] so, I think the "restricted" registry (the one that uses apus as a storage backend) is already running at debug level :) [15:03:34] oh yes makes sense [15:04:26] * swfrench-wmf twiddles thumbs waiting for image rebuild [15:12:15] hmmm ... so, I think this might still fail [15:13:13] `Apr 02 15:11:29 deploy1003 dockerd[1070]: time="2026-04-02T15:11:29.273499198Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"` [15:14:06] ohno [15:15:35] weird [15:15:39] it makes really no sense [15:15:57] Apr 02 15:14:08 registry2005 docker-registry[631]: time="2026-04-02T15:14:08.523901214Z" level=error msg="error resolving upload: s3aws: Path not found: /docker/registry/v2/repositories/restricted/mediawiki-singleversion/_uploads/205e4760-3c31-4751-9e12-4f9dd1ea5a88/data [15:16:09] exactly, yeah [15:17:08] 💔cdanis@registry2005.codfw.wmnet ~ 🕦☕ ss -tr | grep apus [15:17:10] ESTAB 0 0 registry2005.codfw.wmnet:54380 apus.svc.eqiad.wmnet:https [15:17:12] ESTAB 0 0 registry2005.codfw.wmnet:45302 apus.svc.eqiad.wmnet:https [15:18:07] ... [15:18:23] I bet it never re-resolves [15:18:50] I think you might be right [15:19:09] oh my [15:19:27] it gets weirder https://phabricator.wikimedia.org/P90243 [15:19:27] so we need to bounce the docker registry in codfw? 
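[Editor's note] The "I bet it never re-resolves" diagnosis above boils down to comparing the peer addresses of long-lived connections (as shown by `ss -t`) against what the discovery name resolves to now. A minimal sketch of that check, with the resolver injected as a function so it stays testable; the IPs and the `apus.svc` wiring here are made up, and in real use the resolver would be `socket.getaddrinfo`.

```python
def stale_peers(established, name, resolve):
    """Return connections whose peer IP is no longer a DNS answer.

    established: iterable of (local_endpoint, peer_ip) tuples
    resolve: function mapping a name to the set of currently-valid IPs
    """
    current = resolve(name)
    return [(local, peer) for local, peer in established if peer not in current]


# Before the repool, the discovery name pointed at eqiad; the registry
# kept that connection alive after the record moved back to codfw.
conns = [
    ("registry2005:54380", "10.2.2.1"),   # old eqiad VIP (made-up IP)
    ("registry2005:58782", "10.2.1.1"),   # current codfw VIP (made-up IP)
]
fake_dns = lambda name: {"10.2.1.1"}      # post-repool answer, codfw only
stale = stale_peers(conns, "apus.svc", fake_dns)
```

Anything returned by `stale_peers` is a candidate for the kind of restart ("bounce the docker registry") applied here, since the client will only re-resolve when it opens a new connection.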
I'm wondering if there's some inconsistency where _some_ clients are hitting codfw and others are hitting eqiad (within the s3 driver) [15:20:10] ok good the bytes-in-flight were transient [15:20:16] swfrench-wmf: that's exactly what's happening in codfw [15:20:34] on registry2004 it's connected to both lol [15:20:43] alright, once this falls flat on its face, I'll restart the two restricted instances and try again [15:21:02] I am pretty sure these are bugs fixed in version 3.x, that we could use if we didn't need the swift backend [15:21:23] sigh [15:21:26] some day, we can have nice things [15:21:29] nice catch cdanis [15:21:30] elukey: possibly we should take a 'mediawiki service mesh' approach here [15:21:42] this is still better than dealing with swift, so thank you elukey :) [15:22:20] fun fact: this is how every single node.js service used to work here, except there was even more ✨lore [15:22:21] swfrench-wmf: <3 totally unrelated but did you see my proposal in https://phabricator.wikimedia.org/T420978 ? [15:22:48] oh, I had not! but will take a look [15:22:59] cdanis: lol [15:23:08] cdanis: yeah, though IIRC from my tegola experience, proxying an s3 client might get weird at times [15:23:15] hm ok [15:23:37] like I had to patch tegola to make it work, but maybe it was a weird use case [15:23:41] we can definitely try [15:23:53] this is really taking its sweet time to fail ... [15:23:55] I can't remember if that was s3-generic-specific or swift-specific [15:24:05] swfrench-wmf: what's wrong with just bouncing the registries? [15:24:29] there's nothing wrong with doing so, and I am about to press enter :) [15:24:34] 😌 [15:24:35] +1 [15:26:28] swfrench-wmf: my idea is basically to do a very simple rsync-like script to transfer image:tag combinations without any extra endpoints etc., to eventually drop swift. But we'd need to figure out what buckets to use, how to structure the thing in apus basically. 
And if this work is warranted given the fact that we may want to move to Harbor in the future [15:27:38] swfrench@registry2005:~$ ss -tr | grep apus [15:27:38] ESTAB 0 0 registry2005.codfw.wmnet:45302 apus.svc.eqiad.wmnet:https [15:27:38] ESTAB 0 0 registry2005.codfw.wmnet:58782 apus.svc.codfw.wmnet:https [15:27:41] ?? [15:29:34] same pids? I am wondering if it is related to the swift registry [15:29:34] oh wait ml? [15:29:47] yeah, no - different pids :) [15:29:58] let's restart all the instances for a clean state sigh [15:31:30] sounds good - just hit the -ml ones [15:31:51] alright, I think this is the point where we scap-and-pray [15:44:53] Kemayo: are you still around? if this is effective, the cp of https://gerrit.wikimedia.org/r/1266984 should be live in testservers in about 10m [15:45:03] I am present. [15:45:21] awesome, thank you :) [15:45:57] images are built \o/ [15:46:07] thanks for puzzling through this with me elukey and cdanis! [15:46:40] 🥳 [15:47:42] \o/ [15:49:37] \o/ [15:49:54] * swfrench-wmf starts writing it up [15:50:22] Kemayo: if you could please test now, that would be swell [15:50:23] it reached the test servers! [15:50:35] Sure, it'll just take a minute. [15:50:38] nice :) [15:51:01] swfrench-wmf: Looks good. [15:51:08] Kemayo: awesome, thanks! [16:05:42] well, that was a blast. thanks, all! [16:21:41] thanks swfrench-wmf!
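[Editor's note] The "rsync-like script to transfer image:tag combinations" idea floated above maps, per the Docker Registry HTTP API v2, to roughly: GET the manifest from the source, HEAD each referenced blob on the destination (pushing only the missing ones via POST-then-PUT), and finally PUT the manifest. The sketch below only builds the request plan; auth, actually issuing the requests, HEAD-miss handling, and manifest-list support are all left out, and the hostnames are placeholders, not real endpoints.

```python
def copy_plan(src, dst, repo, tag, layer_digests):
    """Build the (method, url) sequence to copy one image:tag.

    layer_digests: blob digests referenced by the manifest (assumed
    already known here; in practice they come from parsing the manifest).
    """
    plan = [("GET", f"{src}/v2/{repo}/manifests/{tag}")]
    for digest in layer_digests:
        # check whether the destination already has the blob
        plan.append(("HEAD", f"{dst}/v2/{repo}/blobs/{digest}"))
        # if the HEAD 404s, a push is: POST .../blobs/uploads/ to open a
        # session, then PUT <upload-url>?digest=<digest> with the bytes
        plan.append(("POST", f"{dst}/v2/{repo}/blobs/uploads/"))
    plan.append(("PUT", f"{dst}/v2/{repo}/manifests/{tag}"))
    return plan


plan = copy_plan(
    "https://registry-src.example",            # placeholder source
    "https://registry-dst.example",            # placeholder destination
    "restricted/mediawiki-singleversion",
    "latest",
    ["sha256:abc123"],                         # example digest only
)
```

Because the API is content-addressed, the HEAD check makes the copy naturally incremental, which is what makes the "rsync-like" framing apt.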