[08:55:29] o/
[08:55:41] there are some uncommitted Netbox DNS changes
[08:56:03] for mwdebug2002, I saw that they were removed from prod some days ago, so I'll run the cookbook
[08:56:08] cc: effie --^
[09:07:39] elukey: I decommed them
[09:07:48] last friday
[09:08:19] oh dear, I thought the cookbook would take care of that too
[09:09:02] in theory yes, has it finished? If you have it in a tmux session it may be left hanging for that change
[09:10:48] it finished, unless I only thought it finished
[09:10:59] I am off today
[09:11:24] and I am far away from my laptop
[09:11:56] off
[09:12:04] oof
[09:12:32] don't worry, it is all good, I just added you as FYI, enjoy your day
[09:12:40] could be this one:
[09:12:40] 2025-10-31 11:59:58 jiji@cumin1003 decommission (PID 1623239) is awaiting input
[09:16:04] I need an irc notification for that, I believe
[09:16:42] sorry guys :(
[09:21:32] volans: seems that the cookbook was completed
[09:22:07] I did 2001 and 2002 together
[09:25:06] I checked the cookbook, it was started with mwdebug[2001,2002].codfw.wmnet and it removed only 2001's dns records
[09:25:27] so maybe there are some limits when we decom multiple hosts, when it comes to dropping the dns records
[09:27:45] cheers, thanks folks
[09:29:10] the address was deleted in https://netbox.wikimedia.org/extras/changelog/?request_id=a9c2ff48-9a3c-4b3c-87e9-890e8ab47621
[09:30:16] the commit in the exported dns repo is from Fri Oct 31 11:56:57 2025 +0000, so ~10m earlier
[09:39:41] elukey: I think this might have been an issue with the `netbox_ganeti_{cluster}_sync.service` that didn't remove the VM from Netbox until a later run, not sure why, might depend on what the Ganeti API replied
[09:39:45] 2025-10-31 11:50:04,713 [INFO] Updating VM mwdebug2002 in Netbox
[09:39:49] 2025-10-31 12:06:08,086 [INFO] Deleting VM mwdebug2002 from netbox
[09:41:18] I had done 1001 and 1002 earlier, if that helps
[09:41:36] and the run from the cookbook is at 11:51:11, so in the middle of those two
[09:42:27] mmmh maybe it's a race condition, let me try one thing
[09:43:52] ok I think I have an idea elukey: systemctl start $UNIT returns 0 if the unit is already starting, but it doesn't enqueue another start
[09:45:00] so the timer had already started it at 11:51:00 and it was still running when the cookbook issued another start at 11:51:11
[09:45:11] that didn't run anything
[09:45:29] in the end it all comes down to the TODO at line 372 :D https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hosts/decommission.py#372
[09:50:14] of course there is a TODO
[13:52:43] Hi SRE, can someone tell us if there is anyone working on T408632? ( https://phabricator.wikimedia.org/T408632 ) - This is the issue where VRTS seems to be unable to send any outbound email, which is the only way we can communicate with most of the customers opening tickets there
[13:52:47] T408632: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632
[13:53:38] The service-impacting user story is in T408967
[13:53:41] T408967: VRTS outbound emails not working - https://phabricator.wikimedia.org/T408967
[14:01:00] xaosflux: it is the middle of the night for the mail server admin, but given that he triaged it as High, I assume the answer is yes
[14:03:27] Raine thanks for the note, neither of those tasks is "claimed" - would you say 12 hours is sufficient before we should try to get this escalated up?
Tickets are still getting created, but as far as I can tell we can't send responses to customers.
[14:04:47] xaosflux: can you please update the ticket with that information rather than just IRC? it makes it easier to find for people who aren't awake now
[14:05:47] I sure will, I'm mostly just checking on the expectations for resourcing right now.
[14:06:49] I don't know exactly, I happened to be around when we were first notified, but I am not working on it myself
[14:07:41] the best way to get an answer from the people actually involved is on the ticket
[14:08:39] Thanks, that's sort of the problem - no one has claimed the ticket. :D
[14:16:30] OK, notes added to tickets. Will wait half a day and hope work proceeds and is logged to the ticket for follow-up before looking for the next escalation.
[14:20:11] xaosflux: thanks, I'll keep an eye on it too and poke someone if there is no update by my evening
[14:40:22] xaosflux & Raine: reading back scroll, I'll look at the issues this morning
[14:41:21] oh hi, morning! thanks <3
[15:45:17] on-callers: If no one objects, I am going to do a rolling restart of Cassandra on the sessionstore cluster shortly (JVM upgrade). No impact expected.
[15:45:44] sure, thanks for the heads-up
[17:58:30] federico3: RE: prometheus data for db state - https://grafana-rw.wikimedia.org/d/a972e119-a791-4c4f-9de7-c6a6be58e1e2/federico-s-mariadb-status?orgId=1&from=now-24h&to=now&timezone=utc&var-query0=true
[17:58:48] this reminds me of the getLagTimes.php maintenance script, which we run on a cron and which sends data to Prometheus as well.
[17:59:25] https://codesearch.wmcloud.org/search/?q=getLagTimes.php%7CGetLagTimes&files=%5C.%28pp%7Cphp%29%24&excludeFiles=test&repos=
[18:00:18] that goes into some of the panels at https://grafana-rw.wikimedia.org/d/G9kbQdRVz/mediawiki-rdbms-loadbalancer
[18:38:39] dancy, dduvall or others, can you help me understand a thing that has changed with my gitlab/kokkuri workflow? It seems like the repo URI has spontaneously changed.
[18:39:19] As of a few weeks ago (https://gitlab.wikimedia.org/repos/sre/wikitech-static-docker/-/jobs/635516) it was pushing to quay.io/wikitechstatic/static:latest
[18:39:58] but now (https://gitlab.wikimedia.org/repos/sre/wikitech-static-docker/-/jobs/664665) it pushes to the invalid/incoherent quay.io/repos/sre/wikitech-static-docker/wikitechstatic/static:latest
[18:40:06] I'm pretty sure that I didn't change anything with my build rules.
[18:41:34] Krinkle: https://grafana-rw.wikimedia.org/d/55b4cbb1-961a-42ae-92ff-28d8f6307585/mariadb-weights-and-pooling?orgId=1&from=now-3h&to=now&timezone=utc&var-section_name=s7 something like this?
[18:42:51] andrewbogott: yes, it did change. see https://gitlab.wikimedia.org/repos/releng/kokkuri/-/blob/25159af63c97865b006d28df0b2cf9e450f70d5f/includes/images.yaml#L147
[18:43:22] i did not anticipate cases where folks are pushing to non-wmf registries
[18:43:55] that's reasonable :)
[18:44:52] I'm still reading... can I just set PUBLISH_IMAGE_REPO="" in my variables?
[18:45:17] in your case, you can set `PUBLISH_IMAGE_REPO: ${PUBLISH_IMAGE_NAME}`
[18:45:43] great, will try
[18:47:15] dduvall: you're talking about setting that in ci/cd variables? Or elsewhere? (It doesn't like the $ but I assume I can escape it somehow)
[18:48:45] andrewbogott: in the `.gitlab-ci.yml` file under the job variables, yes
[18:49:08] hmm, maybe do `PUBLISH_IMAGE_REPO: ${REGISTRY_IMAGE}` instead
[18:50:04] it probably doesn't like the `{` outside of a string literal.
do `PUBLISH_IMAGE_REPO: "${REGISTRY_NAME}"`
[18:57:39] dduvall: much better, thank you!
[19:54:43] andrewbogott: np
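For context, a minimal sketch of the job-level override dduvall describes, as it might look in the project's `.gitlab-ci.yml`. This is an illustration under assumptions, not the actual wikitech-static-docker config: the job name, the `.kokkuri:build-and-publish-image` template reference and the `PUBLISH_IMAGE_NAME` value are placeholders; only the `PUBLISH_IMAGE_REPO: "${REGISTRY_NAME}"` override comes from the conversation above.

```yaml
# Hypothetical .gitlab-ci.yml fragment -- job/template/image names below are
# assumptions, not copied from the real repo or from kokkuri's includes/images.yaml.
publish:
  extends: .kokkuri:build-and-publish-image  # assumed kokkuri template name
  variables:
    # Assumed image name, matching the quay.io/wikitechstatic/static:latest
    # target mentioned in the discussion.
    PUBLISH_IMAGE_NAME: wikitechstatic/static
    # Override kokkuri's new default repo prefix so the pushed URI does not
    # pick up the GitLab project path (repos/sre/wikitech-static-docker/...).
    # Quoting the value keeps YAML happy with the ${...} braces.
    PUBLISH_IMAGE_REPO: "${REGISTRY_NAME}"
```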