[07:57:40] o/
[07:57:43] TIL Cindy, the browser bot
[07:57:47] o/
[07:58:18] yes, its VM rebooted and I forgot to restart it
[07:58:46] I updated some things but now it's not very happy, will have to dig into it
[10:20:38] gehel: https://phabricator.wikimedia.org/T385970
[10:35:32] errand+lunch
[10:45:07] restarting mjolnir with pool, timeout and driver memory bump changes
[10:47:09] dcausse now we should be able to restart the subgraph DAGs that time out because of mjolnir hogging the resource pool
[10:47:30] can I go ahead, or is there any gotcha I should be aware of?
[11:02:56] lunch
[11:30:10] lunch
[13:05:15] gmodena: no, feel free to restart them, thanks! :)
[13:05:42] dcausse ack
[13:52:03] sigh... upgraded mwcli to latest but now I need to upgrade the cirrus-integ03 instance to bookworm...
[14:03:38] o/
[14:12:53] FYI, I've decided to repurpose 3 elastic hosts for relforge. Explanation in T386357 ... happy to discuss further if there are any objections
[14:12:55] T386357: Replace current Relforge servers with repurposed Elastic hosts - https://phabricator.wikimedia.org/T386357
[14:16:49] inflatador: why 3, not 2?
[14:21:03] dcausse we'd already requested 3 hosts for relforge (ref https://phabricator.wikimedia.org/T382906 )
[14:21:20] so it will be a 1-for-1 replacement when we get the new elastic hosts
[14:22:13] inflatador: are these the same hw? IIRC relforge had big disks compared to production hosts
[14:23:23] dcausse Y, the new hosts have 3.3T, relforge has 2.7T
[14:26:52] one thing I missed... elastic hosts have 256 GB RAM, but the replacement relforge host quote only says 128 GB RAM, so I need to get a new quote
[14:27:02] inflatador: ack, but the hosts you're taking from prod are 256G machines and the newer relforge hosts are 128?
[14:27:03] yes
[14:27:47] 128 is probably enough for relforge tho
[14:28:35] but that means you'll need to give them back to prod instead of the new machines in the quote you shared
[14:40:16] Just updated T382906 to request a new quote for the relforge hosts (256 GB RAM instead of 128 GB RAM)
[14:41:12] inflatador: since we already approved the ticket, can you ping Rob / Willy directly (IRC / Slack) to make sure this is on track?
[14:41:15] So that'll leave us with too much RAM in relforge, but since we'll be using the much cheaper Supermicro hosts I doubt DC Ops will reject it
[14:41:39] gehel already pinged robh in IRC, he's not in yet, but I spoke to him yesterday and he's aware of the HP issues (but not the new quote)
[14:43:02] good!
[16:01:36] dcausse: retro time! https://meet.google.com/eki-rafx-cxi
[16:03:28] oops
[16:09:55] eqiad cluster looks OK btw, I was concerned about a shard allocation failure that wasn't being explained, but it went away
[16:13:41] Hey buds, FYI, there are some (intermittent) high counts of SearchSatisfaction validation errors
[16:13:42] https://logstash.wikimedia.org/app/discover?#/view/AXMlVWkuMQ_08tQas2Xi?_g=h@ecd6e60&_a=h@34fd230
[16:13:56] getting occasional flappy alertmanager alerts about it
[16:17:38] ottomata is that data hosted from relforge by any chance?
[16:19:49] "I have no idea" is an acceptable answer BTW ;P
[16:47:48] "new" relforge hosts are reimaging... working out, but will be back in ~40
[16:50:13] dinner
[16:50:33] ottomata: thanks, we shipped an A/B test this morning, might be related...
[16:51:12] ottomata oh, ack. Just read
[16:51:14] checking
[16:52:34] hm... but the errors appear to start around 16:00, and we shipped the A/B test this morning
[16:52:42] perhaps train related
[16:53:55] there are quite a few of them in the past days too
[16:58:53] weird, it's just a spike: https://logstash.wikimedia.org/goto/bd3e120394bc0f9d0a2d9d1137a07ae4
[17:00:37] not new, it seems, indeed
[17:00:59] today it seems to correlate with a spike in validation errors in general
[17:01:11] (just by looking at timeseries)
[17:01:38] ottomata: I've seen the meta.domain of some of them: https://logstash.wikimedia.org/app/discover?#/doc/logstash-*/logstash-default-1-7.0.0-1-2025.02.09?id=arId65QBPAEUXp-La7kB
[17:02:19] seems like someone trying to inject something
[17:04:52] oh ho
[17:05:47] wow, someone really trying a SQL eval attack?
[17:09:02] wow
[17:09:07] dcausse good catch!
[17:30:41] I would have appreciated it if they'd tested their attack with the canonical 'Bobby Tables' test: https://xkcd.com/327/
[17:31:29] :)
[17:46:09] relforge is red ATM. I might need some help re-creating the data after the new hosts finish reimaging
[17:51:24] inflatador: ok
[18:21:42] ok, I think I got Cindy working again, will see if it can vote from the new cirrus-integ4 host
[18:22:47] oof. The mjolnir_pool is not behaving as I was hoping. Tasks have executed out of order: https://airflow-search.wikimedia.org/dags/mjolnir_weekly/grid?dag_run_id=scheduled__2025-01-17T18%3A42%3A00.449096%2B00%3A00&task_id=dbn-norm_query
[18:22:53] pausing the DAG again
[18:24:22] :/
[18:44:04] dinner
[22:53:31] ryankemper I reran Puppet and both clusters are up w/4 nodes (relforge1004 is the only one on OpenSearch, so it's expected to be broken). They're both in red status though
[22:54:18] good to hear they joined the cluster!
[23:01:37] Yeah, Elastic clustering is soooo good
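[Editor's note on the mjolnir_pool out-of-order execution above: Airflow pools cap how many tasks run concurrently, but they do not by themselves enforce ordering. Among tasks queued for a pool, the scheduler picks by `priority_weight` (higher first), so tasks from different DAG runs or branches can overtake each other even when the pool has only one slot. A minimal sketch of that slot-plus-priority behavior — plain Python, not Airflow's actual scheduler code; the task names and weights are illustrative:]

```python
import heapq

def schedule(tasks, slots):
    """Simulate pool-style scheduling: up to `slots` tasks run per round,
    picked by priority (higher first), ties broken by submission order.
    `tasks` is a list of (name, priority_weight) in submission order."""
    # Negate priority so Python's min-heap pops the highest weight first;
    # the submission index breaks ties deterministically.
    heap = [(-prio, i, name) for i, (name, prio) in enumerate(tasks)]
    heapq.heapify(heap)
    order = []
    while heap:
        # Each round the pool frees `slots` slots; the scheduler takes the
        # highest-priority queued tasks, not the earliest-submitted ones.
        for _ in range(min(slots, len(heap))):
            _, _, name = heapq.heappop(heap)
            order.append(name)
    return order

# With equal weights, submission order wins...
print(schedule([("a", 1), ("b", 1), ("c", 1)], slots=1))      # -> ['a', 'b', 'c']
# ...but a later, heavier task jumps the queue:
print(schedule([("a", 1), ("b", 1), ("heavy", 5)], slots=1))  # -> ['heavy', 'a', 'b']
```

[If strict ordering within the pool matters, the usual levers are explicit task dependencies or tuning `priority_weight`/`weight_rule` so the intended task always outweighs the others, rather than relying on the pool alone.]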