[08:41:02] o/
[08:44:29] o/
[08:45:33] just realized this morning that we had a feature selection task running from wed 12 feb
[08:45:54] it somehow was not properly reported by the airflow UI
[08:46:01] :/
[08:46:34] actually, airflow was reporting the task as running on yesterday's schedule
[08:47:14] i killed the stale spark, and that marked the 2025-02-17 airflow task as failed
[08:47:20] this behaviour is fishy
[08:48:13] i think that whole dag run is borked. I'll skip over to training 20250206
[08:49:08] sure
[08:50:35] I see "Too small initial heap" on this morning's run
[08:50:44] the driver failed immediately
[08:52:08] oh, "--driver-memory 24" might be missing the G unit
[08:52:44] ah shoot
[08:52:50] let me check
[08:53:59] yes I missed that when reviewing: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1071/diffs#b0aea176d2454011293a3aa2b0ca4d355436fbb2_236_234
[08:54:50] the skein mem is ok but I totally missed that the driver mem was so low
[08:57:24] gmodena: would you have time this morning to pair on CirrusSearch?
[09:01:09] dcausse yes, absolutely. I am free as soon as I finish fiddling with mjolnir
[09:01:19] sure
[09:03:46] fix for the missing driver mem unit: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1076/diffs
[09:07:51] approved, thanks!
[09:20:01] merging
[09:20:22] dcausse happy to pair whenever you have time.
[09:21:05] gmodena: https://meet.google.com/dnn-vvrp-yzw
[10:30:05] errand + lunch
[11:06:35] cindy is happy, airflow less so...
[11:07:37] we need to bump the heap size for the jvm that runs Skein's feature selection master app
[11:08:31] asked on slack how to do it with the wmf_airflow wrapper, hopefully it's just a matter of passing around an env var
[11:08:36] lunch + workout
[12:30:54] Is it possible to get json from Special:MediaSearch? format=json doesn't help, e.g.: https://commons.wikimedia.org/w/index.php?search=dog&title=Special:MediaSearch&type=other&format=json&filemime=pdf
[12:37:36] new small patch to tweak mjolnir mem incoming
[12:38:07] i wonder if we should refactor its config into runtime settings
[13:30:55] dcausse: we're in https://meet.google.com/cke-qwgr-sse
[13:31:50] Someone1337: I'm not sure that MediaSearch has an API endpoint; I'm asking on some other channel
[14:15:39] o/
[14:37:05] o/
[14:38:16] dcausse i got a bit lost in skein / mjolnir...
[14:38:29] Would you have time for some pairing after 16:30, or maybe tomorrow morning?
[14:43:41] gmodena: sure, 16:30 works for me
[14:51:14] dcausse thanks!
[14:57:15] dcausse no rush, but if you wanna look at the morelike P95 stuff LMK. I was gonna check if there were other things happening with MW at the same time
[15:03:03] We might also consider turning on the CPU performance governor like we did for WDQS
[15:09:59] also looking at the per-node percentiles to see if this is localized
[15:12:05] inflatador: the issue is not only affecting morelike but almost all search types
[15:13:11] inflatador: I had a quick look at the various graphs but noted nothing unusual; Gabriele noted that in one instance the P99 was particularly bad on two nodes of the psi cluster
[15:13:35] odd that the alerts are coming from the k8s prometheus, but I guess the values are the same since we can see them on the ops graph
[15:14:01] inflatador: what do you mean?
[15:15:30] Just that the alert emails say `prometheus = k8s`, which suggests the values are coming from the k8s prometheus instance
[15:17:32] yes, mw is running on k8s and pushing to prometheus; the elastic-percentiles graph is sourcing data from thanos, which should be the same (unless mw@codfw suddenly decides to query elastic@eqiad)
[15:20:28] where did y'all find the badly-behaving nodes? I've been looking at the per-node percentiles page (https://grafana.wikimedia.org/goto/PSqJxTcHR?orgId=1) and so far don't see the same latency issues that we're getting from the CirrusSearch metrics
[15:23:01] in that graph you pasted I can see a big bump on 3 psi nodes (1081, 1086 and 1090)
[15:23:02] nm, I think I see it now
[15:24:33] but unclear if related; psi & omega might not serve a lot of traffic, and p99 could go very high just because of a few slow queries
[15:25:29] https://fault-tolerance.toolforge.org/map?cluster=elasticsearch is a new tool for mapping out the DC... checking to see if those are on the same row/rack
[15:26:03] nice
[15:26:05] and the answer is no
[15:28:03] I also think I might have to rebuild the plugins package based on https://etherpad.wikimedia.org/p/relforge-os-migration#L44
[15:29:30] inflatador: the list of mandatory plugins should be a config in puppet I think
[15:29:49] dcausse So MediaSearch doesn't have an API endpoint?
[15:31:37] Someone1337: I don't see one and haven't gotten a response yet from the maintainers of this SpecialPage; I think it's best to file a task in phabricator and tag it "MediaSearch"
[15:32:16] dcausse ACK, I think I messed this one up...will diff/fix the plugins as time permits
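For reference, both Elasticsearch and OpenSearch support a `plugin.mandatory` setting that makes a node refuse to start when a listed plugin is not installed, which is what the puppet-managed config above would carry. A minimal sketch of such an entry; the plugin names here are placeholders, not the actual relforge list:

```
# opensearch.yml / elasticsearch.yml -- sketch only; the real list lives in puppet
# and these plugin names are placeholders, not the actual relforge set
plugin.mandatory: analysis-icu,extra,experimental-highlighter
```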
[15:32:47] dcausse I'm around if you are free, but no rush. I'll idle in the meet from this morning https://meet.google.com/dnn-vvrp-yzw
[15:33:59] joining
[16:49:59] dinner+kids. I'll be back online later tonight
[17:11:56] physical therapy, back in ~1h
[18:26:33] dinner
[19:08:20] some examples of articlecountry weighted_tags: https://ab.wikipedia.org/w/api.php?action=query&prop=cirrusdoc&pageids=807&cdincludes=weighted_tags
[19:10:16] dcausse thanks!
[19:14:29] back
[19:16:24] SUP got lagged; unsure why, but I suspect it's due to me backfilling articlecounter. Going to deploy an optimization to reduce the number of weighted_tags being pushed for the PageAssessments extension during the next backport window
[19:16:37] s/articlecounter/articlecountry
[19:17:31] context is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageAssessments/+/1088592 & https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1
[19:18:00] back at ~10pm for the deploy
[19:20:00] There's a major outage going on ATM, not sure if that would affect the SUP though
[19:56:54] inflatador is it a global outage, or specific to search services?
[19:58:57] gmodena it's varnish/haproxy (the CDN, basically). I'm following along in #security. I think things are actually back up now
[20:04:36] oof
[20:12:25] inflatador (and others who love command line tips and tricks)... I recently learned about `vim -d file1 file2`, which does a very nice (though colorfully ugly) diff, and it is now my preferred command line diff tool. If you are comfortable with vi, it's great!
[20:14:29] Trey314159 Thanks, I'll make a note. `diff` is frequently inadequate for my needs (or maybe I should RTFM more ;P )
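For anyone wanting to try the tip above, `vimdiff` is the same thing as `vim -d`; a quick usage sketch, with git's difftool integration included only as an illustration:

```
# side-by-side diff in vim; vimdiff is equivalent to vim -d
vim -d file1 file2
vimdiff file1 file2

# inside the diff view: ]c / [c jump between hunks, do / dp pull or push a hunk,
# and :diffupdate re-scans after edits

# git can use it as a difftool as well
git difftool --tool=vimdiff -- path/to/file
```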
[20:44:21] not a lot of details, but here's the ticket for the outage I was talking about earlier T386740
[20:44:21] T386740: 503 Service Unavailable on all production - https://phabricator.wikimedia.org/T386740
[20:48:28] inflatador thanks!
[20:48:46] Trey314159 TIL
[20:59:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1120654 CR for the relforge migration if anyone has time to take a look
[22:01:15] well actually the optimization did not have the impact I was hoping for :(; I think it needs https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageAssessments/+/1120671 as a follow-up
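For reference, the cirrusdoc prop linked earlier returns the weighted_tags as JSON, so the values being pushed can be spot-checked by hand with any HTTP client. A minimal sketch with curl and jq; the jq path is an assumption about the response layout and may need adjusting:

```
# inspect weighted_tags for a page via the cirrusdoc prop (sketch; jq path assumed)
curl -s 'https://ab.wikipedia.org/w/api.php?action=query&prop=cirrusdoc&pageids=807&cdincludes=weighted_tags&format=json' \
  | jq '.query.pages[].cirrusdoc[].source.weighted_tags'
```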