[12:54:24] \o
[12:54:31] o/
[12:55:50] ebernhardson: Tested the image recommendation successfully via airflow-devenv. Only curiosity (for me) was that I saw two yarn applications: a skein (wrapper?) and a SPARK one (that did the lifting). Would you know why we need that extra layer of skein? I was expecting this (python venv setup etc.) to happen as part of the spark-submitted application.
[13:03:52] pfischer: hmm, looking through my history. I remember agreeing to it, but trying to remember why exactly it was necessary
[13:04:58] this should be the design doc: https://docs.google.com/document/d/1hp6JYVy3SLRgTx1BYfnNOCPk5VFJeZ4jMpxD8WJKVB0/edit?tab=t.0
[13:05:38] curiously skein isn't mentioned in there :P
[13:09:56] iirc it was something related to dependency management, ensuring disk space is available, auto-cleanup, etc.
[13:10:16] basically that random airflow things have different requirements, and with skein it was more obvious how to ensure everyone could configure as necessary?
[13:10:31] ottomata: maybe you remember? :) ^^
[14:12:22] We have a retro scheduled for today. Is there anything you’d like to discuss or shall we postpone until August when David is back?
[14:12:31] pfischer: hmm, i don't have anything specific
[14:18:35] I'm fine w/cancelling
[14:55:24] postponing is fine
[15:07:26] Alright, let's postpone then.
[15:50:47] ryankemper: the wdqs federated rkd queries thing still has the same errors as yesterday, poking wdqs-blazegraph.service on wdqs1011 looks like it never got restarted. Maybe somehow cumin isn't returning all the hosts for A:wdqs-all? not sure how though, that's literally what it was made for
[16:04:20] Unless Ryan updated the ticket otherwise, my guess is that we never restarted the services yesterday
[16:05:05] inflatador: he said yesterday "trying again with semicolons", i think he did re-run it
[16:05:14] but will have to wait and see i suppose
[16:05:35] I know I also didn't fully apply the opensearch plugins packages recently, so I'm going to write some playbooks to verify these operations when I have time
[16:06:08] yea a verify step might be warranted on a variety of things we do. We've made the same mistake reindexing before and missing an index or two
[16:06:11] (or more)
[16:13:08] workout, back in ~40
[16:15:02] ebernhardson: ah yeah I just derped and saw 1/38 fail and restarted the manual fail yesterday and neglected that the 1 failure made it not run on the rest of the ones
[16:15:03] fixing
[16:16:25] okay, added the -p flag to lower the % from 100. third time's the charm :)
[16:17:06] pfischer: i am not familiar with recent work, especially with how things might have changed in k8s world. But! i think launcher=skein setting default comes from https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/dag_default_args.py?ref_type=heads#L158
[16:18:33] it was needed in non k8s world to keep driver off of the airflow scheduler (but allow for showing spark driver logs in the airflow UI)
[17:02:30] back
[17:42:12] alright, wdqs should be all sorted now
[17:49:05] ryankemper: yup, looks good. thanks!
[18:02:33] lunch, back in ~1h
[19:09:43] back
[19:14:05] ryankemper possibly dumb question, is there a reason why the wdqs restart cookbook wouldn't have worked for the allowlist update? ref https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/wdqs/restart.py
[19:14:57] inflatador: I did run it first, it bombed out due to categories not existing on some hosts anymore
[19:15:09] in retrospect just patching that right away would have been faster :P
[19:15:19] ryankemper ACK, I figured it was something like that. Thanks for confirming ;)
[19:34:04] may also want to update T398820 so the user knows it should be working now
[19:34:05] T398820: Add RKD to WDQS allowlist - https://phabricator.wikimedia.org/T398820
[19:53:59] ryankemper I can't make pairing today, have to take my son to swimming
[19:55:07] created T399900 to talk about verification scripts
[19:55:07] T399900: Create verification scripts for common operations - https://phabricator.wikimedia.org/T399900
[20:00:49] ebernhardson: regarding the settings for the image_recommendation DAG: would you be fine using Airflow variables for the kafka brokers? Would you know if those variables survive a restart (do they live somewhere in puppet?) Or would you go for a different config source after all?
[20:06:16] heading to the pool, will be back online in ~30
[20:47:23] sorry, been back awhile
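
Note on the verify step discussed above (T399900): a minimal sketch of what such a check might look like, assuming you already have the list of hosts that should have been restarted and some way to collect each host's last service start time. The function name `find_unrestarted`, the host names, and the timestamps are all hypothetical and only illustrate the idea; this is not an existing cookbook or cumin feature.

```python
from datetime import datetime


def find_unrestarted(expected_hosts, last_restart, changed_at):
    """Return the hosts whose service has not restarted since changed_at.

    expected_hosts: iterable of host names that should have been restarted.
    last_restart: dict mapping host name -> datetime of the last service start.
    Hosts missing from last_restart are reported too, since nothing confirms them.
    """
    return sorted(
        host
        for host in expected_hosts
        if host not in last_restart or last_restart[host] < changed_at
    )


if __name__ == "__main__":
    # Made-up data purely for illustration.
    changed_at = datetime(2025, 1, 2, 12, 0)
    hosts = ["host-a", "host-b", "host-c"]
    observed = {
        "host-a": datetime(2025, 1, 2, 12, 30),  # restarted after the change
        "host-b": datetime(2025, 1, 1, 9, 0),    # still running the old config
    }
    print(find_unrestarted(hosts, observed, changed_at))
```

Treating hosts with no observation as failures is a deliberate choice here: a run that aborts early (as with the 1/38 failure) leaves exactly that kind of gap, and the check should surface it rather than stay silent.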