[07:28:56] <_joe_> Krinkle: s/doing basic search throughout// :P
[07:55:55] as heads-up, I moved rsync::quickdatacopy to a systemd timer, should be harmless https://gerrit.wikimedia.org/r/692587
[08:46:42] volans: quick q we had the other day while trying to figure out how to safely do flink in kubernetes (we did find a different and hopefully better way at the end, but the question stands). So, non-SRE cookbooks... possible? in the roadmap? my search skills are failing me
[08:47:45] akosiaris: in the roadmap, there is cuminunpriv1001 as a testbed host where some progress started. It would depend on k6s to be deployed more widely though (cc moritzm )
[08:51:41] and I see a mention of the 3 headed dog, I 'll pretend I am Eurystheus and go straight for the hiding jar.
[08:51:48] consider the question not asked :P
[08:52:17] you can't undo what you've done :D
[08:52:46] sure I can. all I have to do is build a time machine and go back 5m in time
[08:52:55] how hard can it be ?
[08:53:16] good luck with that
[09:03:36] <_joe_> volans: he has another trick up his sleeve
[09:03:45] <_joe_> this won't be his problem soon :P
[09:05:58] :-D
[09:06:04] eheh
[09:09:19] volans: has cumin dropped support for py3.5?
[09:10:28] kormat: yes but it might still work, why?
[09:10:37] 🤬
[09:10:46] https://integration.wikimedia.org/ci/job/tox-docker/18727/console
[09:10:58] it's EOL since Sep 2020
[09:11:09] it's still what's in use on our debian stretch hosts :P
[09:12:32] why do you need to import cumin there? ofc not to run cumin commands :)
[09:12:58] you're asking why a unit test that depends on cumin functionality imports.. cumin?
[09:13:39] no, why you need to support 3.5 for the parts that are cumin-dependent that will not run on 3.5 ever again
[09:14:17] volans: you're right that that specific part of the code won't run on a stretch how. but now i'm left with trying to decide what unit tests to run under what versions of python
[09:14:26] s/stretch how/stretch host/
[09:14:33] I don't see cumin installed on any stretch hosts: https://debmonitor.wikimedia.org/packages/cumin
[09:14:40] do you install that via venv?
[09:14:45] akosiaris: speaking of the 3 headed dog, both Analytics/DE and ML will need to authenticate to kerberos via kubernetes, and afaics it is not a fun experience. For Hadoop there is https://engineering.linkedin.com/blog/2020/open-sourcing-kube2hadoop but I am not sure how feasible it is
[09:14:49] for dev/ci, yes
[09:15:16] volans: currently i run all tests under all supported versions of python
[09:15:42] kormat: if it makes sense you can easily skip tests with pytest skipif
[09:16:39] volans: i think in this case i need to skip the entire test file, as it imports wmfmariadbpy.RemoteExecution.CuminExecution at the top (which then imports cumin)
[09:17:54] https://docs.pytest.org/en/latest/reference/reference.html?highlight=importorskip#pytest.importorskip maybe?
[09:18:30] actually easier
[09:18:30] https://docs.pytest.org/en/reorganize-docs/new-docs/user/skipping.html
[09:20:30] volans: the release notes for cumin mention deprecating py3.5 support, but nothing about it _breaking_ py3.5 support
[09:22:09] was dropped support (both setup.py and tox.ini). That's what you get when a non-native speaker writes the changelog ;)
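(Sketch, not from the log: a rough illustration of the skipping options volans links above. A per-test skipif marker does not stop the test module's own top-level import from running, and pytest.importorskip catches ImportError but not the SyntaxError kormat reports below, so one way to skip the whole cumin-dependent file on old interpreters is a conditional collect_ignore in conftest.py. The test file name here is hypothetical.)

    # conftest.py -- sketch only.
    # Skip collection of the cumin-dependent test file on interpreters where
    # cumin can no longer even be imported (Python < 3.6).
    import sys

    collect_ignore = []
    if sys.version_info < (3, 6):
        collect_ignore.append("test_cumin_execution.py")

    # For tests that can at least import their dependencies, a marker works instead:
    #   @pytest.mark.skipif(sys.version_info < (3, 6), reason="needs cumin / py3.6+")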
[09:22:24] you can also change your setup.py and have different versions for different python versions
[09:23:38] 'cumin <= 4.0.0; python_version < "3.6"'
[09:23:41] pytest.importorskip fails with SyntaxError
[09:24:12] 'cumin; python_version >= "3.6"'
[09:24:26] I don't recall if the spaces are ok or not
[09:24:30] i'm not familiar with the syntax here. what i currently have is `extras_require={"cumin": ["cumin"]},`
[09:25:33] ripestat interface has had a very nice facelift https://stat.ripe.net/app/launchpad/91.198.174.192
[09:25:42] I think it should work the same, that's the syntax for the install_requires so should be the same
[09:25:44] elukey: k6s via k8s? This smells badly immediately
[09:25:47] why btw ?
[09:26:05] I mean, is it a use case that might be served somehow else?
[09:26:14] the first "cumin" is the key install foo[cumin], so you can try changing the last "cumin" with 2 items
[09:27:16] akosiaris: nono I mean kubernetes pods/containers that can authenticate to kerberos
[09:27:36] volans: https://phabricator.wikimedia.org/P16092
[09:28:06] akosiaris: for example, in ML we'll need to fetch data from HDFS/Hive to train models
[09:28:10] (on kubeflow)
[09:28:17] elukey: yeah I got that, sorry for not phrasing it correctly, I got absorbed on the k6s/k8s pun
[09:28:23] ahahahah
[09:29:10] kormat: cumin <= 4.0.0 without semicolon
[09:29:53] ah hah. i think that's working.
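(Sketch, not from the log: the extras_require / environment-marker arrangement kormat just got working, using the markers volans suggested above. The extra name follows the `extras_require={"cumin": ["cumin"]}` snippet earlier; the exact pin is the one volans proposes for Python 3.5 and is illustrative.)

    # setup.py fragment -- sketch of a conditional "cumin" extra.
    from setuptools import setup

    setup(
        # ...
        extras_require={
            "cumin": [
                # Older interpreters get the pin volans suggests for 3.5:
                'cumin<=4.0.0; python_version < "3.6"',
                # Everything newer gets current cumin:
                'cumin; python_version >= "3.6"',
            ],
        },
    )

With that in place, e.g. pip install '.[cumin]' (or the equivalent tox deps entry) resolves the right cumin version for whichever interpreter is running the tests.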
[09:30:02] kube2hadoop omg
[09:30:16] yeah
[09:30:23] trying to make 2 completely different auth systems compatible to each other, what could possibly go wrong
[09:32:25] so, a set of pods and an initcontainer in every pod and an admission controller
[09:32:29] I dropped the "what could possibly go wrong" after starting with Kubeflow and kubernetes
[09:33:23] kormat: the other option, if you don't plan to have the with_cumin version available on stretch is to just have tox run the tests accordingly, up to 3.5 with just your software as deps and for 3.6+ the one with_cumin support
[09:33:31] complexity is asking whether we can make the building taller, its head is hitting the roof too often.
[09:35:24] <_joe_> akosiaris: I see you finally saw it
[09:35:28] why does it need to go through the name node ?
[09:35:35] <_joe_> (kube2hadoop)
[09:35:38] I mean it has a superuser keytab, can't it talk directly to KDC ?
[09:35:51] <_joe_> I've been trolling elukey for weeks with it
[09:36:00] not that this is the problem, that's actually the least of my concerns with this thing
[09:36:36] akosiaris: hadoop implements a token service on top of kerberos, where tokens are not krb tickets, but something implemented as "lightweight" auth mechanism to avoid hammering the KDC
[09:36:59] and HDFS tokens are distributed by the namenodes, once you have authenticated to kerberos
[09:37:01] ah the hadoop delegation tokens
[09:37:02] ok
[09:37:06] exactly
[09:37:34] I wonder how many initContainers are there going to be in the end there
[09:37:53] it's this + then the istio ones for sure
[09:38:14] debugging failures is going to be so much fun
[09:39:10] We’ve also considered other solutions for accessing HDFS from Kubernetes. The most straightforward way would be to have the user fetch the delegation token before submitting the job and attach the delegation token as a Kubernetes Secret. However, since a namespace is not exclusive to a single user, any user within that namespace could access all
[09:39:11] the secrets without specific resource-based access control (RBAC) rules, thus failing the security requirement
[09:39:31] there you go, make a namespace exclusive to a single user and you don't need all of this
[09:41:25] the fetch delegation token part might be tricky, they have an expiry time and they are not stored on disk like krb tickets
[09:41:36] it is all done by the hadoop api behind the scenes
[09:42:44] and Joe already trolled me a lot so I suspect I am failing to recognize clear attempts of making fun of me :D
[09:43:19] the ephemeral nature of it is indeed an issue, but I don't see why the "stored on disk like krb tickets" is an issue
[09:43:24] another thing that we (ML) can do could be to load data to another storage like Cassandra every time, and access it via user/pass auth
[09:44:05] but it is a duplication
[09:44:26] anyway, as you mentioned istio + knative is enough for the moment :D
[09:44:48] this would only apply for the "private" data btw, right ?
[09:45:10] all the training datasets in theory
[09:45:34] the marvellous Feature stores :D
[09:45:54] I haven't run away screaming yet today, haven't I ?
[09:46:01] maybe it's about time
[09:46:02] not yet
[09:46:18] can I follow you in case??
[09:46:19] fwiw, I 've read and still haven't understood what feature stores are
[09:46:52] I get the 3km view, but not how they are implemented and what their failure modes are
[09:47:39] yeah same problem, together with the fact that there are only two opensource implementations (feast and Hops) that do completely different things
[09:48:00] the easy part is handling metadata about feature datasets, versioning etc..
[09:48:20] where to store them and how to retrieve them for training is a big problem
[09:49:01] Feast solves it letting Spark do the heavy lifting in the background, in our case it would leverage Hadoop for computation and store datasets somewhere (HDFS,Swift,etc..)
[09:49:13] but Spark needs to authenticate of course
[09:49:39] Hops uses a Hive endpoint, that doesn't support kerberos with hadoop
[09:49:51] (they forked hadoop a while ago and used only TLS auth for everything)
[09:50:18] that leaves us with.. custom solutions from big players that are not open source, and explained briefly in conference talks
[09:51:37] so I am very confused
[09:57:22] _joe_: before I dig deep in the rabbithole, could the error in https://integration.wikimedia.org/ci/job/helm-lint/4144/console be somehow related to your Rakefile changes yesterday?
[09:57:53] <_joe_> possibly
[09:57:57] <_joe_> let me take a look
[09:58:17] <_joe_> oh sigh yes
[09:58:53] <_joe_> I must have forgot a begin.. rescue somewhere
[10:08:50] <_joe_> akosiaris: on it in 5 minutes, I'm finishing another thing
[10:10:35] <_joe_> akosiaris: what's the gerrit change?
[10:11:19] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/692736
[10:12:33] <_joe_> akosiaris: uh ok I see, maybe something funny with rebases
[11:04:27] <_joe_> akosiaris: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/692870
[11:05:06] <_joe_> somehow I forgot that guard for deployments, I had it for charts though :/
[11:09:56] <_joe_> yup now the output of legoktm's patch nicely shows the manifests for all three deployments https://integration.wikimedia.org/ci/job/helm-lint/4145/console
[11:12:10] <_joe_> it might make sense to create an artifact with the diffs to make them easier to read at some point, but for now this is good enough IMHO
[11:37:19] Sorry for the offtopic but it seems there's something going on regarding freenode
[11:37:30] and given that devs/staff use it for day-to-day activities
[11:37:37] I thought you'd be interested
[11:38:00] tl;dr: freenode staff is resigning en masse due to some sort of company takeover
[12:41:14] elukey: more fun for you: https://medium.com/polymatic-systems/service-mesh-wars-goodbye-istio-b047d9e533c7
[12:41:23] it's a rant, but it does have some valid points
[12:42:29] "Finally, and most importantly, Istio deprecated Helm deployments in favor of their istioctl command line utility… and then they brought the Helm deployments back in a later version"
[12:42:39] ah, that explains what we were trying to figure out the other day
[12:49:27] I had a chat with jayme today about it :D
[12:50:27] going to read the article in a bit
[13:34:34] cdanis: thanks again for the expanded docs, but one question: https://wikitech.wikimedia.org/wiki/Dbctl#Removing_/_decommissioning_a_host - if you don't do the first step, does anything bad happen?
[13:35:09] it depends ;)
[13:35:30] if the host was primary for a section, not just a replica, yes
[13:36:15] lol, well yes, ok.
[13:41:28] in the usual case I think it will be okay, but there are some consistency checks you gain by doing it this way
[14:42:43] elukey: it's my link feeding day for you: https://twitter.com/QuinnyPig/status/1394713357291122692
[14:48:39] ahhahahaha
[15:19:51] kormat: the purge on pc1010 has completed
[15:19:53] > pc255: ... found 339850 where exptime less than 20210528000000, deleted 339846 rows.
[18:26:00] jbond: re https://gerrit.wikimedia.org/r/c/operations/puppet/+/689786 I've still been unable to successfully re-image wdqs2003, and could use some help investigating
[18:27:36] Running `vsp` on `wdqs2003.mgmt.codfw.wmnet` while running the re-image script from cumin, I get dropped into `grub rescue`. Also trying to enter the install console via the puppetmaster doesn't work (connections to port 22 time out)
[20:20:44] hm, google is failing me for this one. How can I programmatically detect if a debian package is or is not installed? I'm interested in a particular package which is in the 'rc' state; dpkg -s and dpkg -
[20:20:53] -W both return success even though it's removed
[20:21:09] and apt-l return success in every case
[20:24:21] 'if dpkg --get-selections | grep -q "^$pkg[[:space:]]*install$" < /dev/null' seriously?
[20:24:25] if it's in rc state, then it's deinstalled, but only conffiles are still present
[20:24:35] can you just check for a specific file like the copyright file?
[20:24:42] you can use "dpkg -L PACKAGE" to print the files still owned by it
[20:25:49] the Icinga DPKG check also just does "dpkg -l | grep ^ii", so dpkg -l | grep '^ii '
[20:25:53] ?
[20:29:23] <_joe_> andrewbogott: if dpkg -l $pkg | grep -q ^ii; then .. is probably the most efficient way to check if a package is effectively installed
[20:29:30] <_joe_> I don't dare look at what puppet does
[20:29:34] moritzm: It's not that I don't understand what's happening, it's just that I literally want to know whether or not the package is installed. Which is not a question dpkg answers :(
[20:29:51] ok, thanks all, I'll use the (obviously wrong and complicated thing :)
[20:30:01] <_joe_> andrewbogott: it does! dpkg -l $package_name gives you the info
[20:30:05] <_joe_> you just have to parse it
[20:31:14] yea, dpkg -l answers both desired state and actual states, hence the 2 i's if it is desired and actually installed
[20:33:21] the other possible states are listed in the first couple lines at the top
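(Sketch, not from the log: the same "is it actually installed?" check written in Python, for the "programmatically detect" case. It mirrors the dpkg -l | grep '^ii' idea above but asks dpkg-query for the ${Status} field directly, so an 'rc' package — removed, conffiles only — counts as not installed. The function name is made up.)

    # Sketch: dpkg-query reports e.g.
    #   "install ok installed"        for a fully installed package
    #   "deinstall ok config-files"   for the 'rc' state discussed above
    import subprocess

    def is_installed(package):
        try:
            status = subprocess.check_output(
                ["dpkg-query", "-W", "--showformat=${Status}", package],
                universal_newlines=True,
            )
        except subprocess.CalledProcessError:
            # dpkg-query exits non-zero for packages it knows nothing about
            return False
        return status.strip() == "install ok installed"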
[20:43:15] moritzm: I'm not sure I understand your comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/691960 -- the logic in that patch is meant to address the discussion about chrony in phab and that other patch you referenced.
[20:44:39] anyway, I amended my patch with a long comment explaining the terrible thing it does :)
[20:57:41] well, if the issue is in a broken base image, then we should start with a fixed one and not kludge over it in Puppet
[20:58:26] what's different with the creation of the bullseye image compared to what we used for buster?
[21:00:55] moritzm: context is in https://phabricator.wikimedia.org/T280801, you can scroll to the bottom
[21:01:31] yes, so we need to use a proper base image which doesn't have this issue
[21:02:31] 'this issue' is just chrony being installed, that doesn't strike me as fundamentally broken
[21:02:46] I'd like our cloud to not be ultra-sensitive to variations in upstream images
[21:03:02] But if you have a way to make the debian folks build different stock cloud images I'm fine with that!
[21:06:00] you'll run into other issues down the line where the Debian cloud image will behave differently with our puppet.git (as opposed to bootstrapping a minimal image as was done for up to buster)
[21:06:42] if that's the path to be taken, then at least make those workarounds in the Cloud VPS-specific profile and not in standard_packages where it affects all of prod
[21:08:20] moritzm: Buster is also built from an upstream Debian image, but the upstream build process changed recently
[21:12:05] and it will likely change again and again, hence my suggestion to bootstrap a minimal image to minimise differences to prod :-)
[21:29:54] yeah, it's a risk. The tool we've used historically for base images doesn't really exist anymore so dealing with puppet irregularities is likely to be the easiest solution in the near term.