[00:22:23] +1 to turning this into a jobqueue task or similar :)
[07:20:36] hello people
[07:21:04] In https://phabricator.wikimedia.org/T269519 there was an interesting discussion about Analytics' posix groups
[07:21:51] the TL;DR is that we are planning (if nobody opposes) to experiment with the possibility of people requesting access to 'analytics-privatedata-users' without the need for ssh access
[07:22:10] the use case is outlined in https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Generic_data_access_%28can_go_together_with_the_Team_specific_ones%29%3A
[07:22:31] (we have now only one posix group for our users, not 1000 like we used to)
[07:22:50] I checked the admin module and it shouldn't complain
[07:23:05] but please let me know any doubt/suggestion/etc..
[08:49:44] Hello all! Greetings. I am new to Wikimedia and would like to contribute with SRE projects.
[08:49:44] Please where can I start?
[08:49:44] * Sample repos ?
[08:49:45] * Accounts to create?
[08:49:46] Thanks in advance.
[08:49:47] Python, JS and PHP repos would be preferable.
[08:49:48] https://www.irccloud.com/pastebin/eKW8gIXD
[08:53:56] hello fenn-cs ! You should get yourself set up with an account on wikitech and on gerrit: https://www.mediawiki.org/wiki/Gerrit/Tutorial#Create_a_Wikimedia_developer_account
[09:04:06] <_joe_> effie: we have two more mc* alerting, maybe the same issue?
[09:04:35] yes it is, I saw the task this morning
[09:04:41] it calls for investigation :/
[09:04:55] although I will first reboot one server, out of curiosity
[09:05:28] <_joe_> please do
[09:05:40] <_joe_> volans: Dec 11 09:58:35 cumin1001 check-cumin-aliases[7520]: DC aliases do not cover all hosts: kafka-test1007.eqiad.wmnet
[09:05:55] <_joe_> overall, icinga looks like a sad christmas tree
[09:07:05] effie: I added some info to the task, I think a reboot is not needed, the /etc/network/interfaces file is messed up
[09:07:18] I read that too
[09:07:42] and mw1265 was reimaged yesterday
[09:07:55] what is odd is, everything works fine
[09:07:59] late_command.sh + augeas might be the culprit in here, no idea how though (too ignorant about what those do for the ipv6 interfaces)
[09:08:29] so the extra bits are trying to add something to eno1/64 that doesn't exist
[09:08:37] so it complains, but the rest work (the bits for eno1)
[09:08:59] on mc1033 I'd remove "up ip addr add 2620:0:861:107:10:64:48:155 dev eno1/64" from /etc/network/interfaces
[09:09:12] and then ifup eno1 should work fine
[09:09:28] we have tons of hosts to reimage, if this is something on buster
[09:09:33] we need a perm solution
[09:09:35] <_joe_> elukey: uhm this seems a d-i bug recently introduced
[09:09:52] <_joe_> effie: yes, let's raise this to the I/F folks :)
[09:10:10] <_joe_> what's the task number?
[09:10:13] anyway, I will ack the alerts, but it is not something very urgent
[09:10:25] brb
[09:10:28] _joe_ https://phabricator.wikimedia.org/T270220
[09:10:56] <_joe_> effie: just run "systemctl reset-failed" instead
[09:11:20] can we try to fix that config instead? :D
[09:11:41] <_joe_> I/F has no team tag?
[09:11:48] <_joe_> elukey: that's a d-i bug
[09:12:14] <_joe_> and someone will fix it, but in the meantime, that's a bogus alert that risks masking a deeper problem if it arises
[09:12:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/648238/1/modules/install_server/files/autoinstall/scripts/late_command.sh
[09:12:20] _joe_ --^
[09:12:25] jbond42: helloooo
[09:12:31] <_joe_> elukey: yeah not gonna debug it myself
[09:12:41] <_joe_> but yes, that seems to be the issue
[09:12:48] I think that the above might cause some issues
[09:12:55] <_joe_> definitely
[09:13:08] <_joe_> lemme assign the task to john :)
[09:19:46] checking the cumin_aliases failure, kafka-test1007 is a new test cluster that we are building
[09:20:08] (is part of)
[09:29:54] should be fixed
[09:30:33] <_joe_> elukey: what was the issue?
[09:32:00] _joe_ I added the cumin alias for the kafka-test, but the error msg was a little strange, and since it executed on the 11th the last time I suspect that it ran when kafka-test1007 was in a weird state
[09:32:11] I restarted both service checks and they now pass fine
[09:32:13] <_joe_> ack
[09:33:23] the warning for dumpsdata1003 will go away tomorrow, cirrussearch datasets getting too big for their britches
[09:38:31] I acked all the es shard alerts with T260083 (after chatting with David)
[09:38:32] T260083: Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083
[09:41:09] elukey: ack looking now,
[09:46:19] jbond42: <3
[09:50:23] IMHO I think that the email we had before or an automatic ack+task would be best suited for the cumin alias check
[09:54:58] good morning
[09:55:25] I was looking for your software to execute runbook / cookbook but I can't find it. Does that ring a bell
[09:55:52] hashar: what are you looking for specifically?
[09:56:02] there is spicerack and then there is the cookbooks repositories
[09:56:14] ah spicerack!
[09:56:43] so the idea is to automatize release engineering out
[09:57:09] or to say it otherwise, convert a bunch of list of commands currently on wikitech to a script that one can blindly run (more or less)
[09:57:16] Mukunda did some exploration at https://phabricator.wikimedia.org/phame/post/view/217/runnable_runbooks/
[09:57:28] I recently wrote a wiki page to explain how to do datacenter switchovers
[09:57:50] and my shower thought this morning was: isn't SRE already having something that we might use :]
[09:58:36] so I guess my use case is to define a list of commands in a script/cookbook and be able to run it from a prod server
[09:59:16] hashar: note that the main thing I'm trying to achieve is gradual automation - that is when some steps are still manual but most of the steps can be automated.
[09:59:39] currently cookbooks are root-only but we have T244840
[09:59:39] T244840: Evaluate options for non-root operations with cumin and spicerack cookbooks - https://phabricator.wikimedia.org/T244840
[09:59:41] basically it steps through a list of commands, some of the commands can be just a comment that says "now do this, confirm when it's done"
[09:59:51] twentyafterfour: \o/ to be fair I haven't read your blog post fully but it definitely has caught my attention
[09:59:52] and that should be the way to go IMHO
[10:00:02] and I can surely use a similar system to deploy CI changes for example
[10:00:34] also my tool would run on your workstation but it can make ssh connections to multiple target servers to run the actual commands
[10:01:05] much like ansible so ?
[10:01:16] all this automation is already done in prod, audited and code-reviewed, I strongly suggest to use the cookbooks setup for this
[10:01:38] volans: but it's root only?
[10:02:16] what I was trying to build is a way to automate the remaining train deployment stuff so that we can finally kill this awful thing: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys
[10:02:43] right now this second yes, but next Q we should work to unblock T244840
[10:03:35] Mukunda blog post nails the problem statement pretty well
[10:04:37] twentyafterfour: maybe we can evaluate the spicerack / cookbook and work with volans on adding support to non-root ?
[10:04:54] the idea with gradual automation is that you gradually replace manual steps with automated ones and eventually you end up fully automated. Some things can't go from zero to automated in one jump but gradually working towards automation is more feasible
[10:05:24] cookbooks can do as little as a single step and leave the rest manual, can ask for confirmation between steps and you can do it as gradual as you want
[10:05:27] fwiw
[10:05:31] but hey my thing is a side project on my own time so I'll be doing it regardless ;)
[10:05:51] volans: I'll definitely check that out
[10:05:57] I too :]
[10:06:35] start from https://doc.wikimedia.org/spicerack/master/introduction.html :D
[10:06:49] volans: would you be open to one day give us a quick presentation of spicerack/cookbook ?
[10:07:01] oh
[10:07:06] * hashar RTFM
[10:07:07] can I set it up locally or does it depend on some production infrastructure / service?
[10:08:11] depends what you use, you can develop locally, run tox, and if a cookbook doesn't depend on prod stuff (like call only a public API for example) would also work locally
[10:10:02] cool. looks good but it looks orders of magnitude more sophisticated than the thing I'm working on ;)
[10:10:02] I'll read the blog post and let you know how much it "fits"
[10:10:31] see also https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks
[10:11:30] so spicerack is the orchestrator and the api, the actual run books are in the cookbook repo?
[10:12:16] yes, spicerack is a single point accessor to multiple libraries, already setup for prod
[10:12:18] with spicerack leveraging on cumin for the execution?
[10:12:30] for remote execution yes, for other things no
[10:12:32] like confctl
[10:12:43] E_TOO_MANY_INCEPTIONS
[10:13:10] some things that are more general are being migrated to wmflib (pywmflib in gerrit)
[10:13:21] that can be used and installed everywhere without any root-dependency
[10:13:24] *root-only
[10:13:32] interesting
[10:13:41] https://doc.wikimedia.org/wmflib/master/
[10:14:08] there are a bunch of generic utility functions in scap that I'd like to move out into a general library
[10:14:18] maybe wmflib is the right place
[10:14:35] T257905
[10:14:35] T257905: Spin off common Spicerack modules into a standalone Python library importable anywhere - https://phabricator.wikimedia.org/T257905
[10:15:08] wmflib is to be used by any kind of scripts in the infra, like icinga checks
[10:15:21] spicerack uses wmflib for the things we migrated/are migrating
[10:15:29] but all the automation is done in cookbooks
[10:15:35] right, there are just some useful functions in scap that I always want to import but I don't always want to depend on scap directly
[10:15:57] so maybe they could go in wmflib if that eventually becomes a standard thing everywhere
[10:16:05] depends what
[10:16:07] if it's generic enough
[10:16:29] right, I'm thinking in general here, but yeah only if it made sense
[10:16:41] not suggesting to move all of scap.util in there or anything
[10:17:09] but yeah patches are always welcome, feel free to ping me for specific things so we can see how they could fit
[10:17:27] cool thanks volans, this all sounds quite promising
[10:17:35] definitely
[10:18:13] twentyafterfour: I will be more than happy to automatize more things :]
[10:18:32] a side question volans, have you considered Ansible for the runbook instead of rolling our own?
[10:19:12] we already use puppet for cfg management :D
[10:19:38] ansible has some really not very nice design choices. I don't think it is worthy of running on anyone's production infrastructure.
[10:20:15] it dynamically generates a blob of python code and then uploads that to a server and runs it from /tmp over ssh
[10:20:25] or something like that
[10:20:40] yeah more or less :]
[10:21:12] I am a pretty fan of the overall ansible concept, though looking at the implementation makes me cry :-\
[10:23:10] also spicerack is not inventing much, just a bit of glue to offer all the existing libraries/APIs in a single place (cumin, confctl, netbox, prometheus, etc.) and some higher level abstraction for some common things like puppet, ipmi, etc..
[10:24:18] similar to ansible modules yeah
[10:24:51] I will dig into it and guess later on file in a task for my specific use case
[10:25:14] but all very tied to our infra, to simplify usage
[10:25:27] and leveraging on cumin which has all the nice selectors
[10:25:29] \o/
[10:25:55] slightly related, the non-root support task ( T244840 ) mentions kerberos , is there a plan to one day get rid of per user ssh keys on the servers in favor of kerberos?
[10:25:56] T244840: Evaluate options for non-root operations with cumin and spicerack cookbooks - https://phabricator.wikimedia.org/T244840
[10:26:22] eg login to bastion, get a ticket, get access granted to stuff without having to mess with the admin modules
[10:27:06] not sure on the details yet
[10:27:18] this is a question more for mo.ritz, but he's out
[10:27:51] ok
[10:28:01] looks like a fairly large can of worms to open :]
[10:28:42] thank you twentyafterfour and volans!
[10:33:25] eheheh
[11:36:21] volans: jbond42
[11:36:40] effie: ?
[11:36:47] (I am writing)
[11:37:06] does it make sense to try and work on renaming the eno* interfaces to eth* on the hosts we update ?
[11:37:17] I know this is a long issue that is not going away
[11:37:30] but I think at some point there was a way to rename an inf
[11:37:31] AFAIK no as we're using the predictable name now
[11:37:37] by choice
[11:37:47] you mean the eno* ones?
[11:38:17] yes you're migrating from jessie IIRC
[11:38:24] that's why you're getting the renames
[11:38:30] we're using predictable names everywhere
[11:39:26] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
[11:39:46] alright then
[11:40:04] I thought we'd have the same issue with mw*, but I hadn't checked
[11:40:07] sorry for the trouble
[11:40:20] effie: tl;dr is that although eth0 is much more predictable in most real world situations it is not perfectly predictable so systemd created this ^^^
[11:40:51] which makes interfaces predictable if you are familiar with all net drivers in the kernel tree
[11:41:32] IIRC stretch hosts should already have the new names
[11:41:39] yes they do I just checked
[11:41:48] and you know the interface layout of your devices on the pci bus
[11:44:11] thank you both!
[11:47:06] lol I like some statements like "Stable interface names across reboots"
[11:48:46] sigh, naming and ordering of interfaces
[13:04:55] can someone help me fight reprepro please? I tried to copy a package from a stretch component to a buster component and got this: https://phabricator.wikimedia.org/P13555
[13:10:02] jayme: I can in few minutes if you don't have already fixed by then, I think I've done something similar in the past
[13:12:46] volans: cool. I'm a bit anxious about messing with reprepro I must say as I don't know how to revert when I mess up :)
[13:13:18] yeah it's not the best interface
[13:18:27] volans: understatement, noun
[13:19:04] ehehe
[13:19:15] jayme: I'm here, pvt
[13:54:09] for the curious 'copy' just copies binaries, if you want all binaries of a src package use 'copysrc' :)
[14:58:32] jayme: since moritz is not around
[14:58:40] you have nothing to fear, physically at least
[14:58:45] * jayme hides
[14:59:08] although, he does know where you live
[15:03:24] effie: not sure... as he's probably not on vacation mor.itz might have more time to physically go there :D
[15:04:30] :p
[16:10:05] all puppet fans curious on feedback for something like this https://github.com/puppetlabs/puppetlabs-stdlib/pull/1150 instead of all the ensure_* functions (if it doesn't go into stdlib i can add it to wmflib)
[16:12:46] neat :)
[16:17:30] :) thx
[17:20:55] anyone free to put a quick stamp on https://gerrit.wikimedia.org/r/649929 and https://gerrit.wikimedia.org/r/649938 ?
[17:41:57] thanks sukhe :)
[20:07:31] Is there any chance of movement on T241195? Eager users want py 3.8 for testing…
[20:07:32] T241195: Add python3.8 to buster-wikimedia pyall component - https://phabricator.wikimedia.org/T241195
[20:50:35] James_F: not sure, mo.ritz is out this week, not sure about the next weeks
[20:53:04] Ack. No rush, just it's been sitting around for a while.