[07:27:33] Check systemd state is returning "CRITICAL - b'degraded': unexpected" on a host
[07:27:58] the error is real, but not sure the b'degraded' is expected
[07:28:06] Cas merged a patch to port to Py3 yesterday
[07:28:45] https://gerrit.wikimedia.org/r/c/operations/puppet/+/624733/
[07:28:55] thanks, I will report it
[07:29:14] it may be just a display glitch
[07:29:37] as it seems to work as expected, in terms of firing/not firing
[07:31:42] yeah, I don't think we need to revert for now
[07:32:11] I've reported it, as bin -> utf8 conversion can potentially have deeper issues
[07:33:00] ack
[07:33:01] strangely it checks for the UnicodeError exception
[07:33:09] but may be missing a conversion
[07:34:09] it does decode, so not sure why it shows that
[07:35:24] actually, it is a real bug, I think it strips before decoding
[07:35:52] I will send a quick patch
[07:39:06] andrewbogott: hi there. let me know if you have any questions about reuse-parts
[10:54:45] vgutierrez: does this look familiar or something?
[10:54:56] https://www.irccloud.com/pastebin/dDwnqrPv/
[10:55:21] arturo: you're missing a restart on the API server
[10:55:33] ack
[10:55:42] arturo: we cannot automate that because it would trigger puppet failures across the acme-chief clients
[10:56:11] in prod we temporarily disable puppet on the clients, restart the acme-chief API server and then re-enable puppet again
[10:56:25] but it's a different environment.. we don't auto-upgrade packages either
[10:59:37] it worked! thanks vgutierrez <3
[10:59:44] np :)
[11:00:02] that's been triggered by acme-chief 0.29 introducing the alternate chain parts
[11:00:24] the *.alt.* files
[13:27:43] kormat: thanks. I made a couple of attempts last night but both failed. Would you expect it to work with lvm partitions?
[13:28:01] lvm is supported in general, yes
[13:28:16] if you can point me at a target machine, i can have a look
[13:31:43] I reverted my changes but I'll put them back :) The recipe I made (which was a really half-hearted attempt) is reuse-labvirt.cfg
[13:31:56] an example target is cloudvirt1014.eqiad.wmnet
[13:34:14] ah, this is a very simple case. that's a relief :)
[13:34:25] kormat: for context, 1014 is a host where I don't actually care about the /srv data; I'm practicing this for a future system where I actually do care.
[13:34:34] 👍
[13:36:46] the reformat-everything partman recipe is labvirt-ssd.cfg. It probably has a fair bit of cruft in there; it's from many years ago.
[13:43:03] replied on the CR
[13:44:41] kormat: I'm pretty sure the recipe as it is reformatted my data partition. It's somewhat possible though that I was running an old recipe, because the install server (which apparently isn't install1003 anymore?) hadn't run puppet yet.
[13:44:47] I'll re-apply and give it another try
[13:44:59] (I'm interpreting your comment as 'this should work,' correct?)
[13:45:04] (correct :)
[13:45:33] do you know all the places where I need to force puppet runs after a patch like this is merged?
[13:46:12] let me check my notes :)
[13:46:37] andrewbogott: i think it's apt[12]001
[13:46:54] ok
[13:47:24] install* are the actual dhcp/pxe servers, but the installer loads from the apt servers, aiui
[13:56:32] yes, that's correct
[14:04:40] kormat: I'm reimaging cloudvirt1014 with those changes and on the console I see a [!!] Partition disks dialog as though partman isn't active at all :(
[14:05:14] mm. mind if i connect to the console?
[14:05:54] not at all, I'll disconnect
[14:06:07] be warned these are HPs
[14:06:22] 'vsp' to connect to the serial console
[14:06:35] `Unable to negotiate with UNKNOWN port 65535: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1,diffie-hellman-group1-sha1` lovely
[14:08:27] ok, i'm in
[14:10:37] oh this is weird. i selected 'go back', and it took me to the partitioning screen, with everything correctly configured
[14:11:09] huh
[14:11:31] I guess if that is the procedure I'm maybe ok with that, since I'm only going to use this twice
[14:11:40] `install_console cloudvirt1014` isn't working from puppetmaster1001 either
[14:12:18] have to do it from the cumin host, those hosts are on a weird network
[14:12:30] want me to ctrl-c out of my current reimage script?
[14:12:59] yeah, please. let me take a look at what's going on here
[14:13:22] ok, I'm all clear
[14:14:20] kormat: have you used the FQDN?
[14:14:31] for install_console, that is
[14:14:39] volans: yes. issue was what andrew said about needing to use a cumin host
[14:14:50] ah ok
[14:15:31] andrewbogott: from what i can see from the logs, reuse-parts worked correctly. i notice a bunch of ssh connections. is that part of your reimage script?
[14:16:38] kormat: as far as I know I'm not connected anymore
[14:17:35] andrewbogott: https://phabricator.wikimedia.org/P12842
[14:18:15] oh, that could be the wmf-reimage script i guess
[14:18:20] it must be
[14:18:20] yes, it's polling
[14:18:52] to know when d-i finishes and reboot the host
[14:22:37] andrewbogott: my only guess at this point is that some keystroke on the console caused it to go to that screen
[14:22:51] the logs show it was waiting at the usual partitioning screen here: Sep 29 14:10:16 debconf: --> GO
[14:22:57] ok
[14:23:05] so shall we try it again w/out a console attached?
[14:23:05] and didn't continue until here: Sep 29 14:18:43 debconf: <-- 0 ok
[14:23:17] wait
[14:23:23] this last run I wasn't attached
[14:23:36] and then after waiting a while I attached, and the first thing I saw was that screen
[14:23:50] (that's not a strong case for not trying again though)
[14:23:51] maybe attaching to the serial console causes some character to be sent
[14:24:13] maybe
[14:24:21] that's all i've got anyway :)
[14:24:24] so you think the way forward is to never attach, or to be attached from the get-go?
[14:24:42] in test mode (i.e. using `reuse-parts-test.cfg`), i would normally be attached from the start
[14:24:53] once you're finished with test mode, you don't need to be on the console
[14:25:03] ok
[14:25:16] I'll try that now (if you're clear) just to convince myself :)
[14:26:01] i'm clear now
[14:26:07] ok!
[14:31:49] kormat: ok, I'm at the [!!] Partition disks dialog again (which is not ideal but tolerable if there's a work-around); you said you did 'go back' from there?
[14:32:55] * andrewbogott tries it
[14:33:14] oh, wait, there's a 'finish partitioning' option way at the bottom here
[14:33:17] so this is working properly
[14:34:33] yep - 'finish partitioning' is what you want (so long as the current state looks correct to you)
[14:34:47] and yeah, the partman ui is not the, uh, greatest
[14:34:59] since when does partman have a UI :-P
[14:36:35] partman has a UI in the same way a rock avalanche has one
[14:37:31] hah!
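Pieced together from the exchange above, here is a rough shell sketch of the partman recipe testing loop that was worked out in this conversation. The hostnames and helper commands (run-puppet-agent, install_console, the iLO 'vsp' console) are simply the ones mentioned above and are meant as an illustration of the flow, not a verified runbook.

```sh
# Sketch only; assumes the helpers named in the conversation exist as used here.

# 1. After the partman recipe change is merged, force puppet runs on the hosts the
#    installer actually loads recipes from (the apt servers, not install*):
ssh apt1001.wikimedia.org 'sudo run-puppet-agent'
ssh apt2001.wikimedia.org 'sudo run-puppet-agent'

# 2. Start the reimage, then watch debian-installer from a cumin host
#    (install_console does not work from puppetmaster1001 for these hosts):
ssh cumin1001.eqiad.wmnet
sudo install_console cloudvirt1014.eqiad.wmnet

# 3. For the HP iLO serial console, the old key exchange has to be allowed
#    explicitly, then 'vsp' attaches the virtual serial port:
ssh -o KexAlgorithms=+diffie-hellman-group14-sha1 <ilo-hostname>   # placeholder address
# hpiLO-> vsp
```

As kormat notes, being attached to the console from the start mainly matters while testing with reuse-parts-test.cfg; at the [!!] Partition disks screen, 'finish partitioning' is the option to choose once the layout looks right.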
[14:40:51] FYI I'm going to resolve the labweb1002 ack'd VO alert from yesterday, investigating in https://phabricator.wikimedia.org/T264016
[14:42:38] thx godog
[14:43:00] ah, that paged, I was wondering what was up
[14:44:24] andrewbogott: np, I see there are a few other ack'd alerts for cloudvirt, assuming you are aware (?)
[14:44:39] I'm currently building out restbase10(2[89]|30) as buster hosts. I've downtimed them for a while, but if anyone notices anything up with them, it's a) my doing and b) nothing that will interfere with the restbase cluster
[14:45:00] godog: yeah, rebuilding hosts
[14:45:42] ack
[14:57:34] kormat: I think it worked! Thank you for your help; I'll run another test later on today when I have another host to reimage.
[15:08:49] andrewbogott: ok great :)
[15:59:47] this is p. interesting https://qaul.net/
[16:01:13] * volans reads it as PiedPiper
[16:14:53] heh
[17:10:04] FYI: I cannot find any indication that the watchmouse alerts are indicating real issues on our side. Emails from watchmouse to noc@ temporarily disabled to cut down on the noise. https://phabricator.wikimedia.org/T264111
[17:13:39] moritzm: as per tradition, I am trying to edit something in the pwstore and can't. Warning: the following recipients are invalid: EC3D2B2DAC6964AFC7134FB69B91773F42CB71E3
[17:13:50] Is there something I'm forgetting? I did gpg --refresh-keys already...
[17:14:38] ^ that same question to anyone else who wants to think about gpg
[17:21:49] andrewbogott: i normally just try a whole bunch of keyservers now, e.g.
[17:21:52] for i in pgp.ocf.berkeley.edu pool.sks-keyservers.net eu.pool.sks-keyservers.net ipv4.pool.sks-keyservers.net pgp.mit.edu keyserver.ubuntu.com keys.gnupg.net ; do gpg --keyserver ${i} --recv-keys EC3D2B2DAC6964AFC7134FB69B91773F42CB71E3 ; done
[17:22:27] that said, I'm still seeing an expiry of 2020-09-02 for klausman's key.
[17:22:53] fwiw keyserver.ubuntu.com is a fairly reliable one
[17:23:11] and I believe not part of the SKS network, which seems more of a pro than a con in 2020
[17:25:57] I'm running jbond42's command, but of course it's hanging on the missing servers… will take a while
[17:28:27] I'll work on https://phabricator.wikimedia.org/T262393
[17:28:57] in the next weeks
[17:29:45] otherwise, simply drop users with expired keys
[17:29:52] are those the ALERT! Wiki commons (s4) - UNCACHED: Bad Request emails?
[17:29:57] shdubsh:
[17:30:07] it's a PGP misfeature without any technical merit anyway
[17:30:21] apergos: yes, those ones
[17:30:21] huh, now --refresh-keys gets me "gpg: signature packet: hashed data too long
[17:30:22] gpg: read_block: read error: Invalid packet"
[17:31:14] ok, I just noticed now that we have quite a pile of them, but if they don't indicate anything real (and I don't see how they do, no reports from any users)... good... I guess?
[17:31:20] so I guess this password will live only on my laptop :(
[17:33:56] andrewbogott: I'm afk in a bit, but mutante can kick the people with expired keys from the .users file as well, then you can add your new secret
[17:40:13] I refreshed keys earlier today and was able to see a password, but I didn't try to change the contents
[17:40:22] ...in pwstore that is
[17:44:04] btw jbond42, my usual recipe is for K in $(dig +short pool.sks-keyservers.net) ; gpg --keyserver $K --recv-keys ...
[17:47:21] jbond42: updated expiry and sent to the ubuntu ks, and sks
[19:54:01] does MediaWiki 1.35 mostly run on Stretch or is it mostly Buster only?
[19:54:16] (I'm wondering if I should bother to wrestle php7.3 onto wikitech-static)
[20:08:39] It should be ok other than PHP 7.3...
[20:14:56] I installed php 7.3 and things are just getting worse and worse
[20:14:59] mostly because composer won't run
[20:15:14] is it possible the composer.phar in 1_35 is out of date?
[20:15:34] https://www.irccloud.com/pastebin/w8ppzH4d/
[20:16:00] thanks cdanis, klausman. andrewbogott see above, you should be able to fetch klausman's key now
[20:18:14] andrewbogott: https://github.com/composer/composer/issues/7802
[20:18:25] Apparently a newer version of composer might fix it...
[20:18:52] I thought I was using a composer version that's baked into the mw repo?
[20:19:06] isn't that what composer.phar is?
[20:19:37] there's not a version bundled with MW
[20:20:28] we're on stretch for everything in production at any rate
[20:20:55] problem is REL1_35 needs newer PHP than master/wmf prod
[20:22:40] Reedy: thanks; I downloaded a fresh composer.phar and things seem mildly better
[20:23:39] jbond42: I'm getting the same results as before, Warning: the following recipients are invalid: EC3D2B2DAC6964AFC7134FB69B91773F42CB71E3.
[20:23:43] maybe I'm missing a step
[20:25:25] oh great, wikitech-static just shows a 404 on the front page.
[20:25:36] Composer seems to have worked but I wouldn't know it from the page load
[20:26:04] andrewbogott: assuming you're running pws from the directory in which you have the pwstore repo checked out, did you also pass `--keyring .keyring` to gpg?
[20:26:10] (when you did the --recv-key)
[20:27:07] oh, you probably invoked `pws update-keyring` and that should have dtrt (assuming you ran it in the pwstore directory)
[20:27:39] um...
[20:27:48] most of those commands sound like things I've never done before despite having used this before
[20:27:58] but, ok… if I move into the pwstore dir and run pws update-keyring I get
[20:28:07] https://www.irccloud.com/pastebin/KZBg6dhn/
[20:28:31] not the pws checkout, the pwstore checkout :)
[20:28:43] (pws is the tool, pwstore is our repo with the data)
[20:28:48] .keyring and .users are files that live in the latter
[20:29:02] .users is part of the repo; .keyring is maintained locally
[20:29:02] when you say 'pwstore' you're talking about the 'pw' directory, yes?
[20:29:07] no
[20:29:22] ok, hang on...
[20:32:44] well now it's failing on a different recipient, is that progress?
[20:34:16] pwstore is the most secure tool because it always requires 4+ engineers to access
[20:34:49] that is progress, yeah
[20:34:54] :(
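For reference, a minimal sketch of the key-refresh step cdanis walks andrewbogott through above, assuming it is run from inside the pwstore checkout (the repository holding .users and the locally maintained .keyring); the keyserver and fingerprint are the ones from the conversation, and the path is illustrative.

```sh
# Run from the pwstore data checkout, not the pws tool checkout.
cd ~/pwstore    # illustrative path; use wherever the repo is cloned

# Let the tool pull every key listed in .users into the local .keyring:
pws update-keyring

# Or fetch a single key by fingerprint straight into the repo keyring:
gpg --no-default-keyring --keyring ./.keyring \
    --keyserver keyserver.ubuntu.com \
    --recv-keys EC3D2B2DAC6964AFC7134FB69B91773F42CB71E3
```

If a listed recipient's key is expired, pws will still flag that recipient as invalid, which is why moritzm's suggestion of dropping users with expired keys from .users is the other half of the fix.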
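And for the composer.phar problem earlier in the evening: MediaWiki does not bundle composer (as Reedy points out), so a stale, separately downloaded phar can simply predate PHP 7.3 support. One possible way to refresh it is sketched below; the paths are illustrative, and the downloaded phar should be verified against the checksums published on getcomposer.org.

```sh
cd /srv/mediawiki                  # wherever the REL1_35 checkout lives
php composer.phar self-update      # works only if the old phar still runs at all
# otherwise replace it with a current build:
curl -L -o composer.phar https://getcomposer.org/download/latest-stable/composer.phar
php composer.phar update --no-dev  # install MediaWiki's runtime dependencies
```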