[07:27:22] <_joe_> dancy: I guess no one was around at that time...
[08:27:43] moritzm: hey. i've updated my gpg key in the pwstore repo. the wiki page says that you/daniel should add me to `.users` and re-sign it. is that still current? `.users` looks a bit weird, with multiple levels of gpg signatures
[08:31:10] if you simply extended your key validity (IOW the key fingerprint is the same), no update to .users is needed; that only applies to updated/new keys (like changing key type or key size or so)
[08:32:07] I'm updating the docs on Office Wiki
[08:33:44] moritzm: i was removed from .users last week while i was out
[08:35:09] ah, I see
[08:35:49] I'm re-adding you (which will also fix the currently broken .users file; these nested sigs are broken)
[08:35:56] great, thanks :)
[08:36:30] plus the last upload won't show in people's pws anyway, since only two keys are in .pws-trusted-users
[08:39:16] I'm trying to use the current tcpircbot (and related) to send SAL messages for cloud (#wikimedia-cloud channel). From what I'm seeing looking around, it's hardcoded and restricted to #wikimedia-operations, is that correct? Is there a way to also add the cloud channel and tell it to log some of the messages there instead?
[08:40:17] kormat: I've re-added you now
[08:40:23] also, is there a way to record SAL entries that is not through IRC? (http/cli...)
[08:42:12] dcaro: there are two different `dologmsg` cli tools, one is on some prod hosts and relays to -operations, and another is on some toolforge hosts and relays to -cloud, maybe look at them?
[08:42:42] thanks! I'll take a look :), did not know about `dologmsg`
[08:44:47] that looks promising 👍
[09:12:38] jbond42, my bad, I found a typo in the pki backup patch, sending a fix
[09:13:15] jynus: ack thanks
[09:15:58] dcaro: what's the use case of logging to SAL but skipping IRC? AFAIK SAL is supposed to be a log of the IRC !log actions, which are made on IRC so that anyone can see that those actions happened. Nobody keeps an eye on SAL permanently, while we all do on IRC. We mostly look at SAL only when we have a timeline of some event and are looking for potentially related actions.
[09:16:50] if you need one through IRC you can use https://doc.wikimedia.org/wmflib/master/api/wmflib.irc.html#wmflib.irc.SALSocketHandler (although it needs a ferm rule to open the host from which you want to use it, if not already authorized)
[09:19:33] volans: to keep a record for later review/debugging/audit even if IRC is not reachable
[09:20:30] (especially given the non-persistent nature of IRC)
[09:21:48] a backfill capability would be nice; it has happened in the past that the bot was down and it was only realized a few !log messages later
[09:23:32] I guess one of the various libraries that allow writing to wikis might be handy for this.
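A minimal sketch of what using the SALSocketHandler mentioned at 09:16:50 might look like, assuming the (host, port, username) constructor described in the linked wmflib docs; the tcpircbot host and port below are placeholders, not production values:

```python
import logging

from wmflib.irc import SALSocketHandler

# Placeholder endpoint for the tcpircbot relay; the real host/port (and the
# ferm rule authorizing this client) come from the production configuration.
TCPIRCBOT_HOST = "tcpircbot.example.wmnet"
TCPIRCBOT_PORT = 9200

# Use a dedicated logger so regular application logging doesn't end up in SAL.
sal_logger = logging.getLogger("sal")
sal_logger.setLevel(logging.INFO)
sal_logger.addHandler(SALSocketHandler(TCPIRCBOT_HOST, TCPIRCBOT_PORT, "dcaro"))

# Each record emitted here is relayed to IRC as a !log message,
# which the SAL bot then records.
sal_logger.info("example: restarting some-service on some-host for T000000")
```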
[09:28:34] jbond42: over the weekend I tried to create a new deployment-prep instance and it failed to provision since it didn't have access to the cfssl secret in the private repo, so we need a solution for that before adding profile::pki::client to all cloud instances
[09:29:36] I'd suggest turning the dependency around and making SAL independent of IRC in the first place, having a "connector" that reads IRC and translates to SAL, but with the core SAL being isolated (then even if IRC is down/unreachable you can still log events, and once IRC is back up the connector can replay if needed)
[09:31:52] Majavah: the secret is in /etc/puppet/secret/hieradata/common.yaml on deployment-puppetmaster04 so it should be available to all deployment-prep nodes. do you still have the host so i can take a look?
[09:32:59] jbond42: the problem is the cloud vps provisioning flow, where instances first use the cloud-wide puppetmasters when they are created and then get the correct puppetmaster from hiera
[09:34:31] Majavah: hmm, i'll have to think on that; i should be able to recreate the issue in my own project but will ping you if not
[09:34:39] that initial puppet run fails, which also causes most logins to that instance to fail since it doesn't get to add the access rules
[09:34:52] ack
[09:34:55] that puppetmaster dance is kinda annoying, yes
[09:36:39] (I'm not able to look properly until about 12 UTC, so if you need my help it might have to wait until then)
[09:37:06] Majavah: don't worry about it, i think i get the issue and should be able to repro
[09:40:16] if we're reconsidering SAL maybe it's worth reconsidering the whole thing from scratch. The current setup isn't ideal IMHO for various reasons, among which the fact that SAL depends on a good chunk of the infra (and ideally shouldn't) and is quite slow (~4s for the page itself)
[09:42:04] +1
[09:49:12] +1 for that (though I don't know how SAL is currently set up xd, just that there seems to be a dependency on IRC; I'd suggest something more gradual than a full-on rewrite unless it's either really broken or really small, something like https://martinfowler.com/bliki/StranglerFigApplication.html)
[09:52:49] the TL;DR is that, like everything else around here, historically, "it's a wiki"! (at the beginning everything was a wiki; openstack VPS were managed by editing wikitech, for instance ;) )
[09:53:48] stashbot is running in toolforge k8s and feeds it to most places; sal.toolforge.org uses some elasticsearch somewhere
[09:53:48] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
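The 09:29:36 suggestion (a core SAL that stays durable on its own, with a connector that replays entries to IRC once it is reachable again) could look roughly like the sketch below. This is purely illustrative: the journal path, wire format, and relay port are made up, not an existing tool.

```python
import json
import socket
import time
from pathlib import Path

# Hypothetical local journal: entries stay on disk even when IRC or the
# relay bot is unreachable, so nothing is lost during an outage.
JOURNAL = Path("/var/lib/sal/journal.ndjson")


def log_event(message: str) -> None:
    """Append the entry to the local journal first; relaying happens later."""
    JOURNAL.parent.mkdir(parents=True, exist_ok=True)
    entry = {"ts": time.time(), "msg": message, "relayed": False}
    with JOURNAL.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")


def replay_to_irc(host: str, port: int) -> None:
    """Connector: push any not-yet-relayed entries to the IRC relay bot."""
    entries = [json.loads(line) for line in JOURNAL.read_text().splitlines()]
    with socket.create_connection((host, port), timeout=5) as sock:
        for entry in entries:
            if not entry["relayed"]:
                sock.sendall(f"!log {entry['msg']}\n".encode())
                entry["relayed"] = True
    # Rewrite the journal with the updated relay flags.
    JOURNAL.write_text("".join(json.dumps(e) + "\n" for e in entries))
```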
[10:15:37] <_joe_> so, I'm in quite some disagreement with most of the above
[10:15:44] <_joe_> sorry for chiming in late, but
[10:16:02] <_joe_> 1) SAL is not for emergencies, so I don't see why it should be decoupled from our infra
[10:17:43] <_joe_> 2) SAL is tied to IRC right now, and that's not needed and somewhat convoluted for anything that is not a human, so I second the idea of having an independent API we can use within production for automatic messages, but I don't think it makes much sense for human interactions outside of IRC
[10:17:59] <_joe_> (IRC/any other IM platform)
[10:19:26] <_joe_> so I think the sensible course of action would be to kill tcpircbot (the thing we use to communicate with SAL now) and add a different interface to stashbot that would specifically be internal-only (not sure if that can be done with stashbot hosted in toolforge)
[10:19:54] <_joe_> and we only use sal.toolforge.org and not the wiki (which is what makes it slow)
[10:20:52] <_joe_> we can even rewrite the bot for production uses and host it outside of our infra, but I don't see many advantages, and I do see the disadvantage of not being able to have a private interface for tools
[10:31:52] _joe_: I disagree that SAL is not for emergencies. When there is an emergency, most of the time we look whether there is a SAL entry that aligns with the issue, be that a deployment or an infrastructure change (confctl/dbctl action, cookbook run, etc...). Some of us probably look at the IRC backlog, some at SAL. If we're assuming that IRC is not reliable and invert the logic, then we should start
[10:31:58] looking at its source of truth (wikitech?)
[10:32:29] <_joe_> so you mean you want to have the ability to consult SAL in a situation where all of production is down, more or less
[10:32:59] <_joe_> making it eqiad-only is surely not great, on that I fully agree
[10:46:53] if what makes it slow is the wiki, maybe that can be added as an async job, copying from wmcs?
[11:12:51] <_joe_> that's adding even more complexity
[12:29:06] Majavah: fyi i have added a fake auth key to the private repo which seems to be enough to get things to build and install correctly
[13:54:39] I might be misunderstanding (and misusing?) SAL. I understand its main purpose to be keeping a record of any events happening around, in order to allow quick discovery of causes in the event of an incident. As such I think that being extremely resilient and independent from as many services as possible should be a high-priority goal for it? If there's another system already satisfying that
[13:54:41] need, can you point me to it? Because that's the one I wanted to use from the beginning xd
[15:59:37] Anyone around to rubber-stamp this quick `conftool-data` patch? https://gerrit.wikimedia.org/r/c/operations/puppet/+/684435 The 2 lines changed in this patch should have originally been in https://gerrit.wikimedia.org/r/c/operations/puppet/+/683679/
[16:07:23] ryankemper: might just be me being over-conservative, but I'm not 100% sure how the tooling handles a server name moving from one cluster to another + leaving a subset of services, all in one go.
[16:07:33] I mean, maybe it's fine, but it's not a common-enough scenario that I'd know off-hand
[16:08:27] the "safer" route might be to remove it and re-add it in two separate commits and submit->puppet-merge them separately in sequence, but again maybe I'm overparanoid and that's a bunch of pointless excess work :)
[16:12:56] ryankemper: looks good
[16:14:33] bblack: there is only one way to find out
[16:15:15] well there's one quick way, and one tedious way (digging in the docs and/or code) :)
[16:15:40] * bblack has trust issues with technology!
[16:19:30] bblack: re the "quick way" - do you think it will be immediately obvious if there's a problem, or could it manifest in a way that's nebulous (I know this is all theoretical)?
[16:20:13] as far as the underlying host goes, the blazegraph/wdqs service is identical between external/internal, so as long as the requests still route properly I'm struggling to come up with any actual user impact
[16:22:47] ryankemper: I assume it would be "obvious" in that the scripts that modify the relevant etcd data would misbehave/fail during puppet-merge
[16:24:27] (if it were a problem at all)
[16:24:29] bblack: good enough for me :) will report back in a sec just to satiate our mutual curiosity
[16:25:13] <_joe_> ryankemper: remember to set the weight to non-zero for the new entries
[16:25:17] <_joe_> and to pool them :)
[16:25:24] the output of: confctl select "name=wdqs2004.codfw.wmnet" get
[16:25:34] will confirm the state of affairs for it before/after, too
[16:25:37] update: no problems with merging; if the log output is to be trusted it looks like it creates the new node `codfw/wdqs/wdqs/wdqs2004.codfw.wmnet` and then removes the stale `codfw/wdqs-internal/wdqs/wdqs2004.codfw.wmnet`
[16:25:51] ok, awesome :)
[16:26:11] _joe_: thanks, that'd be easy to overlook
[16:26:42] hmmm
[16:26:51] bblack@cumin1001:~$ confctl select "name=wdqs2004.codfw.wmnet" get
[16:26:51] {"wdqs2004.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs"}
[16:26:54] {"wdqs2004.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"}
[16:26:56] <_joe_> yes
[16:26:57] {"wdqs2004.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-ssl"}
[16:27:00] ^ should those other two services be gone?
[16:27:01] <_joe_> bblack: read above
[16:27:16] <_joe_> he actually added them AIUI
[16:27:28] oh right
[16:27:38] I must be red/green colorblind reading diffs today or something
[16:31:54] hey all, can someone merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/684034 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/684088 please? both beta-only changes
[16:59:33] jynus: just to verify before I do it, if I want to restore a single file from a backup, https://wikitech.wikimedia.org/wiki/Bacula#Restore_(aka_Panic_mode) is the process to follow? (not an emergency, just being able to look at the old file will help with debugging)
[17:01:52] that's indeed the method
[17:02:03] legoktm: yes, bconsole -> restore -> select client -> get some virtual file system -> cd and mark the file -> select target host and dir... wait a bit... check target host
[17:02:05] let me know if you want me to double check before you press "confirm"
[17:02:25] I was planning to go over the procedure at the next SRE meeting
[17:04:48] yes please, I'll ping you once I'm there
[17:05:25] you will arrive at a screen that says: Restore?
yes/modify/cancel, paste that somewhere private
[17:05:38] I think the client is very confusing the first time
[17:05:56] later it just takes a few seconds, that is why I want all SREs to test it once
[17:06:30] * legoktm nods
[17:06:56] much better to learn this when it's not an emergency :)
[17:07:00] currently at the "Building directory tree" stage
[17:07:04] exactly
[17:07:16] yeah, that can take a while if there are a lot of files and several incrementals
[17:08:04] there are 6 million files...
[17:08:29] and it is possible the full backup for it hasn't run yet, so it may have to backtrack a month of changes
[17:10:14] thankfully it seems that was 2 days ago, so only 2 incrementals
[17:10:22] ah, ok, then
[17:13:31] jynus: https://phabricator.wikimedia.org/P15672 (only viewable to SRE team)
[17:14:09] yeah, that looks fine, it will restore to /var/tmp/bacula-restores but with the full path
[17:14:12] so it will be on:
[17:14:33] /var/tmp/bacula-restores/var/lib/mailman/lists/contact-nl/config.pck
[17:14:43] ack, on lists1001, right?
[17:14:52] yes, the original client by default
[17:15:18] ok, I said "yes"
[17:15:24] > You have messages.
[17:15:27] it can be done to another client, but that needs some extra steps for decryption
[17:15:40] 330270 Restore 1 9.379 K OK 03-May-21 17:15 RestoreFiles
[17:15:43] (done)
[17:16:07] what to restore >>>> actual restore process
[17:16:38] perfect, I see the file
[17:17:08] I tried running "messages" like the wiki page suggested and it dumped a ton of output at me, stuff about Gerrit backups, etc.
[17:17:22] nah, that's the log
[17:17:24] ignore that
[17:17:33] ok
[17:17:34] it is stored in a file
[17:17:58] you can check it at the end of that if you want, but it's not necessary
[17:18:02] maybe that shouldn't be recommended then, if there's a better way to see progress
[17:18:15] but thanks for the help! :)
[17:18:17] yeah, I do "status dir"
[17:18:25] and I saw the successfully run job there
[17:18:42] which is what I pasted
[17:18:53] thanks for trying our service :-D
[17:20:21] (remember to vote bacula as "5 stars" in the next SRE survey!)
[17:20:26] :-D
[17:22:05] jynus: hehe sounds like https://xkcd.com/937/ ;)
[17:37:09] apergos: Reedy: Is this no longer a concern? Or should we still fix the externalstore logging level? https://phabricator.wikimedia.org/T281048
[17:37:57] I'm off today, so I'll only wade into the fray if no one else knows
[17:38:03] (and off tomorrow too)
[17:51:54] the "[...] mailing list migration complete" emails aren't supposed to go to root@, it means the list didn't import properly, we're fixing that now
[17:54:33] thanks, I mistakenly thought I was admin on some lists just because I created them a long time ago, but that wasn't it then
[17:56:51] if the list has no owners (either it has no owners or the import failed), it seems like MM or exim or something else will send it to root@ instead
[17:58:11] ah, like it falls back to a default, kind of makes sense, nod
[19:11:37] "CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 0, Spare: 0"
[19:11:56] ^ if 7 out of 7 are working and 0 failed, then where is the CRIT part?
:p
[19:12:45] I love confusing icinga alerts
[19:13:44] maybe this is what happens while it is rebuilding a _previously_ broken raid
[19:20:02] an icinga alert scripted around a hardware RAID controller's status output as produced by a vendor-provided commandline utility is basically the leakiest abstraction imaginable
[19:42:25] I could use some help figuring out why `wdqs1004` is failing to re-image (background context: we're switching from raid10 to raid0)
[19:42:46] During the re-image it waits for the reboot to complete until hitting the 60m timeout
[19:43:05] I ssh'd into the install console and poked around the syslog and I found some relevant-seeming log lines: https://phabricator.wikimedia.org/T280382#7048070
[19:45:23] As a sanity check this is /dev/disk; I believe this confirms this host has 4 disks as expected https://www.irccloud.com/pastebin/jVzOdMn1/ls%20-lah%20%2Fdev%2Fdisk%2Fby-id
[19:45:51] Per the phab comment I linked, the corresponding `partman` line is `wdqs*) echo partman/standard.cfg partman/raid0.cfg partman/raid0-4dev.cfg ;; \`
[19:46:30] the "wait for reboot until hitting 60m timeout" seems a bit familiar but would be unrelated to the partman issue
[19:47:08] I remember having servers that would do that, but if you 'manually' tell them to reboot then the rest works fine
[19:47:40] yeah actually don't quote me on it timing out like that, I'm 90% sure it did but not sure if I saved the log output (my assumption though was that it would have to re-partition before coming back up, thus why it would look like a failure to reboot)
[19:47:41] if it's waiting for reboot and not failing during install then it's 2 separate issues, I believe
[19:48:00] let's assume it's failing during install
[19:48:13] if partman fails then you should normally see a Debian installer menu when connecting to the console
[19:48:16] I just googled the relevant part of the error message and found this: https://blog.icod.de/2019/10/10/caution-kernel-5-3-4-and-raid0-default_layout/ so it seems like the disks might be different sizes possibly?
[19:48:36] ah, now that's different, ack
[19:48:45] mutante: Doing a `ryankemper@puppetmaster1001:~$ sudo install_console wdqs1004.eqiad.wmnet` drops me into `busybox` to be specific
[19:49:22] or do you mean if I ssh into `wdqs1004.mgmt.eqiad.wmnet` that I should see the debian installer menu?
[19:50:04] in some cases you would get the debian installer just sitting at the partitioning step, yea
[19:50:26] that would not be install_console
[19:50:41] but ssh to mgmt and racadm
[19:50:53] yea, the latter
[19:51:51] the "disks are different sizes" theory sounds plausible since this host just got hardware replacement, right?
[19:53:30] mutante: this is a different host than the `wdqs2007` we were talking about
[19:53:55] oh, ok
[19:55:16] ryankemper: I notice wdqs1003 and wdqs1004 are from separate procurement tickets
[19:55:42] is that what you are doing? reimaging all and counting up, and 1003 worked but 1004 now fails?
[19:56:52] mutante: not quite counting up; our fleet was roughly split between hosts with the new journals and without, so the actual algorithm is "count up but prioritize hosts with old journals first"
[19:57:08] But to answer the question-behind-the-question: yes, the re-image has worked for other eqiad wdqs hosts, so this host seems to be acting up
[19:57:52] ok, so, what I was going to suggest was to look up the hosts in netbox and compare the procurement ticket numbers and see if this is the first one from this batch
[19:57:55] `wdqs1006.eqiad.wmnet` and `wdqs1013.eqiad.wmnet` are the 2 eqiad wdqs hosts that have been successfully reimaged thus far
[19:58:11] and then you can track down the actual PDFs there that tell you which disks exactly are in them
[19:58:34] or you can find that Dell lookup form and use the Dell service tag
[19:59:48] looks like this ticket is just for `wdqs1004` and `wdqs1005` and I haven't reimaged `wdqs1005` yet
[19:59:54] looking for the pdf now
[20:00:47] meanwhile, looking at the partman regex, so:
[20:00:50] https://phabricator.wikimedia.org/F8803168
[20:00:55] wdqs101[1-3]|wdqs200[7-8]) echo partman/standard.cfg partman/raid10-8dev.cfg ;; \
[20:01:02] wdqs*) echo partman/standard.cfg partman/raid0.cfg partman/raid0-4dev.cfg ;; \
[20:01:23] so 1013 and 1006 were both successful
[20:01:29] though using different recipes
[20:01:42] this should tell us both recipes are fine and this host is the special case, right?
[20:01:47] agreed
[20:02:59] The packing slip is kinda confusing btw because it just says 800GB Solid State Drive SATA S3520
[20:03:03] ok, 800 GB SATA disk.
[20:03:55] hmm.. yea, maybe Rob can help at this part.
[20:05:52] would try asking in -dcops since you have the 10Gig chat there already
[20:06:12] will do
[20:07:15] (thanks)
[20:07:37] np! good luck, will be back later
[20:31:46] ^ okay, circling back on the above: I sanity-checked that the number of disks is what we expect (4), but one interesting difference between `wdqs100[4,5]` and other wdqs hosts that have successfully reimaged is that the successful hosts have `4 x 960GB` whereas `wdqs100[4,5]` have `2 x 800GB && 2 x 960GB`
[20:32:46] sending out the bat signal to any RAID gurus... is anyone aware of a constraint that would require all 4 disks to be equal size for `raid0-4dev` in our partman setup for sw raid?
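According to the blog post linked at 19:48:16, kernels from 5.3.4 onward refuse to assemble a software raid0 array whose members differ in size (a "multi-zone" layout) unless raid0.default_layout is set explicitly, which would match the `2 x 800GB && 2 x 960GB` mix found above. A quick, hedged sanity-check sketch for spotting mixed member sizes before attempting the raid0 recipe; the device names are placeholders for the four members used by a layout like `raid0-4dev`:

```python
#!/usr/bin/env python3
"""Check that prospective software-RAID0 members are all the same size."""
from pathlib import Path

# Hypothetical member list for a 4-device software-RAID layout.
MEMBERS = ["sda", "sdb", "sdc", "sdd"]


def size_bytes(dev: str) -> int:
    # /sys/class/block/<dev>/size is the device size in 512-byte sectors.
    return int(Path(f"/sys/class/block/{dev}/size").read_text()) * 512

sizes = {dev: size_bytes(dev) for dev in MEMBERS}
for dev, size in sizes.items():
    print(f"{dev}: {size / 10**9:.0f} GB")

if len(set(sizes.values())) > 1:
    print("WARNING: mixed member sizes -> multi-zone raid0; per the linked "
          "post, kernels >= 5.3.4 won't assemble it without an explicit "
          "raid0.default_layout setting")
```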