[08:06:54] does anybody see any update in the Zayo ticket for the codfw <-> ulsfo link? They said that they have updated it in the email that I received to noc@, but I don't see where
[08:07:19] (namely I didn't receive any notification via email of the ticket being updated)
[08:10:36] i get the impression that zayo only provides opportunistic connectivity
[08:22:46] I don't recall if we have passwords stored somewhere to access the zayo portal
[08:26:34] mornin'. I have an apache config change that I'd like to deploy (adding a new wiki), but is that something to be avoided on a friday?
[08:32:22] hnowlan: hello! So if the change has already been reviewed by others and it is safe etc.. in theory it could be deployed even today, but in practice monday would be better if not urgent (my 2c :)
[08:40:24] elukey: makes sense! it's not super urgent, just a nice to have. probably worth waiting for my first one
[08:41:20] yeah, I'd wait until Monday. if people try to use the fresh, new wiki over the weekend and no one is around to fix potential issues, that's also not ideal
[08:51:11] aye good point
[12:30:44] elukey: the plot thickens -- did you see the message from jbartig@internet2.edu to exchange-discussion@lists.equinix.com ?
[12:31:03] I am having some more coffee and then was going to reply
[12:37:38] ah wait
[12:38:11] the link to Equinix SV8 is back up
[12:38:30] Zayo still hasn't investigated lol
[12:38:58] however, as of 20:41 UTC, OGYX/124337 between codfw/ulsfo is down
[12:40:56] (yes, for 16 hours now, and there's been no ticket opened)
[12:47:51] cdanis: o/
[12:48:02] I wasn't aware of that mailing list :)
[12:48:22] I wasn't either!
[12:49:10] I'm guessing noc@ is subscribed to it
[12:49:56] also, I've contacted Zayo re: OGYX/124337
[12:50:03] ahhh cdanis I just realized that this morning I had the wrong circuit id, your email is not about 337
[12:50:09] but still Zayo-based
[12:50:14] it fooled me, sorry
[12:50:18] ah yeah, the link to SV8 seems to have fixed itself
[12:50:37] I wanted to ask about the transport codfw-ulsfo
[12:50:40] uff sigh
[12:50:41] yesterday I emailed about OGYX/274646
[12:52:11] yes yes but this morning I didn't realize it
[12:52:18] (my morning)
[12:52:54] I saw the zayo transport down and your email about a zayo link down, and it didn't occur to me that it was not the same one
[13:59:29] --
[14:01:21] I have just used the reuse-parts.cfg partman recipe created by kormat to preserve the /srv partition while reimaging a kafka node, all perfect, really great job. It should be really useful when we need to upgrade kafka-main to buster for sure (without dropping data and waiting for other replicas to stream it back etc..)
[14:01:26] Cc: herron: --^
[14:02:17] (this kudos friday session is brought to you by the Analytics team)
[14:03:06] 😊
[14:03:07] ah, v nice!
[14:04:39] 🎉
[14:13:33] hey y'all, it's been a while since I had to think about adding new shell users. reading https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Access_requests it doesn't look like it is required to have SRE meeting approval, right?
[14:13:43] until we get him sudo stuff, as long as the manager approves we can add him?
[14:13:49] is that right?
[14:14:05] re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/622878/
[14:14:51] yeah, that's fine without meeting approval
[14:20:24] ottomata: in this case there are sudo permissions involved, do we need to wait ?
[14:20:32] there are sudo permissions?
[14:20:39] oh for analytics-admins ?
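The exchange here is about whether the new user should get plain analytics-privatedata-users membership or the sudo-capable analytics-admins group. As a minimal sketch (not a documented procedure), this is how one might verify what a merged access patch actually grants on a host; the username `razzi` and the group names are taken from the conversation, and the commands assume a standard Linux host with the groups provisioned.

```
# Confirm the account exists and which groups it ended up in
# (group names taken from the conversation above).
id razzi
getent group analytics-privatedata-users
getent group analytics-admins

# List the sudo rules that apply to the user; plain
# analytics-privatedata-users membership should grant none, while
# analytics-admins membership would show the hdfs/analytics rules
# discussed above. Needs root to query another user.
sudo -l -U razzi
```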
[14:20:45] (the analytics admins can sudo on a ton of hosts for hdfs/analytics)
[14:20:47] yeah
[14:20:51] ah
[14:21:00] ok, then let's change the patch to do just analytics-privatedata-users for now
[14:21:00] ya?
[14:21:05] we can make a second request for -admins
[14:22:37] ottomata: alternatively (since there will be no sre meeting next week) we can send an email to SRE explaining the use case and get approval, seems doable (so we can also have -admins etc..)
[14:23:08] but I am fine either way of course :)
[14:26:21] yeah, was just kinda hoping to merge it today so he could log in
[14:26:48] logging in is overrated. logging out for the weekend is where it's at
[14:28:10] ottomata: today might be too soon, we'd need to wait some time to allow people to read etc.. then I think your idea of privatedata users only is better, then we'll send an email for admins (so today Razzi will be able to work on hosts)
[14:28:35] (too many 'then' but you got what I meant :)
[14:29:05] oh ok, so you think doing just privatedata-users today is fine, email for -admins and do that next week
[14:29:06] ya?
[14:29:26] yes exactly
[14:34:57] ottomata: elukey: moritzm: I thought we had decided that adding to groups (even new roots) was okay without SRE meeting approval?
[14:35:39] cdanis: i don't know what was decided so you tell me! :) it isn't clear from the docs, although there is no mention of approval from SRE on those docs at all
[14:36:17] so, this was a while ago now (I think two SF allhands ago), and for various reasons I think the docs never got updated
[14:36:52] but IIRC we decided that, at least for new staff, as long as they had 'manager' and 'group owner' approval, we weren't going to block on the meeting anymore
[14:37:21] ah the template create form says
[14:37:23] [] - non-sudo requests: 3 business day wait must pass with no objections being noted on the task
[14:37:49] is that true?
[14:38:15] and, why the distinction between non-sudo and sudo, if the process is not different between them?
[14:38:18] cdanis: ottomata: that's pretty much my understanding, i.e. if a group exists then we seek approval from the user's manager and in some cases the group owner. specifically nuria or the analytics team for the privdata group
[14:38:19] cdanis: I wasn't aware of that
[14:38:20] phab template not updated?
[14:38:21] I believe the template form is one of the things that was not updated after this decision
[14:38:31] ok
[14:38:40] is the 3 day wait no longer a thing?
[14:38:54] jbond42: yes that's what I got too, but not in case of sudo requests no?
[14:39:02] personally I don't see a reason for it with new staff, since the point of the 3-day waiting period was to give others a chance to review
[14:39:05] cdanis: yeah, hence my "yeah, that's fine without meeting approval" above :-)
[14:39:09] but I don't remember what we decided
[14:39:26] ok, sounds like I can just merge this WITH analytics-admins access then?
[14:40:08] I would +1 it
[14:40:09] elukey: if the group exists and the sudo permissions are already there then i think what i said above is fine. i think it becomes unclear if we are changing the sudo permissions on a pre-existing group
[14:40:14] ok
[14:40:15] great
[14:40:31] yes, the thing that I believe still requires SRE meeting approval was adding new sudoers lines to a group
[14:40:35] ack
[14:40:38] got it
[14:40:48] jbond42: I am very confused, but as always I trust you and Moritz :) Maybe let's update the docs next week?
[14:41:20] ack, i have made a note to update the docs on monday
[14:41:25] <3
[14:41:49] jbond42: cool, please update the phab template too
[14:41:53] ty!
[14:41:55] will do
[14:42:09] :cat-jam:
[17:53:20] Running into irc trouble: ```
[17:53:20] #wikimedia-office: nick change prevented
[17:53:20] Guest9322 -> razzi
[17:53:20] Cannot change nickname while banned/quieted on channel
[17:53:20] ```
[17:53:21] This happened after I pasted a long error traceback into irc, and I guess that was sent as a bunch of messages and blocked me
[17:53:40] Any irc admins / experts know how I could resolve this?
[17:54:35] Wikimedia doesn't run freenode
[17:54:57] Guest97322: is your irccloud set up with your nickserv password?
[17:55:07] Reedy: razzi is a new WMF SRE
[17:55:17] So I guess I should reach out to freenode staff, huh
[17:55:17] cdanis: should be
[17:55:40] Guest97322: https://freenode.net/kb/answer/irccloud
[17:55:48] Or just part/leave the channel before trying to change your nick?
[17:56:02] yea, you can part the channel, then change nick and rejoin
[17:56:27] I'm guessing it's some anti-spam type prevention for that channel
[17:56:52] Gotcha, let me try rejoining
[17:57:35] that happens on some larger channels usually, like #debian . wasn't aware it happens on -office
[19:14:10] ryankemper: probably not urgent but there are quite a few cert-related warnings on icinga for cloudelastic1005 and 1006
[19:16:57] that happened to me last week, if you wait long enough you can rejoin, i think 4+ hours :-(
[19:17:56] I've never had that issue before :/
[19:48:35] andrewbogott: thanks for the heads up, back from lunch now. taking a look, from the previous context I have, the certs themselves should be in fine shape so I suspect the issue might be alert-related
[20:11:50] So cert-related checks for 1005 and 1006 are failing, this is an example of one of the several checks: `check_ssl_on_port_letsencrypt!cloudelastic1005.wikimedia.org!9243`
[20:11:50] the relevant nodes/hostnames are in `acme-chief` though: https://github.com/wikimedia/puppet/blob/822259dc2de2eb5dc9669b6ad6f5f04b24dba2ba/hieradata/role/common/acme_chief.yaml#L18-L30
[20:12:44] I think the next step is to figure out how the `check_ssl_on_port_letsencrypt` command works, basically I'm trying to figure out if the check itself could be wrong or if it does actually mean that the certs will expire without renewal
[20:14:22] check command is here: https://github.com/wikimedia/puppet/blob/fc4f45d3f5331b4fccd6206ceba171467099a288/modules/nagios_common/files/check_commands/check_ssl.cfg#L20-L23
[20:17:34] and https://github.com/wikimedia/puppet/blob/production/modules/nagios_common/files/check_commands/check_ssl is the actual perl script that does the checking
[20:31:40] ryankemper: I recommend asking v.gutierrez about it. It is most likely going to renew it, but the monitoring check needs to be changed (the number of days before WARN/CRIT)
[20:31:56] have been there but can't find the ticket so far
[20:32:38] ryankemper: the check command will have a -w and -c with the number of days
[20:33:29] mutante: yeah warning is 7 days and critical is 3
[20:33:35] ryankemper: found it. see https://gerrit.wikimedia.org/r/c/operations/puppet/+/594722 and https://phabricator.wikimedia.org/T251726
[20:34:01] well.. maybe I am wrong then. What is the actual warning?
[20:34:29] mutante: thanks, so sounds like I should follow up in #wikimedia-traffic and just get a sanity check that it's a warning issue and will be resolved by that patch/ticket?
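Alongside the icinga plugin discussed above, the served certificate can also be inspected directly from any host with openssl; a minimal sketch follows, with the hostname and port taken from the check command mentioned earlier (whether SNI is needed for this endpoint is an assumption).

```
# Show who issued the certificate served on the Elasticsearch HTTPS port
# and when it expires; compare against the icinga WARN/CRIT day thresholds.
echo | openssl s_client -connect cloudelastic1005.wikimedia.org:9243 \
        -servername cloudelastic.wikimedia.org 2>/dev/null \
  | openssl x509 -noout -subject -issuer -enddate
```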
[20:34:38] one sec
[20:34:47] ryankemper: yea, that sounds about right
[20:35:15] except that might need a new, more generic, ticket
[20:37:04] So there's 12 warnings total, 6 per node (corresponding to the number of elasticsearch clusters on each node), but they all look like `SSL WARNING - Certificate cloudelastic.wikimedia.org valid until 2020-09-02 19:55:16 +0000 (expires in 4 days)` where the actual command is `$USER1$/check_ssl --warning 7 --critical 3 -H $HOSTADDRESS$ --cn cloudelastic1005.wikimedia.org -p 9443`
[20:38:30] This alert will go critical over the weekend so I'll need to get it suppressed in the meantime
[20:38:35] yea, so.. what I can say about this is that we got warnings about expiring Letsencrypt certs before for other hosts.. and they did eventually self-renew
[20:38:41] but since this keeps happening it should be fixed
[20:38:46] mutante: does acking an alert when it's a warning prevent it from going critical?
[20:38:52] or does the critical surface as a new alert from icinga's perspective?
[20:41:21] ryankemper: an ACK stays until "next state change", so it will be removed when WARN turns into CRIT or OK
[20:41:29] unless it is a "sticky ack"
[20:42:06] mutante: thanks, and last followup: would a sticky ack be a bad idea or does it make sense in this situation? (I'd set a reminder to circle back on monday and make sure things get sorted out)
[20:42:30] i'm gonna transport some context to #wikimedia-traffic right now but most of their team is outside of working hours right now so it's likely we wouldn't have this resolved today
[20:42:50] ryankemper: i think it makes sense in this situation. especially if there is also a (new/open) ticket for it, which could ideally be linked in Icinga as a comment
[20:43:02] thanks for the help, will do that
[20:43:05] yw
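For reference on the sticky ack mentioned at the end: with Icinga 1, acknowledgements can be submitted through the external command file, and the third field controls stickiness. This is a generic sketch only; the command-file path, host name, service description, and comment text are assumptions, not the actual values for this alert (in practice the ack would likely be set through the icinga web UI).

```
# ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
# sticky=2 keeps the acknowledgement until the service returns to OK
# (so it survives the WARN -> CRIT transition); sticky=1 drops it on the
# next state change, which is the default behaviour described above.
now=$(date +%s)
printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;cloudelastic1005;SSL cloudelastic1005;2;1;1;ryankemper;tracked in T251726\n' \
    "$now" > /var/lib/icinga/rw/icinga.cmd
```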