[07:03:37] good morning
[07:04:05] not sure if already discussed but there is a BFD session down between cr1-codfw and cr3-eqsin
[07:05:13] I see some maintenance for yesterday that mentions the link, but not today
[07:05:23] (Telia EVPN)
[07:06:33] and it shows up on the other side (eqsin)
[07:06:56] looking, didn't get the alert about traffic on tunnel links
[07:07:50] XioNoX: ack thanks! Maybe a 'clear bfd session address X' could suffice? If it is temporary
[07:08:03] I hope we're not back in that bug :)
[07:08:16] yeah :(
[07:08:37] well if we use this once to clear a stuck session it might not be the same
[07:09:08] I would need to briefly depool ulsfo to merge the DNS changes to migrate to netbox. Let me know if we're in good shape network-wise and when would be a good time for it
[07:09:14] * volans preparing the patch in the meanwhile
[07:09:51] volans: are you going to break ulsfo's network?
[07:10:07] not planning to :)
[07:10:32] changes are:
[07:10:32] https://gerrit.wikimedia.org/r/c/operations/dns/+/627605/
[07:10:40] https://gerrit.wikimedia.org/r/c/operations/dns/+/628046/
[07:10:57] elukey: v6 bfd is down, v4 is up, but both families work when pinging the other side
[07:12:03] clearing the BFD session didn't help
[07:12:06] er
[07:12:20] on both sides?
[07:13:20] I'd run it also on cr3-eqsin
[07:13:37] elukey: yep, that worked :)
[07:13:38] IIRC the last time the weird one was the side that was showing the session up
[07:13:41] \o/
[07:13:44] yeah now I remember
[07:13:50] thanks
[07:13:59] volans: you're good to go!
[07:14:27] ulsfo is the backup path if the codfw-eqsin link fails, but now we're all green
[07:14:28] XioNoX: thanks!
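For the record, the whole fix above condenses to two Junos operational-mode commands, run on each router in turn (cr1-codfw, then cr3-eqsin; note that last time only clearing the side that still showed the session up helped). A minimal sketch, with a placeholder standing in for the far side's real v6 address:

    > show bfd session                                 # find the down/stuck session
    > clear bfd session address <v6-neighbor-address>  # repeat on the other router if it stays down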
[08:42:14] volans: cumin surprised me here: https://phabricator.wikimedia.org/P12716
[08:43:02] kormat: yes, expected
[08:43:12] i should have expected to be surprised? :)
[08:43:33] so in the first one you're mixing a global syntax (P{}) with a puppetdb-specific one (dbprov*)
[08:43:52] that should be P{dbprov*} and P{C:wmfmariadbpy}
[08:44:39] in the second one you're using the puppetdb syntax for both, but unfortunately querying for both resources and nodes in the same puppetdb query is not supported by puppetdb
[08:44:49] documented on Cumin's wikitech page
[08:46:01] actually scratch this last bit
[08:46:16] you can't mix facts and resources, cumin will tell you that
[08:46:22] this last one I have to check
[08:46:51] ah no, the last one is just correct, no host matches :)
[08:46:55] sorry for the confusion
[08:47:33] volans: can you point me to the part of the wiki page that documents that what i tried isn't supported?
[08:48:37] for the first one?
[08:48:55] yes
[08:49:18] not sure it's that clear, but https://doc.wikimedia.org/cumin/master/introduction.html#query-language
[08:50:23] volans: maybe i'm blind, but i can't see anything on that or on the wikitech page that documents this
[08:51:25] extracting bits, let me know if they are not enough, we can improve it!
[08:51:29] The details of the main grammar are:
[08:51:32] Each query part can be one of:
[08:51:41] Specific backend query: I{backend-specific query syntax} (where I is an identifier for the specific backend).
[08:51:44] Alias replacement, according to the aliases defined in the configuration: A:group1.
[08:52:06] so the main grammar accepts only either an alias (A:...) or a backend-specific I{...} query part
[08:52:15] then
[08:52:15] If a default_backend is set in the configuration, Cumin will try to first execute the query directly with the default backend and only if the query is not parsable with that backend it will parse it with the main grammar.
[08:53:11] so 'dbprov* and P{C:wmfmariadbpy}' fails to be parsed by puppetdb because P{} is not part of its grammar
[08:53:43] and fails to parse with the global grammar because dbprov* is not parsable by the global grammar
[08:54:41] ahh. i see.
[08:54:48] while 'dbprov* and C:wmfmariadbpy' is a valid puppetdb query and hence gets parsed and works
[09:01:33] kormat: any suggestions to make it more clear?
[09:02:50] https://giphy.com/gifs/idk-whatever-meh-xrbdBK5A5cIYo
[09:03:19] and where exactly should I include that gif in the docs? :-P
[09:03:33] volans: i kinda feel that it could be made perfectly clear, at the expense of having so much documentation that no-one is going to read it
[09:04:17] yeah
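Condensing the grammar discussion into runnable form, a sketch assuming the usual sudo cumin CLI on a cluster-management host, where --dry-run only resolves the matching hosts without executing anything:

    # mixed grammars: bare dbprov* is puppetdb-only, P{} is global-only -> neither parser accepts the whole query
    sudo cumin --dry-run 'dbprov* and P{C:wmfmariadbpy}'

    # all global grammar: every part wrapped in a P{} backend query -> parses and works
    sudo cumin --dry-run 'P{dbprov*} and P{C:wmfmariadbpy}'

    # all puppetdb grammar: parsed directly by the default_backend -> parses and works
    sudo cumin --dry-run 'dbprov* and C:wmfmariadbpy'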
[09:21:13] "This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset." - except it's already based on the tip of the repo. wat
[09:21:19] (https://gerrit.wikimedia.org/r/c/operations/software/+/629067)
[09:23:33] weird
[09:24:06] maybe try to remove your +2 and play with rebase, or just send a new PS
[09:25:00] "Change is up to date with the target branch already (master)"
[09:25:26] i like how jenkins fails the merge, but doesn't give _any_ details
[09:25:30] it's telling you that you need to add dbctl support to spicerack and make a cookbook, not this utility bash script
[09:25:44] i'm sure that's it
[09:25:47] :)
[09:28:01] amazing - i don't have permission to view my build history on jenkins. 👏
[09:31:23] hashar: do you have any clue why jenkins hates me? (https://gerrit.wikimedia.org/r/c/operations/software/+/629067)
[09:31:33] apart from my personality, i mean
[09:32:57] kormat: gotta rebase it ?
[09:33:14] ah no
[09:33:17] something else on the infra
[09:36:11] kormat: you can +2 again https://gerrit.wikimedia.org/r/c/operations/software/+/629067 :)
[09:40:01] 🤞
[09:40:47] hashar: \o/ thanks :)
[11:04:51] from databases: I am going to purposely fail the backup check on s1-eqiad, to verify it is working as intended (icinga)
[11:05:28] it is downtimed
[12:10:14] godog: if, for my sins, i wanted to search for all current downtimes with a specific reason so that i could remove them, is such a quest achievable through the glories of icinga?
[12:11:09] kormat: depends how much you want to suffer
[12:11:30] i'm looking for a zero-suffering answer this time, folks. zero-suffering.
[12:13:04] then just go to https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=6, select and delete the downtimes
[12:13:32] i don't think you (or icinga, tbh) understand the concept of zero suffering
[12:13:46] there's about 200 lines i'd have to select
[12:14:18] suffering is in the JD for SRE I think :)
[12:14:41] bblack: i'd protest, but the JD talked about mysql, so it was strongly implied at least
[12:15:21] kormat: if you're up for more code suffering but less repetitive stuff, you can find them in icinga's status.dat file
[12:15:43] we already have a parser for it (see icinga-status in puppet), but because it was meant for other things it skips the downtimes
[12:15:51] alright, so much for trying to be smart. i'll do it the dumb way. search for each of the 14 hosts; check to see if there are other DTs than the ones i added (by looking at the timestamps and guessing), and if not remove all DTs. or something. gah.
[12:15:53] but could be easily extended to support them too
[12:15:57] Of course your instincts should lead you to reduce that suffering due to your embodiment of the three virtues http://threevirtues.com/
[12:16:01] that's rigged directly into spicerack's icinga module
[12:16:20] kormat: consider also adding 'jsonoutput' to the query string above, and parsing that instead
[12:16:46] and then issue the icinga commands you need of course
[12:16:52] kormat: in that case you could use https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.Icinga.remove_downtime
[12:17:28] volans: the hard part is figuring out which hosts it's safe to remove all DTs from
[12:17:59] kormat: if this is for s2 eqiad hosts... you can probably just let the downtime expire I would say
[12:18:00] godog: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=6&jsonoutput redirected back to the original url
[12:18:30] marostegui: that's definitely virtue 1 of bblack's link
[12:18:45] kormat: curious, works for me
[12:18:56] kormat: haha
[12:19:05] godog: oh, it worked in a new tab. v0v
[12:19:16] kormat: but it is the best solution for zero-suffering, isn't it?
[12:19:40] marostegui: i give in, sensei. it is.
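The jsonoutput route, sketched end to end. Plenty of assumptions here: the JSON field layout is a guess (inspect the real output first), basic auth may or may not work against this Icinga web frontend, and the command-file path is the Debian default. Only the URL and the DEL_HOST_DOWNTIME / DEL_SVC_DOWNTIME external commands (stock Icinga) are solid:

    # 1. fetch all current downtimes as JSON (needs your Icinga web credentials)
    curl -su "$USER" 'https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=6&jsonoutput' > downtimes.json

    # 2. eyeball downtimes.json and collect the downtime IDs whose comment matches your reason

    # 3. on the icinga host, delete each one via the external command file
    #    (the path is an assumption; adjust to the local setup)
    printf '[%s] DEL_HOST_DOWNTIME;%d\n' "$(date +%s)" "$DOWNTIME_ID" > /var/lib/icinga/rw/icinga.cmd

Alternatively, spicerack's Icinga.remove_downtime() linked above does the command-file part for you, given the hosts.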
[13:05:00] kormat: re: jenkins build history: https://phabricator.wikimedia.org/T177827 it was once disabled for being super-expensive, and no one has looked to see if it can be reënabled with modern versions
[13:05:55] cdanis: oic
[13:06:25] sorry, maybe the real task is https://phabricator.wikimedia.org/T178458
[13:06:33] but neither has been looked at for quite some time 🤷
[13:07:03] cdanis: i like your comment from earlier this month :)
[15:01:04] _joe_: is `Giuseppe Lavagetto: service::catalog: format yaml consistently (7cb1125d78)` good to merge in ops/puppet.git ?
[15:02:01] <_joe_> yes arturo sorry
[15:02:12] <_joe_> given that change is just cosmetic, I forgot to puppet-merge
[15:02:13] np! merging!
[19:57:02] Looks like `elastic2037` is unreachable via ssh (see https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=elastic2037) - anyone know who I should reach out to? maybe dc-ops?
[19:58:20] ryankemper: have you tried the remote console?
[19:59:07] volans: no, what remote console are you referring to?
[19:59:10] ssh elastic2037.mgmt.codfw.wmnet
[19:59:52] then what you can do depends on the hardware; most commands are outlined here
[19:59:55] https://wikitech.wikimedia.org/wiki/Platform-specific_documentation
[20:00:46] i'm afraid Ryan does not have pwstore access to get that password, though
[20:01:01] i'll check the console
[20:02:14] I see him in pwstore's config
[20:02:23] don't know why he shouldn't have access
[20:03:07] oh, i thought i checked .users but was wrong then
[20:03:37] ryankemper: so the console is empty.. let's try a powercycle to see what happens?
[20:04:05] mutante: powercycle sounds good
[20:04:09] currently digging up the pw from pwstore
[20:07:38] ryankemper: ok, after login i did "console com2" which always gets you a console on any Dell server, then i did "racadm serveraction powercycle" and went back to the console.. and it's starting to boot now after a while
[20:09:51] ryankemper: regular ssh to the host works again, but feel free to try mgmt anyway
[20:10:10] mutante: thanks!
[20:11:40] ryankemper: now to find out if there was a hardware issue, you can try to follow https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Failure_Log_Gathering
[20:12:00] specifically the "racadm getsel" on the mgmt console
[20:12:05] mutante: and for these kinds of scenarios in the future, I imagine https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_Documentation is what I should read for a rundown on the various commands?
[20:13:21] ryankemper: correct, "console com2" and "racadm serveraction powercycle" are probably the most common ones
[20:13:46] most of the time you just need to know if it's Dell or HP
[20:14:06] this is jumping out at me https://www.irccloud.com/pastebin/3A6nwodM/
[20:15:05] yep, so if you have something like this, an actual hw error in getsel.. then it's time to involve dcops
[20:15:20] mutante: thanks for the help. so would the next step be for me to create a hardware troubleshooting phab ticket with the contents of `racadm getsel`?
[20:15:22] make a ticket and paste the output of that
[20:15:27] yes, exactly
[20:15:35] ack
[20:16:20] optionally you can check in netbox if this server is still under warranty or not
[22:54:23] Malformed membership for ops user razzi, has additional group(s): set(['druid-admins', 'deploy-aqs', 'analytics-admins', 'analytics-deployers', 'eventlogging-admins'])
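For reference, the elastic2037 recovery above as one sequence; every command is taken straight from the log, only the comments are added:

    # from a host with mgmt network access; the password comes from pwstore
    ssh elastic2037.mgmt.codfw.wmnet

    # on the Dell iDRAC shell:
    console com2                       # attach the serial console (works on any Dell server)
    racadm serveraction powercycle     # if the console is empty/hung, power-cycle the box

    # once it boots, pull the system event log; paste any hw errors into a ticket for dcops
    racadm getsel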