[09:45:59] dcaro: https://phabricator.wikimedia.org/T363341#10060929 that's a bug, will fix asap
[09:46:19] 👍
[10:00:46] dcaro: fixed on -dev, should be good on prod in a few min too
[10:06:50] ack thanks!
[10:07:06] dcaro: all done, can you give it another try?
[10:09:35] XioNoX: it worked \o/
[10:39:57] hi jayme feel free to merge
[10:40:14] stevemunene: beyond that point already, you should be gtg
[10:40:43] great
[10:40:45] (with re-running puppet-merge I mean)
[13:42:10] can someone remind me if there are additional steps to be performed when adding a conftool schema
[13:42:27] I merged https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/cd77dabd7515d2bab80a293364a97153d30ad60e and puppet seems to have run everywhere it should
[13:42:51] etcdctl -C https://conf1007.eqiad.wmnet:4001 ls /conftool/v1/geodns
[13:42:52] Error: 100: Key not found (/conftool/v1/geodns
[13:43:05] so I am trying to find out what I am missing here
[13:47:26] sukhe: was there any output at the end of puppet-merge from the conftool data loader?
[13:48:14] cdanis: good question but I don't recall :(
[13:51:32] hmm
[13:52:13] my understanding always has been that conftool-merge will be called by puppet-merge? at least last time I didn't have to do anything special
[13:52:51] it is
[13:52:55] but
[13:53:08] in this case I think it maybe depended upon puppet being run on the puppetmaster first
[13:53:25] 2024-08-13 13:53:18 [INFO] conftool::load: Adding objects for geodns
[13:53:28] when I ran it by hand just now
[13:53:41] and now I see for instance /conftool/v1/geodns/text-next
[13:53:47] cdanis: weird because the logs indicated that Puppet had been run because the schema was created?
[13:53:49] ok, that's a very funny wrinkle to have encountered
[13:54:00] sukhe: which logs?
[13:54:02] and yeah, thanks, finally I see the key!
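The sequence above — the loader running before the schema file was updated, so the key only appears after a manual re-run — can be reproduced in miniature with plain files. This is a toy sketch of the ordering hazard, not the real conftool tooling; all file names here are invented:

```shell
tmp=$(mktemp -d)
# schema file as the puppet agent last wrote it: no geodns entry yet
printf 'mediawiki: {}\n' > "$tmp/schema.yaml"

# a toy "loader" that only creates keys for schema entries it can see
load() { grep -q '^geodns:' "$tmp/schema.yaml" && touch "$tmp/geodns-key" || true; }

load                                               # loader runs too early
[ -e "$tmp/geodns-key" ] || echo "Key not found"   # mirrors the etcdctl error

printf 'geodns: {}\n' >> "$tmp/schema.yaml"        # agent run updates the schema file
load                                               # manual re-run, like conftool-merge by hand
[ -e "$tmp/geodns-key" ] && echo "key present"
```

The fix-by-rerunning behavior also matches cdanis's later observation that the problem would have resolved itself on the next puppet-merge after the agent run.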
[13:54:12] cdanis: agent run logs on puppetmaster1001
[13:54:32] Aug 13 13:27:56 puppetmaster1001 puppet-agent[17618]: Applying configuration version '(cd77dabd75) Ssingh - P:conftool: add schema for geodns'
[13:54:41] Aug 13 13:28:02 puppetmaster1001 puppet-agent[17618]: (/Stage[main]/Profile::Conftool::Client/File[/etc/conftool/schema.yaml]) Filebucketed /etc/conftool/schema.yaml to puppet with sum 888a925213b188f9c1f0107aa003ee0c
[13:54:41] right, but that needed to have applied before conftool-merge ran
[13:54:44] is what I'm saying
[13:55:02] oh I see. which puppet-merge should have run
[13:55:03] hmm
[13:55:25] anyway I can certainly see it now. so you ran conftool-merge manually?
[13:55:28] yes
[13:55:37] ok thanks. let's keep it in mind if we observe more of this I guess
[13:55:51] conftool-merge reads from a puppet agent-managed directory I think
[13:56:12] ah, /etc/conftool/data is a symlink to /var/lib/git/operations/puppet/conftool-data
[13:56:17] *but* schema.yaml is managed by puppet
[13:57:07] so at the time that puppet-merge ran conftool-merge, the geodns schema hadn't yet appeared in puppetmaster1001's copy of schema.yaml
[13:58:00] interesting!
[13:58:23] that doesn't seem ideal and also doesn't seem to have happened before
[13:58:39] it's not ideal, but we add new conftool schemas, like, once every two years
[13:58:55] so it's hard to say if it has happened before
[14:01:17] it also would have fixed itself the next time someone else ran puppet-merge after the agent cronjob heh
[14:24:57] filed T372408
[14:24:58] T372408: Adding a new conftool schema type might require a manual `conftool-merge` invocation - https://phabricator.wikimedia.org/T372408
[14:34:41] cdanis: thanks!
[14:35:16] cdanis: fwiw, I added a schema this year as well and I don't remember having any issues but maybe I was just mistaken
[14:35:27] sukhe: well, there's a chance you just won the race
[14:35:43] like, a pretty good one actually
[14:35:45] yeah! possible.
but thanks for documenting it
[14:36:07] the thing I'm chewing on is whether or not there's any tricky bootstrapping issues hidden here
[17:05:21] denisse: fyi: I'm going to restart Cassandra on sessionstore to apply a jdk update. I expect no impact, but since it's sessions, I'll make you aware out of an abundance of caution :)
[17:05:57] urandom: ACK, thank you!
[18:12:51] There seem to be some problems with WMF wikis - I'm unable to see notification drawer contents on plwiki (it renders "no notifications" after a long wait) and unable to upload anything to Commons. I've also encountered an unexpected DB exception on Special:Preferences but it was just a single time. The wiki content is loading, though, but slowly
[18:14:38] Msz2001, known outage, SREs are investigating
[18:14:38] thanks Msz2001 we appear to be having a db overload on one of our primary databases, we are investigating
[21:00:26] Hello, the link https://logs-api.svc.eqiad.wmnet/ provided in https://wikitech.wikimedia.org/wiki/Logstash#Extract_data_from_Logstash_(OpenSearch)_with_curl_and_jq does not render a valid page for me; does anyone know the correct URL?
[21:01:21] Otherwise, how do I go about 'SSH tunneling port 9200 from an opensearch node'?
[21:01:50] My goal is to be able to see how to aggregate log data that feeds into Logstash
[21:14:10] hi ecarg, sorry please hold, we're doing some firefighting right now :)
[21:16:51] ecarg `wmnet` domain is only resolvable from WMF infrastructure, so unless you are a staff member or trusted volunteer you won't be able to reach this endpoint
[21:22:00] TY cdanis! I do have prod access as a staff member (gchoi-wmf), inflatador!
[21:22:39] ecarg ACK, welcome aboard!
[21:22:41] Perhaps I need a different/higher level of permissions?
[21:23:16] do you have ssh access to hosts yet?
[21:24:31] Hmm I'm able to SSH into prod 0-0
[21:31:09] do you have SSH access to a logstash host? like `logstash2027.codfw.wmnet` for example?
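What ecarg is ultimately after — the curl-and-jq extraction described on the wikitech page — boils down to querying the search API and flattening the hits with jq. A sketch using a canned response in place of the live endpoint; the field names (`timestamp`, `host`, `level`, `message`) follow the wikitech example but are illustrative, not the real index schema:

```shell
# canned OpenSearch-style response standing in for the live logs-api endpoint
cat > response.json <<'EOF'
{"hits":{"hits":[{"_source":{"timestamp":"2024-08-13T21:00:00Z","host":"mw1001","level":"ERROR","message":"example log line"}}]}}
EOF
command -v jq >/dev/null || { echo "jq not installed"; exit 0; }
# the jq half of the recipe: flatten each hit's _source into tab-separated fields
jq -r '.hits.hits[]._source | [.timestamp, .host, .level, .message] | @tsv' response.json
```

Against production, the `cat` step would be replaced by a `curl -XGET` to the search endpoint, which is exactly the part that needs network access to the opensearch nodes.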
[21:33:29] ecarg: afaict to be able to do this you would first have to request being added to one of the groups called "logstash-roots" or "elasticsearch-roots"
[21:34:18] those are the shell user groups that allow access to those opensearch nodes
[21:35:28] Oh I don't believe I have access to a Logstash host! Not that I recall, this is my first endeavor into looking into this
[21:35:59] ecarg: but what you can do now is use curl from a bastion host.. that may not be very convenient but it's an option
[21:36:02] [bast1003:~] $ curl https://logs-api.svc.eqiad.wmnet/
[21:36:07] mutante Should I create a Phab task or is there a template I should use to request this access?
[21:36:23] the API does respond to that for me
[21:36:57] let me try that!
[21:37:12] any of the "bast*" machines or even the deployment server
[21:38:19] then you would have to copy that whole "curl -XGET ..." part from https://wikitech.wikimedia.org/wiki/Logstash#Extract_data_from_Logstash_(OpenSearch)_with_curl_and_jq
[21:42:02] ecarg: lmk if you want help getting on the bastion hosts. it's possible you need a line in ssh config to tell it.. that bastion hosts don't require jumping via bastion hosts
[21:50:07] TY mutante will lyk!!
[22:15:01] I haven't ssh'd into prod in a couple of months; today I'm getting this msg, has anyone run into this before? It is a first for me:
[22:15:01] ```
[22:15:02] @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[22:15:02] @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
[22:15:03] @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[22:15:03] IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
[22:15:04] Someone could be eavesdropping on you right now (man-in-the-middle attack)!
[22:15:04] It is also possible that a host key has just been changed.
[22:15:05] The fingerprint for the ED25519 key sent by the remote host is
[22:15:05] SHA256:qXOclvHZAbVta11pt98lRSn5C2oZt55SzV6XfYkWIjs.
[22:15:06] Please contact your system administrator.
[22:15:06] Add correct host key in /Users/ecarg/.ssh/known_hosts to get rid of this message.
[22:15:07] Offending ECDSA key in /Users/ecarg/.ssh/known_hosts:16
[22:15:07] Host key for deployment.eqiad.wmnet has changed and you have requested strict checking.
[22:15:08] Host key verification failed.
[22:15:08] ```
[22:15:27] (oops, I clearly dk how to paste code snips in IRC, sry)
[22:17:37] out of curiosity ecarg are you using the `wmf-laptop` package? (it wouldn't *quite* fix this, yet)
[22:17:54] hmmm I'm not sure, how do I check >_<
[22:18:10] ah I just realized you are on OS X
[22:18:10] I was able to ssh into prod before using `ssh deployment.eqiad.wmnet`
[22:18:44] ecarg: This is because the deployment server changed. In this case.. I can confirm that SHA256:qXOclvHZAbVta11pt98lRSn5C2oZt55SzV6XfYkWIjs is the correct fingerprint. So you can delete line 16 in the file it mentions and accept the new one. But it's good that you asked and didn't blindly trust it.
[22:18:52] Yes, I'm on an Apple M2 Max
[22:19:27] for reference, you can check that sort of thing at https://wikitech.wikimedia.org/wiki/Deploy1002 ... but I think SRE can/should put a little bit of infra work into this
[22:19:47] the server that is actually behind deployment.eqiad.wmnet can change. Right now it is deploy1003.eqiad.wmnet
[22:20:07] I have to go, sorry, worked far too late today
[22:20:17] so the warning is because it's not deploy1002 anymore, but now deploy1003
[22:20:18] TY cdanis bye!
[22:21:01] mutante SG!
[22:21:25] ecarg: open /Users/ecarg/.ssh/known_hosts and delete line 16.
connect again, accept the fingerprint, then it should be gone
[22:21:45] what is kind of missing here is the same page on wikitech but for Deploy1003
[22:22:07] let me just dump the fingerprints though at least
[22:22:15] coolness
[22:22:27] https://wikitech.wikimedia.org/wiki/Deploy1003 here you go
[22:22:44] compare what you see there and if it matches you can accept :)
[22:22:58] * mutante protects the page so only admins can edit it
[22:31:31] perfect! They match <3
[22:34:43] mutante I'm able to see this:
[22:34:43] `ecarg@deploy1003:~$ curl https://logs-api.svc.eqiad.wmnet/
[22:34:44] {
[22:34:44]   "name" : "logstash1031-production-elk7-eqiad",
[22:34:45]   "cluster_name" : "production-elk7-eqiad",
[22:34:45]   "cluster_uuid" : "Ts3hzDAxQdqO8FUzv5G9Fg",
[22:34:46]   "version" : {
[22:34:46]     "distribution" : "opensearch",
[22:34:47]     "number" : "2.7.0",
[22:34:47]     "build_type" : "deb",
[22:34:48]     "build_hash" : "b7a6e09e492b1e965d827525f7863b366ef0e304",
[22:34:48]     "build_date" : "2023-04-27T21:43:41.976278281Z",
[22:34:49]     "build_snapshot" : false,
[22:34:49]     "lucene_version" : "9.5.0",
[22:34:50]     "minimum_wire_compatibility_version" : "7.10.0",
[22:34:50]     "minimum_index_compatibility_version" : "7.0.0"
[22:34:51]   },
[22:34:51]   "tagline" : "The OpenSearch Project: https://opensearch.org/"
[22:35:34] but the long curl command from https://wikitech.wikimedia.org/wiki/Logstash#Extract_data_from_Logstash_(OpenSearch)_with_curl_and_jq isn't successful; I don't think I'm able to get into the Logstash server
[22:35:40] (yet)
[22:36:08] ecarg: well, a big step though because you can talk to the API
[22:36:09] I think I need to get onto the bastion host first?
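As an aside on the known_hosts fix above: instead of hand-editing the file, `ssh-keygen -R` can remove the stale entry. A demonstration against a scratch file so nothing real is touched; the actual fix would drop `-f` and let it edit `~/.ssh/known_hosts` directly:

```shell
tmp=$(mktemp -d)
# generate a throwaway host key so the scratch known_hosts entry is well-formed
ssh-keygen -q -t ed25519 -N '' -f "$tmp/hostkey"
# scratch known_hosts with a (now stale) entry for the deployment alias
awk -v h='deployment.eqiad.wmnet' '{print h, $1, $2}' "$tmp/hostkey.pub" > "$tmp/known_hosts"
# remove every key stored under that hostname; keeps a backup as known_hosts.old
ssh-keygen -R deployment.eqiad.wmnet -f "$tmp/known_hosts"
# the entry is gone; the next ssh will offer the new fingerprint to verify and accept
grep -c 'deployment' "$tmp/known_hosts" || echo "entry removed"
```

As in the conversation, the new fingerprint should still be verified against the wikitech host page before accepting it.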
[22:36:17] woohoo yes
[22:36:36] no, if this works for you from the deployment server then the bastion host would be the same
[22:37:07] if that full script doesn't work but the simple curl does
[22:37:16] then I think it's past the level of "having access"
[22:37:43] and more a question of why the API doesn't like what you send
[22:37:59] note though it's actually 2 separate scripts that are linked there. I also didn't see this at first
[22:38:38] the first example, search.sh, is just the first 9 lines
[22:39:53] ah, yea, the example tries to connect to localhost.. so that is meant to run on the search servers
[22:40:36] and when I try to replace that with logs-api.svc.eqiad.wmnet I can't connect to 9200.. you are right
[22:40:55] so at this point, my advice is one of 2 things:
[22:41:02] Oops too much pasted from my initial attempt; I get this better error:
[22:41:02] `curl: (7) Failed to connect to localhost port 9200: Connection refused
[22:41:02] -bash: logstash-server:~$: command not found
[22:41:03] -bash: timestamp:: command not found
[22:41:03] -bash: host:: command not found
[22:41:04] -bash: level:: command not found
[22:41:05] -bash: message:: command not found
[22:41:05] -bash: timestamp:: command not found
[22:41:05] -bash: host:: command not found
[22:41:06] -bash: level:: command not found
[22:41:07] -bash: message:: command not found
[22:41:07] -bash: timestamp:: command not found
[22:41:07] -bash: host:: command not found
[22:41:08] -bash: level:: command not found
[22:41:08] -bash: message:: command not found`
[22:41:12] hahahah
[22:41:15] either create that phab ticket asking for the access to the hosts
[22:41:22] or take it to the team specific channel
[22:41:30] which would be #wikimedia-observability
[22:41:34] or a combination of those
[22:42:31] SG!
[22:42:31] I've got a dumb(er) Q, was this more a #wikimedia-observability Q than a #wikimedia-sre Q?
I'm not sure when to go to which, also for Slack
[22:42:36] ecarg: here is the link to the phabricator form for making such requests:
[22:42:37] https://phabricator.wikimedia.org/maniphest/task/edit/form/8/
[22:42:41] TYY
[22:43:19] ecarg: it's not a dumb question. SRE is split into many subteams. basic questions we can all answer here. but once it gets into more specific territory the subteam is better
[22:43:24] in this case the service is run by that subteam
[22:43:34] like ppl say, go to that 'observability' person, but then they'd be like oh I'm SRE and vice versa soooo
[22:43:45] cool makes sense, TY!
[22:44:07] in my opinion it's never "which person to ping". it's good to always approach it on team level rather than person level
[22:44:24] who owns which service you can technically look up in the puppet repo, fwiw
[22:44:29] \O/ SG!
[22:45:26] if you fill out that phabricator form then someone will respond for sure
[22:45:38] because we have a rotating "clinic duty" to handle those each week
[22:45:46] and you can follow-up with questions there too
[22:45:46] nice nicee
[22:46:25] just treat it as if it was an email with subject and body and describe there what you are trying to do
[22:47:24] will do!