[00:35:25] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9779420 (Papaul)
[00:37:00] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9779435 (Papaul) Open→Resolved
[00:42:22] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9779440 (Papaul)
[09:22:20] netops, Traffic, Infrastructure-Foundations: mgmt ssh access for prometheus hosts in magru - https://phabricator.wikimedia.org/T364454 (fgiunchedi) NEW
[09:24:10] greetings, re: prometheus in magru other than ^ I believe we're good to go
[09:25:10] Traffic, MW-Interfaces-Team, serviceops: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9779996 (hnowlan) This seems like a pretty reasonable idea to me. Could we implement this remapping at the ATS layer rather than the Apache one, in a manner that would mean that wh...
[10:58:37] netops, Infrastructure-Foundations, ops-codfw, SRE: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464 (cmooney) NEW p:Triage→High
[11:14:33] netops, Infrastructure-Foundations, ops-codfw, SRE: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464#9780328 (cmooney)
[12:48:25] Traffic, MW-Interfaces-Team, serviceops: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9780622 (BBlack) >>! In T364400#9779996, @hnowlan wrote: > Could we implement this remapping at the ATS layer rather than the Apache one, in a manner that would mean that when we n...
[12:52:56] netops, Infrastructure-Foundations, ops-codfw, SRE: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464#9780633 (Papaul) @cmooney I think this is just a human error issue. We were racking all the lsw1-d* yesterday and maybe we accidentally bumped into the cable. We will check o...
[13:07:20] Traffic: Remove mtail leftovers on ncredir puppetization and instances - https://phabricator.wikimedia.org/T364385#9780736 (Vgutierrez) Open→Resolved
[13:19:20] Traffic, MW-Interfaces-Team, serviceops: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9780805 (daniel) >>! In T364400#9779996, @hnowlan wrote: > This seems like a pretty reasonable idea to me. Could we implement this remapping at the ATS layer rather than the Apache...
[13:28:48] inflatador: re the mysteries of config-master outputs
[13:29:04] I was looking at it this morning too, because it puzzled me the other day
[13:29:34] the disconnect in those two sets of URL outputs is basically the distinction between true etcd pools, and frontend service definitions that reference them
[13:29:45] so for example:
[13:29:49] https://config-master.wikimedia.org/pybal/eqiad/apertium
[13:29:59] https://config-master.wikimedia.org/pybal/eqiad/blubberoid
[13:30:23] neither "apertium" nor "blubberoid" appear in: https://config-master.wikimedia.org/pools.json
[13:31:25] but that's because they're both just "front" service names, which both map to the same set of backends from etcd (which is why their outputs look identical)
[13:31:49] the actual etcd-defined set of backends for both is for cluster=kubernetes,service=kubesvc in the pools.json output
[13:32:34] netops, Infrastructure-Foundations, ops-codfw, SRE: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464#9780842 (cmooney) >>! In T364464#9780633, @Papaul wrote: > @cmooney I think this is just a human error issue. We were racking all the lsw1-d* yesterday and maybe we accidenta...
[13:32:52] in some simpler cases, all the names line up, but this layering disconnect is always there structurally
[13:34:18] e.g. using a simpler case like "ldap-ro" in hieradata/common/service.yaml: it just so happens that the top-level "service" name in that file is "ldap-ro", and also within its metadata, the conftool "cluster" and "service" values are also all "ldap-ro", but that's a happy coincidence.
[13:35:05] In https://config-master.wikimedia.org/pybal/eqiad/$foo , foo is the top-level service name (top level of keys in service::catalog)
[13:36:08] in https://config-master.wikimedia.org/pools.json you're looking at the raw set of conftool-level dc[cluster][service][host], which the above services may use as backends with some arbitrary mapping.
[13:37:06] Traffic, Data-Platform-SRE, Patch-For-Review: LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702#9780855 (CodeReviewBot) bking merged https://gitlab.wikimedia.org/repos/search-platform/sre/lvs_l2_checker/-/merge_requests/1 LVS: moni...
[13:39:12] bblack thanks for clarification. If I'm understanding the implications correctly, that means pools.json is enough to get all the backend nodes we care about? We don't have to scrape each pool?
[13:39:23] correct
[13:39:43] I think the only reason you'd care to look deeper than that, is if you wanted to report affected services or something
[13:40:02] e.g. if one of the kubesvc nodes is unrouteable, if you wanted to message that as "and this affects: apertium, blubberoid, ..."
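The layering described above can be sketched in Python. This is a minimal illustration, not the lvs_l2_checker code: the nested layout follows the "dc[cluster][service][host]" description from the chat, but the per-host fields, the hostnames, and the front-service mapping (`SAMPLE_FRONTS`, a stand-in for service::catalog) are all assumptions made up for the example, as are the helper names `backend_hosts` and `affected_fronts`.

```python
from collections import defaultdict

# Assumed shape, following the "dc[cluster][service][host]" description;
# the real pools.json payload may carry different per-host fields.
# Hostnames below are invented for illustration.
SAMPLE_POOLS = {
    "eqiad": {
        "kubernetes": {
            "kubesvc": {
                "kubernetes1001.eqiad.wmnet": {"pooled": "yes", "weight": 10},
                "kubernetes1002.eqiad.wmnet": {"pooled": "no", "weight": 10},
            }
        },
        "ldap-ro": {
            "ldap-ro": {
                "ldap-replica1001.eqiad.wmnet": {"pooled": "yes", "weight": 10},
            }
        },
    }
}

# Hypothetical front-service -> (cluster, service) mapping.
# Two front names can share one etcd-defined backend set.
SAMPLE_FRONTS = {
    "apertium": ("kubernetes", "kubesvc"),
    "blubberoid": ("kubernetes", "kubesvc"),
    "ldap-ro": ("ldap-ro", "ldap-ro"),
}


def backend_hosts(pools, dc):
    """Return {host: [(cluster, service), ...]} for one datacenter."""
    hosts = defaultdict(list)
    for cluster, services in pools.get(dc, {}).items():
        for service, members in services.items():
            for host in members:
                hosts[host].append((cluster, service))
    return dict(hosts)


def affected_fronts(fronts, cluster, service):
    """Return the sorted front-service names backed by (cluster, service)."""
    return sorted(
        name for name, backend in fronts.items() if backend == (cluster, service)
    )


if __name__ == "__main__":
    for host, pools_for_host in sorted(backend_hosts(SAMPLE_POOLS, "eqiad").items()):
        print(host, pools_for_host)
    print(affected_fronts(SAMPLE_FRONTS, "kubernetes", "kubesvc"))
```

Walking the nested structure once yields every backend host without scraping each per-service pybal URL; the second helper is only needed for the optional "and this affects: ..." style of reporting.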
[13:40:35] Nah, that's a bit beyond the scope of what I'm doing
[13:40:40] ok :)
[13:41:11] But the pools.json thing is good news...should make things easier
[13:41:17] awesome
[13:43:31] ^^ volans see bblack comments above, sounds like pools.json doesn't need to be fixed
[14:04:53] netops, Traffic, Infrastructure-Foundations: mgmt ssh access for prometheus hosts in magru - https://phabricator.wikimedia.org/T364454#9780923 (ssingh) On https://librenms.wikimedia.org/alerts, I see the following for `mr1-magru`: ` #1: last_polled => '2024-05-08 14:01:37' last_polled_timetaken =...
[14:11:50] netops, Infrastructure-Foundations, ops-codfw, SRE: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464#9780928 (Jhancock.wm) Open→Resolved a:Jhancock.wm port 47 on the msw was going up and down on its own. replaced the rj-45 terminator. remained steady.
[14:32:30] netops, Traffic, Infrastructure-Foundations: mgmt ssh access for prometheus hosts in magru - https://phabricator.wikimedia.org/T364454#9780995 (ssingh) On `mr1-magru`, I see `10.140.1.18` (`prometheus7001`) and `denied by policy`, which makes me wonder if we need to run `https://netbox.wikimedia.org/...
[14:46:35] inflatador: are you sure you don't want to read from etcd directly, or set up a confd template specifically for your monitoring work?
[14:46:54] config-master is another dependency/moving part
[14:50:33] Traffic, Movement-Insights: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#9781040 (OSefu-WMF) Pending additional review and approval from @ttaylor
[15:02:50] netops, Infrastructure-Foundations, SRE: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480 (cmooney) NEW p:Triage→Medium
[15:05:52] cdanis I started writing it that way. Ended up making ~41 calls to etcd. If that's not going to stress the system, then yeah, I'd rather use etcd
[15:06:18] inflatador: please do, the steady state is that etcd serves several hundred qps of reads
[15:06:42] cdanis excellent, thanks for the advice
[15:12:39] cdanis: actually I think we were trying to keep the number of systems that read directly from etcd fairly low, in particular now that the v3 API migration is upcoming...
[15:12:48] okay
[15:12:50] fair enough
[15:12:55] but check with Scott
[15:14:35] inflatador: FYI ^^
[15:18:53] Traffic, Data Products, Data-Engineering, Observability-Logging, Patch-For-Review: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117#9781136 (Ottomata) > adopt topic names that follow EP conventions: . I'm sorry for not thinking about thi...
[15:23:33] volans ACK...a quick pros/cons makes me think pools.json is probably best ATM, cdanis swfrench-wmf if y'all wanna review https://etherpad.wikimedia.org/p/monitoring-lvs-l2 and add any feedback LMK
[15:23:59] yeah with the upcoming migration I'm fine with using pools.json
[15:24:10] to clarify, confd would also be perfectly ok
[15:26:10] ah, I've never used confd, is that just an API call?
[15:26:21] it's the same mechanism that generates pools.json :)
[15:26:49] er actually, perhaps it isn't
[15:27:32] inflatador: ack, will take a look - thanks for flagging :)
[15:27:36] https://gerrit.wikimedia.org/g/operations/puppet/+/98eda59c4f826f6e54617339599d67aa0f03a456/modules/profile/manifests/configmaster.pp
[15:29:25] also +1 to confd if (a) a config-file makes sense for your use case and (b) you have no other use case to take a direct etcd dependency (I've not read the doc yet, so don't know if those are true)
[15:30:31] Does this imply I need to run a confd service on LVS hosts, or how would I implement this?
[15:30:54] confd should already be running there for other use cases
[15:31:26] indeed, and on config-master / the puppetmasters in general
[15:31:31] confd::instance is the puppetized class
[15:31:33] https://gerrit.wikimedia.org/g/operations/puppet/+/98eda59c4f826f6e54617339599d67aa0f03a456/modules/profile/manifests/discovery/client.pp
[15:31:35] you would need to add configuration to confd so it knows what etcd keys to watch, plus a config-file template it would populate based on said keys
[15:31:35] example
[15:31:48] I don't see it in systemd (looking on lvs1019)
[15:31:55] and that example, btw, is included in config-master https://gerrit.wikimedia.org/g/operations/puppet/+/98eda59c4f826f6e54617339599d67aa0f03a456/modules/role/manifests/config_master.pp
[15:32:45] I stand corrected, yeah - just checked :)
[15:33:05] as c.danis notes, it should be easy to add using the existing puppet classes
[15:34:09] netops, Traffic, Infrastructure-Foundations: mgmt ssh access for prometheus hosts in magru - https://phabricator.wikimedia.org/T364454#9781207 (cmooney) >>! In T364454#9780995, @ssingh wrote: > On `mr1-magru`, I see `10.140.1.18` (`prometheus7001`) and `denied by policy`, which makes me wonder if we...
[15:34:18] I'd need buy-in from Traffic since they own the hosts. I think I'll start with pools.json for the MVP, easier sell than "hey, can I install a new service on all your hosts?" ;P
[15:37:43] Traffic, Movement-Insights: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#9781213 (kzimmerman) @ttaylor and @mark confirmed we can move forward.
[15:38:35] inflatador: let me check the script once again. sorry, haven't got to it yet
[15:38:39] sounds good - as long as up to 5m of staleness (current timer interval for populating pools.json, though could probably be changed) is ok, then that sounds like a good starting point too
[15:46:22] sukhe np, it's not a big priority. I'm just in one of my "manic/obsessive SRE" moods ;P
[16:06:18] netops, Traffic, Infrastructure-Foundations: mgmt ssh access for prometheus hosts in magru - https://phabricator.wikimedia.org/T364454#9781281 (cmooney) Open→Resolved a:cmooney Sorry for the delay, the capirca script times out a lot for some reason will need to look at that. Working...
[17:18:15] Traffic, Data-Platform-SRE, serviceops: Investigate why pools.json does not match https://config-master.wikimedia.org/pybal/${datacenter}/${service} T363702 - https://phabricator.wikimedia.org/T364037#9781567 (bking) Open→Invalid Recording @BBlack 's explanation of why this is from #wikim...
[17:36:02] Traffic, Data-Platform-SRE, Patch-For-Review: LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702#9781601 (CodeReviewBot) bking opened https://gitlab.wikimedia.org/repos/search-platform/sre/lvs_l2_checker/-/merge_requests/2 Source ba...
[19:52:27] Acme-chief: acmechief: add support for providing files with they private key before the public key - https://phabricator.wikimedia.org/T364424#9781983 (jhathaway) Open→Invalid Setting this task to invalid, given that I should be able to retrieve the individual files from the puppet filesystem api...
[19:56:58] Acme-chief: Suggestion on how to setup acme-chief in the dcl testing environment - https://phabricator.wikimedia.org/T364511 (jhathaway) NEW
[21:42:04] Traffic, Data-Engineering, Observability-Logging, Patch-For-Review: Benthos loses messages when under high load - https://phabricator.wikimedia.org/T364379#9782170 (Fabfur) @CDanis helped me a lot in this direction and he found a workaround|solution for this specific issue, optimizing Benthos con...
[21:44:38] Traffic, Movement-Insights: Disable Chrome Private Prefetch Proxy - https://phabricator.wikimedia.org/T364126#9782171 (OSefu-WMF) a:OSefu-WMF→KOfori