[00:13:09] FIRING: LVSHighRX: Excessive RX traffic on lvs6001:9100 (enp175s0f0np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs6001 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[00:13:09] FIRING: [8x] LVSHighCPU: The host lvs6001:9100 has at least its CPU 1 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs6001 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[00:18:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs6001:9100 (enp175s0f0np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs6001 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[00:18:09] RESOLVED: [8x] LVSHighCPU: The host lvs6001:9100 has at least its CPU 1 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs6001 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[11:01:29] FIRING: HAProxyRestarted: HAProxy server restarted on cp1107:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1107&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:01:49] uh
[11:02:04] fabfur: ^^is that you?
[11:06:29] FIRING: [2x] HAProxyRestarted: HAProxy server restarted on cp1107:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:11:29] FIRING: [6x] HAProxyRestarted: HAProxy server restarted on cp1107:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:16:29] FIRING: [7x] HAProxyRestarted: HAProxy server restarted on cp1107:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:21:17] nope
[11:21:29] RESOLVED: [7x] HAProxyRestarted: HAProxy server restarted on cp1107:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[11:36:44] ryankemper: thanks for spotting this. My bad. Interesting that it went unnoticed for so long, it highlights the issue with duplication (which is also covered in https://phabricator.wikimedia.org/T270071). I don't know if you saw, but Scott added the entries in netbox, so I think we are good now, just waiting for the next instance of this to strike,
[11:36:44] I guess.
[11:37:11] one of the threads on cp1107 crashed and took down the process, the call trace trips into a ha_panic
[11:37:31] apparently one of the stick table lookups
[11:42:12] moritzm: yeah, it's already identified
[11:42:19] ah, ok
[11:42:44] our puppetization produced an invalid haproxy config on some hosts
[11:42:53] one that isn't detected by the haproxy config check BTW
[11:43:33] as a result `filter bwlim-out limit-by-source key src table limit-by-source limit 32m` was missing on the impacted hosts
[11:44:31] ack
[11:44:41] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100423 fixes it (check PCC output for cp1107 in PS2 VS PS3)
[11:44:48] fabfur: ^^ can I get your review there plz?
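
For reference, the directive quoted above is HAProxy's bandwidth-limiting filter, which points at the `limit-by-source` stick table; `haproxy -c` only validates syntax, which is why a config that simply lost the filter line can still pass the check. A hedged fleet check along these lines would have surfaced the affected hosts (the cumin alias and config path here are assumptions):

    # Count occurrences of the filter in the rendered config on every cache host,
    # and run the syntax check for comparison.
    sudo cumin 'A:cp' "grep -c 'filter bwlim-out limit-by-source' /etc/haproxy/haproxy.cfg"
    sudo cumin 'A:cp' "haproxy -c -f /etc/haproxy/haproxy.cfg"
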
[11:45:19] checking
[11:48:31] yeah, filter now's everywhere
[11:48:35] thx
[12:34:09] Dear traffic, I will be replacing another kafka server today in eqiad
[12:35:59] vgutierrez: I have a question from idle curiosity if you get a free minute (in other words completely unimportant so feel free to ignore!)
[12:38:48] effie: ok
[13:02:01] topranks: shoot
[13:03:24] I was playing with a dashboard to show the cp host bandwidth
[13:03:30] https://grafana.wikimedia.org/goto/NhWJfFVNg
[13:04:14] in the aftermath of the saturation on cp1107 yesterday from WME
[13:04:33] today I see their scraping is ongoing, and mostly it was on cp1107 too
[13:05:01] but for ~20 mins it seemed to switch from cp1107 to cp1109
[13:05:28] I'm just curious - if you know - is what might make that change? I'm assuming here they didn't change what source IP they were coming from
[13:07:44] timestamp being 11:00 - 11:30?
[13:07:55] yeah around then
[13:13:45] oh today
[13:14:04] we had some issues with haproxy
[13:14:07] and cp1107 got depooled
[13:14:16] so obviously traffic jumped to other cp hosts
[13:17:46] ah cool - I should have checked SAL probably that makes sense
[13:17:49] thanks!
[13:22:22] topranks: btw, source for the dashboard should be thanos, then you can filter per DC using the site variable
[13:23:08] dashboard is pretty nice
[13:23:45] thanks... I've sometimes found targeting the prom. servers seems to perform better?
[13:23:58] but maybe I am hammering the active nodes pulling data and that's bad right?
[13:25:24] that's a question for godog and o11y folks :)
[13:27:21] yep I'll ask them as I'm doing more stuff in grafana now that we have some data from the routers on there
[13:27:23] cheers :)
[13:56:43] 06Traffic, 13Patch-For-Review: bring katran to liberica - https://phabricator.wikimedia.org/T380450#10379746 (10Vgutierrez) As noticed while working on https://gitlab.wikimedia.org/repos/sre/liberica/-/merge_requests/87, liberica needs to populate the `lru_mapping` eBPF map to avoid letting katran fallback to...
[14:32:36] 06Traffic, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10379903 (10ssingh) 05Open→03Resolved a:03ssingh
[15:36:49] vgutierrez: do you offhand know why upload-lb ATS has to fetch full objects to answer HEAD?
[15:39:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp1115:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqiad%20prometheus/ops&var-instance=cp1115 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:41:37] cdanis: hmmm could be related to the backend not providing a C-L header?
[15:41:49] so ATS fetches the object and calculates it?
[15:41:56] yeah I haven't even looked yet, but that was on my list
[15:41:59] but that's a wild guess
[15:42:03] was wondering if you knew offhand :)
[15:42:05] yeah
[15:42:12] I was also wondering about etags but I don't know if we do anything at all with that on upload
[15:42:25] effie: are you working on kafka ATM?
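
For a single-host PurgedHighEventLag like this one, the usual first stops are the daemon's own metrics endpoint (the :2112 target named in the alert) and its journal — a loose sketch, with the /metrics path and grep patterns as assumptions rather than known metric names:

    # on the affected cp host:
    curl -s http://localhost:2112/metrics | grep -i lag
    sudo journalctl -u purged --since '-1h' | grep -Ei 'error|timed out|FAIL'
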
[15:42:44] 07:34:09 < effie> Dear traffic, I will be replacing another kafka server today in eqiad
[15:42:47] ^
[15:42:51] vgutierrez: yes, I finished restarting them a few minutes ago
[15:43:30] hmm it looks like purged hasn't recovered on cp1115
[15:45:03] !log restarting purged on cp1115
[15:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:00] RESOLVED: [2x] PurgedHighEventLag: High event process lag with purged on cp1115:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqiad%20prometheus/ops&var-instance=cp1115 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[15:50:03] Dec 04 15:32:27 cp1115 purged[4177332]: %4|1733326347.316|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10021 ms without a successful response from the group coordinator (broker 1001, last error was Broker: Not coordinator): revoking assignment and rejoining group
[15:50:13] it looks like purged fails to capture that error from librdkafka
[15:52:41] on kafka.Error it should log "Fatal error: " or "Recoverable error: " --> https://gitlab.wikimedia.org/repos/sre/purged/-/blob/main/kafka.go?ref_type=heads#L190
[15:52:51] fwiw it doesn't seem like HEAD is fetching the full object
[15:53:00] see cp1111 log
[15:53:05] Dec 04 15:30:09 cp1111 purged[119080]: %3|1733326209.552|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1008.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1008.eqiad.wmnet:9093/1003: Connect to ipv4#10.64.32.45:9093 failed: Connection refused (after 0ms in state CONNECT)
[15:53:05] Dec 04 15:30:09 cp1111 purged[119080]: 2024/12/04 15:30:09 Recoverable error (code -195) while reading from kafka: ssl://kafka-main1008.eqiad.wmnet:9093/1003: Connect to ipv4#10.64.32.45:9093 failed: Connection refused (after 0ms in state CONNECT)
[15:53:29] so we got a kind of event that the manageEvent function is missing entirely :)
[15:53:58] fabfur: ^ perhaps you should pick this one up given he has a presentation upcoming
[15:54:05] sure
[15:54:06] sukhe: not urgent
[15:54:10] looking anyway
[15:54:16] good luck vg!
[15:54:45] happy to see it fixed after coming back from my PTO though ;P
[15:59:36] that's interesting
[15:59:46] log message on librdkafka is generated here: https://github.com/confluentinc/librdkafka/blob/cb8c19c43011b66c4b08b25e5150455a247e1ff3/src/rdkafka_cgrp.c#L5685
[16:00:33] but it sets the error to no error? rkcg->rkcg_last_heartbeat_err = RD_KAFKA_RESP_ERR_NO_ERROR;
[16:00:46] so we cannot detect that
[16:00:48] uh oh.. meeting starting :)
[16:01:19] vg: we can raise/lower the log level specifically for librdkafka, with a specific param, if needed
[16:01:26] but yeah, think about the pres : )
[16:02:58] fabfur: yeah... not like we can have a production instance with debug logging
[16:04:14] about logging that's the "easy way" that can be done with params passed to librdkafka, never intended to lower log level to debug :D
[16:05:50] yeah.. we need to reproduce that behavior in a controlled environment
[17:01:26] inflatador: sukhe: tiring out the dog now and then gonna quickly eat after. So ready to launch in 90’ (maybe bit less)
[17:02:14] ryankemper: I am OK with that but check with inflatador once, had some conversation with him in DM about doing it today
[17:06:24] ok, happy to do whenever you want
[17:11:42] ryankemper sukhe OK, let's forge ahead around 10:30 PT/12:30 CT/1:30 ET then. Will recreate invite
[18:31:24] 3 mins for me
[18:32:10] sukhe ryankemper I'm in https://meet.google.com/afq-irpv-gmb
[18:32:24] joining is optional, but I'll be available
[18:32:59] I am here on IRC since it's easier that way plus you both are doing all the work anyway
[18:33:37] sukhe ACK SGTM
[18:34:46] ryankemper I have a hard stop at noon PST /3 EST so I'm going to get started within the next 5m
[18:39:25] sukhe we are starting now, you can follow progress at https://etherpad.wikimedia.org/p/internal-graph-split-lvs
[18:39:47] thanks
[18:39:51] merging step 2 rn (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097541)
[18:44:27] sukhe do we need to `conftool-merge` after merging ^^ ? That's mentioned below the steps in https://wikitech.wikimedia.org/wiki/LVS#Etcd_as_a_backend_to_Pybal_(All_of_production) . Or maybe we just need to run puppet on config-master hosts?
[18:45:20] conftool-merge is invoked via puppet-merge
[18:45:22] the puppet merge step should have already run it for you.
[18:47:17] Yup we're good, new inactive entries were present. I just ran the commands to set them active and assign weights
[18:47:30] yep sounds good
[18:48:10] OK, and that makes the pybal config templates appear on config-master? We should see alerts if the data is invalid
[18:48:20] ?
[18:49:58] sukhe: I think we merge the A/PTR record DNS patch now correct? before doing step 4 where we add the service entries
[18:50:52] inflatador: not at this step I think, only when you add the service definition
[18:51:12] ryankemper: yes, merge and run authdns-update
[18:51:24] great, proceeding
[18:53:34] once you are done, confirm IPs for {wdqs-internal-main,wdqs-internal-scholarly}.svc.{eqiad,codfw}.wmnet
[18:56:00] possible you might get NXDOMAIN so if you do get it, sudo cookbook sre.dns.wipe-cache
[18:56:08] the TTL is fairly large so might as well do it
[18:56:21] long :]
[18:56:29] it's one hour I think
[18:56:48] sukhe: having a bit of trouble with the dig command args
[18:57:08] I naively did this from the docs but I suspect it's different args since we're doing a .svc and not the type of address in the docs
[18:57:10] https://www.irccloud.com/pastebin/usKZGKBD/
[18:57:25] the A/PTR records are there
[18:57:34] yeah ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 49681
[18:57:42] just a sec
[18:58:03] The following does give me a response with the IP:
[18:58:03] try now
[18:58:04] `for i in 0 1 2; do ns=ns${i}.wikimedia.org; echo $ns; dig +short @${ns} wdqs-internal-main.svc.codfw.wmnet; done`
[18:58:26] I think SRV needs an underscore and maybe tcp/udp?
[18:59:19] no, I think the cache needed to be cleared so it should work now (these are not proper SRV records)
[18:59:35] presumably I was testing the IP before and you might have been seeing that NXDOMAIN
[18:59:39] it works now for me but do confirm it
[19:00:19] sukhe@cumin1002:~$ dig wdqs-internal-main.svc.eqiad.wmnet +short
[19:00:19] 10.2.2.93
[19:00:32] https://phabricator.wikimedia.org/T379334#10380998 yeah everything looks good
[19:00:34] so just confirm these IPs match the ones in the service definition and feel free to proceed then
[19:00:49] ack, proceeding
[19:01:00] that's good...guessing we don't need to explicitly add SRV records based on the dns repo
[19:01:06] yep, no need
[19:04:06] Merged the service definition for step 4. Proceeding right to step 5 since step 4 is a no-op per the docs
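
A consolidated version of the record check above — the same dig loop, run for both new records in both sites; the answers should match the IPs in the service definition (10.2.2.93 for wdqs-internal-main in eqiad, per the paste above):

    for rec in wdqs-internal-main wdqs-internal-scholarly; do
      for dc in eqiad codfw; do
        for i in 0 1 2; do
          echo -n "${rec}.svc.${dc}.wmnet @ns${i}: "
          dig +short @ns${i}.wikimedia.org ${rec}.svc.${dc}.wmnet
        done
      done
    done
    # if the recursors still have the pre-merge NXDOMAIN cached:
    sudo cookbook sre.dns.wipe-cache   # as used above; check its --help for any record arguments
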
[19:04:20] Oh wait, actually we need to merge the envoy patch here
[19:04:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097542/
[19:04:32] yeah, get the backends ready
[19:12:27] sukhe: the puppet run didn't show any change, and I don't really see any differences between `wdqs2018` (where puppet was run) and `wdqs2019`
[19:12:42] I wonder if it needs the service catalog to be further along than just `service_setup` maybe
[19:13:12] 10netops, 10Cloud-Services, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10381052 (10VRiley-WMF) Understood, I will close this this and ask for a replacement!
[19:13:24] 10netops, 10Cloud-Services, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10381053 (10VRiley-WMF) 05Open→03Resolved
[19:14:12] yeah that's expected. see a PCC run later down https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094069 which I just ran
[19:14:17] https://puppet-compiler.wmflabs.org/output/1094069/4639/wdqs2018.codfw.wmnet/index.html
[19:14:31] full diff https://puppet-compiler.wmflabs.org/output/1094069/4639/wdqs2018.codfw.wmnet/fulldiff.html
[19:15:27] makes sense. proceeding to step 5
[19:15:31] just a sec
[19:15:34] checking the confd alerts
[19:16:55] ack
[19:17:00] I'd hit +2 but haven't merged yet
[19:17:11] sukhe: maybe it's https://wikitech.wikimedia.org/wiki/Confd#Stale_template_error_files_present
[19:17:29] if an error happens once then it can happen that those .err files just stick around
[19:17:38] checking confd logs
[19:17:40] and monitoring doesnt resolve until they are removed
[19:18:09] it's a common thing when adding new services, so I wanted to mention it
[19:18:21] `19:17:44Z config-master1001 /usr/bin/confd[835480]: ERROR 100: Key not found (/conftool/v1/pools/eqiad/wdqs-internal-main/wdqs) [3380547]`
[19:18:25] mutante: no it's actually missing keys
[19:18:26] /conftool/v1/pools/eqiad/wdqs-internal-scholarly/wdqs
[19:19:41] sukhe That means https://gerrit.wikimedia.org/r/c/operations/puppet/+/1097541/6 needs to be fixed?
[19:19:45] I think our patches expected it to look like `/pools/codfw/wdqs-internal-main/wdqs-main/` and `/pools/codfw/wdqs-internal-scholarly/wdqs-scholarly/`
[19:20:02] like conftool-data/node/eqiad.yaml is missing something
[19:20:28] so
[19:20:29] + wdqs-internal-main:
[19:20:29] + wdqs1026.eqiad.wmnet: [wdqs-main]
[19:20:54] this means: wdqs1026 is part of the wdqs-internal-main cluster with service wdqs-main
[19:20:58] does that add up for what you want?
[19:21:12] Yeah, cause we're trying to make it look like how public wdqs-main does
[19:21:28] https://www.irccloud.com/pastebin/RiiHq3Xk/
[19:21:54] but I'm a bit confused where on the template side of things the configmaster is deciding it should be `wdqs-internal-main/wdqs` and not `wdqs-internal-main/wdqs-main`
[19:21:56] but you are calling the service wdqs-main though and not internal-main. the [ ] part
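
On the template side, the answer to that question lives on config-master: each pybal pool gets a small confd template resource under /etc/confd/conf.d/, and its `keys` entry is the etcd prefix confd watches. A hedged sketch — the toml file name is the one that shows up in the puppet log further down, and the stale-.err path is an assumption per the wikitech page linked above:

    # on config-master1001:
    ls /etc/confd/conf.d/ | grep wdqs-internal
    grep -A3 'keys' /etc/confd/conf.d/_srv_config-master_pybal_codfw_wdqs-internal-scholarly.toml
    # leftover error markers from earlier failed renders keep the alert firing:
    ls /var/run/confd-template/*.err 2>/dev/null
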
[19:22:26] ryankemper: it's looking up this
[19:22:27] keys = [
[19:22:27] "/pools/codfw/wdqs-internal-scholarly/wdqs/",
[19:22:27] ]
[19:22:56] yeah my question is how does it know to look up that path
[19:23:14] specifically what tells it to look for service `wdqs` as opposed to service `wdqs-scholarly` (or main etc)
[19:23:15] it forms that from the layout in conftool-data/node/eqiad
[19:23:25] so if you look at what you have now:
[19:23:30] wdqs-internal-main:
[19:23:31] wdqs1026.eqiad.wmnet: [wdqs-main]
[19:23:31] wdqs-internal-scholarly:
[19:23:32] wdqs1027.eqiad.wmnet: [wdqs-scholarly]
[19:23:52] the etcd layout of this will be:
[19:23:52] sukhe@cumin2002:~$ etcdctl -C https://conf1007.eqiad.wmnet:4001 ls /conftool/v1/pools/eqiad/wdqs-internal-scholarly/wdqs-scholarly/
[19:23:55] /conftool/v1/pools/eqiad/wdqs-internal-scholarly/wdqs-scholarly/wdqs1027.eqiad.wmnet
[19:24:11] basically: /conftool/v1/pools/<dc>/<cluster>/<service>/<host>
[19:24:19] right and that's what i expect
[19:24:36] but i feel like the configmaster is telling me it wants e.g. `pools/codfw/wdqs-internal-main/wdqs/`
[19:24:52] here is a diff between known-good internal template and broken internal-scholarly template if it helps https://phabricator.wikimedia.org/P71557
[19:24:53] but if it's reading our config I assume it should be looking for `pools/codfw/wdqs-internal-main/wdqs-main/`
[19:25:26] ah
[19:25:40] so we are confusing it with cluster name and service names
[19:25:51] because below it you have:
[19:25:52] wdqs-scholarly:
[19:25:52] wdqs2023.codfw.wmnet: [wdqs-scholarly]
[19:25:53] wdqs2024.codfw.wmnet: [wdqs-scholarly]
[19:27:08] I think the scholarly pool is what we want to copy. New diff: https://phabricator.wikimedia.org/P71557#286710
[19:28:37] sukhe: it must be parsing something besides the conftool-data file because otherwise it should have the right service key
[19:28:39] `/pools/codfw/wdqs-internal-scholarly/wdqs/%s` <- should be `/pools/codfw/wdqs-internal-scholarly/wdqs-scholarly/` . Next question is where do we source this data from?
[19:29:00] ryankemper: I wonder if this is cruft from the previous run when we tried and it failed?
[19:29:45] I kinda doubt it, since previous alerts went away when we rolled back
[19:30:09] inflatador: do you have the patches handy by any chance?
[19:30:20] the specific one in which we added conftool data?
[19:30:38] sukhe looking
[19:30:58] sukhe: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1088383/15/conftool-data/node/codfw.yaml so it can't be cruft bc it looks the same
[19:31:43] sukhe this is the one we had to roll back https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094548
[19:31:46] yeah
[19:32:45] oh maybe there were manual steps done besides the patch itself
[19:32:53] it might be a mistake from the manual conftool additions? As opposed to anything in Puppet?
[19:33:13] the right keys exist in etcd at least
[19:33:17] in the format we expect it to
[19:33:37] sukhe@cumin2002:~$ etcdctl -C https://conf1007.eqiad.wmnet:4001 get /conftool/v1/pools/eqiad/wdqs-internal-scholarly/wdqs-scholarly/wdqs1027.eqiad.wmnet
[19:33:38] are there incorrect keys in etcd too maybe?
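
To answer that last question directly: list what conftool actually wrote under the new pools and compare it with the prefix the template watches. The etcdctl endpoint is the one used above; the confctl selector syntax is an assumption:

    etcdctl -C https://conf1007.eqiad.wmnet:4001 ls --recursive /conftool/v1/pools/eqiad/wdqs-internal-main
    etcdctl -C https://conf1007.eqiad.wmnet:4001 ls --recursive /conftool/v1/pools/eqiad/wdqs-internal-scholarly
    # the same view through conftool itself:
    sudo confctl select 'dc=eqiad,cluster=wdqs-internal-main,service=wdqs-main' get
    sudo confctl select 'dc=eqiad,cluster=wdqs-internal-scholarly,service=wdqs-scholarly' get
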
[19:33:40] {"weight": 10, "pooled": "yes"} [19:33:56] that matches the above, cluster internal-scholarly, service wdqs-scholarly [19:35:10] my best guess is that the template file doesn't come from it parsing conftool-data but some other source [19:35:30] because i checked the SAL for manual commands around the time that original patch was merged and there's nothing that seems like it would set the service as `wdqs`: [19:35:31] https://sal.toolforge.org/production?p=0&q=conftool&d=2024-11-22 [19:37:35] trying a few things [19:38:08] sukhe: I figured it out, it's service.yaml [19:38:14] https://www.irccloud.com/pastebin/C5hCDdW5/ [19:38:21] aaaa [19:38:22] yes [19:38:24] patching now [19:38:30] should be wdqs-scholarly! [19:38:43] don't forget wdqs-internal-main too [19:39:01] that should be wdqs-main and not wdqs [19:39:16] nice find, this explains why the keys were in etcd but the lookup key path was wrong [19:42:59] ryankemper: just do a new patch separate from that chain and rebase all other patches on top of that [19:43:05] ok just saw [19:44:50] sukhe: I was just going to rebase the remaining patch chain onto `origin/production` after merging this hotfix, but if I just rebase directly onto this hotfix patch pre-merge would that preserve the chain? [19:45:06] I think regardless this chain is going to be split into the patches we've merged thus far vs the ones already done but not sure [19:45:07] ryankemper: no idea :] [19:45:15] yeah that's OK I think [19:45:18] that makes two of us! ok [19:45:20] we have a fairly good idea on the order [19:45:57] yeah part of me was just hoping if this process went 100% smoothly I could link the patchset in the lvs docs for future travellers, but maybe liberica will remove the need for that anyway [19:46:01] okay rebasing now [19:46:06] let me know once you merge this one so that we can check config-master [19:46:13] then we can check the rest of the stuff [19:46:59] ryankemper: noble cause but I think we have made lots of edits and that will continue. for example I will add a note about what we saw today [19:47:20] * inflatador wonders if we could add the conftool json schema to CI and catch those errors earlier [19:48:32] inflatador: it's tricky because they are distinct steps. conftool-data doesn't enforce how you layout stuff (beyond a few base things) so it doesn't know if abc is a service or cluster; it infers that from how you lay it out [19:48:53] but if you meant CI based on that data for the actual definition, then yeah, that's a possiblity [19:49:41] CI could check that the service definition has a corresponding `conftool-data` entry, it just obv won't be able to check that the original conftool pool command is run correctly [19:50:01] looks good now! [19:50:01] 2024-12-04T19:49:40.966518+00:00 config-master1001 puppet-agent[841376]: (/Stage[main]/Pybal::Web/Pybal::Conf_file[/srv/config-master/pybal/codfw/wdqs-internal-scholarly]/Confd::File[/srv/config-master/pybal/codfw/wdqs-internal-scholarly]/File[/etc/confd/conf.d/_srv_config-master_pybal_codfw_wdqs-internal-scholarly.toml]) Scheduling refresh of Service[confd] [19:50:04] so in this case it could have failed the service.yaml patch by saying there's no `conftool-data` entry for e.g. 
[19:50:06] Dec 04 19:49:44 config-master1001 confd[842730]: 2024-12-04T19:49:44Z config-master1001 /usr/bin/confd[842730]: INFO Target config /srv/config-master/pybal/eqiad/wdqs-internal-scholarly has been updated
[19:50:09] {◕ ◡ ◕}
[19:50:21] but there's also always a million things that could/should be in CI so prioritizing is tricky
[19:50:24] ryankemper: yeah, I see what you and bking mean
[19:50:25] yep
[19:50:34] sukhe: inflatador: cool so I think we can head on to step 5 now?
[19:50:38] yes
[19:50:50] just a sec
[19:51:00] afterwards, I'd be happy to guide someone through making, and then review, such a patch
[19:51:02] maybe let's force a puppet run
[19:51:56] cdanis: yeah it might be helpful. one could argue it's a trivial mistake to not copy the service/cluster part but then again, it can happen and did
[19:52:15] ryankemper: envoy patch done?
[19:52:19] that's where CI shines, saving us from ourselves :D
[19:52:29] We used traefik at a previous job and we did use its json schema to validate our LB config
[19:52:48] sukhe: envoy patch done. altho seemed to be a no-op at this stage
[19:53:17] that's interesting. I don't have much understanding of that bit so I leave that to you
[19:53:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094069/4 is next
[19:53:31] er 7
[19:53:41] I'll probably take cdanis up on their offer during code freeze if they have time ;)
[19:54:24] ryankemper: merge it please
[19:54:42] and ensure clean puppet run on affected WDQS hosts
[19:54:57] ensuring that might save us some pain later
[19:56:12] yup, starting with wdqs2018 and will report back here
[19:58:18] https://puppetboard.wikimedia.org/report/wdqs2018.codfw.wmnet/3e7162299bdffcb8a934646124823a7d6e8ef1f9
[19:58:22] looks good IMO but do check
[19:58:56] we're running against 1026/1027 (eqiad hosts) now
[19:59:43] Looks like we got a clean run on 1026
[20:00:30] nice
[20:00:52] let's do lvs_setup and the restarts (disable puppet first)
[20:01:27] sukhe ryankemper I have to depart, I leave things in y'all's capable hands
[20:01:42] inflatador: thanks and no worries, we will carry it through. should be all smooth sailing.
[20:03:50] back
[20:03:55] alright running on rest of hosts
[20:04:19] and then can proceed to lvs_setup patch after
[20:04:28] I can take care of rolling that bit out
[20:04:33] you can do lvs_setup
[20:04:49] by that bit you mean the pybal restarts? or which
[20:04:54] sure
[20:05:01] I meant the agent run on the WDQS hosts
[20:05:09] but you can do that and I can do the pybal restarts
[20:05:13] so doing that (pybal)
[20:05:18] oh, those are ongoing rn
[20:05:22] I'm happy to drive the pybal
[20:05:26] ok
[20:05:27] * ryankemper has been trying to get one of those shirts for ages
[20:05:50] (have a hard stop at 45 and hence I thought we can parallelize)
[20:05:57] ryankemper: budget cuts, you get a sticker
[20:06:03] no more t-shirts
[20:06:25] ironically our wdqs server costs are probably why we had to make that cut
[20:06:28] ok merging
[20:08:22] definitely use the aliases for the restarts. if you are doing codfw first, A:lvs-secondary-codfw and then A:lvs-low-traffic-codfw
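
The rollout shape being described here, pulled together in one place — the cookbook invocation and pool check are the ones used below; the puppet disable/enable helpers and exact flags are assumptions:

    sudo cumin 'A:lvs' "disable-puppet 'rolling out new WDQS services'"   # fleet-wide first
    sudo cumin 'lvs2014.codfw.wmnet' "enable-puppet 'rolling out new WDQS services' && run-puppet-agent"
    sudo cookbook sre.loadbalancer.restart-pybal --query 'A:lvs-secondary-codfw' --reason "rolling out new WDQS services"
    # verify on the restarted LVS before moving to the next alias:
    curl -s localhost:9090/pools/wdqs-internal-main_80
    sudo ipvsadm -L -n | grep 10.2.
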
[20:08:36] then A:lvs-secondary-eqiad and A:lvs-low-traffic-eqiad
[20:08:59] since this is low-traffic, disabling on just eqiad/codfw is fine fwiw but no worries in doing on all (as you did)
[20:09:58] and let me know which one you hit first so I can check
[20:10:58] 06Traffic: ncmonitor should have a dry-run option - https://phabricator.wikimedia.org/T380287#10381236 (10BCornwall) 05Open→03In progress
[20:11:16] ack
[20:11:22] sorry got held up a sec but merging now
[20:12:54] sukhe: should I re-enable puppet on lvs* and then run puppet on just A:lvs-secondary-codfw and A:lvs-low-traffic-codfw
[20:13:01] or should i just run puppet everywhere and then use those aliases for the restarts
[20:13:08] probably option #2?
[20:13:13] no, just enable Puppet on *one of these*
[20:13:19] ok
[20:13:19] and run agent and restart pybal
[20:13:47] (puppet does not restart pybal for us thankfully but there is on reason to roll it out everywhere and hence the caution in this step)
[20:13:54] s/on/no
[20:13:55] will do codfw first then eqiad after
[20:14:16] sudo cookbook sre.loadbalancer.restart-pybal A:lvs-secondary-codfw --reason "rolling out new WDQS services"
[20:14:16] so first up `A:lvs-secondary-codfw`
[20:14:36] or simply restart pybal on lvs2014.codfw.wmnet and !log
[20:14:42] either is fine
[20:16:03] https://www.irccloud.com/pastebin/kCUA9Wtv/
[20:16:12] sukhe: how's this look for order of operations
[20:16:21] well i'll throw a grep on the ipvsadm but you get the point
[20:16:28] looks OK. you can skip the last step and we can check the pools
[20:16:38] looks good for first two
[20:16:47] kk proceeding in operations
[20:19:21] https://www.irccloud.com/pastebin/JJWLbZXG/
[20:19:37] sorry, restart_daemons in the end
[20:19:51] sukhe: does it want `--alias` preceding the `A:` as well?
[20:20:15] yeah --query "A:lvs-secondary"
[20:20:20] the usual cookbook one
[20:20:36] let me try in eqiad sorry
[20:21:25] oops I already pressed the button but it looks good
[20:21:29] ok great
[20:21:32] let's check the pools
[20:21:55] sukhe@lvs2014:~$ curl localhost:9090/pools/wdqs-internal-scholarly_80
[20:21:55] wdqs2027.codfw.wmnet: enabled/up/pooled
[20:21:56] wdqs2026.codfw.wmnet: enabled/up/pooled
[20:22:24] sukhe@lvs2014:~$ curl localhost:9090/pools/wdqs-internal-main_80
[20:22:24] wdqs2018.codfw.wmnet: enabled/up/pooled
[20:22:24] wdqs2020.codfw.wmnet: enabled/up/pooled
[20:22:24] wdqs2019.codfw.wmnet: enabled/up/pooled
[20:23:26] ipvsadm looks happy as well, I see both `.93` and `.94`
[20:23:37] yep
[20:23:39] sukhe: onto `A:lvs-low-traffic-codfw`?
[20:23:42] yes please
[20:23:45] I am taking care of eqiad
[20:24:20] ack
[20:26:48] you have one host each for eqiad
[20:26:52] I think you know it, but just reminding that
[20:27:18] yup
[20:27:39] we have one more host we're bringing into `wdqs-internal-main` tomorrow so that one will be up to 2 but we'll be on 1 for scholarly until the cutover
[20:27:50] ok
[20:29:10] doing eqiad main now
[20:30:45] sukhe: codfw completely done, altho I see `wdqs2026.codfw.wmnet: enabled/down/not pooled`. every other host is up, investigating
[20:31:08] was weird? it was pooled earlier when I checked (see above)
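
A quick way to triage a single backend stuck in enabled/down/not pooled, per what follows: compare what each LVS reports for the pool, check the pybal journal for the failing probe, and hit the probe URL directly. Host and pool names are the ones from this rollout; 9090 is pybal's instrumentation port used above:

    for lvs in lvs2013.codfw.wmnet lvs2014.codfw.wmnet; do
      echo "== ${lvs}"
      ssh "${lvs}" curl -s localhost:9090/pools/wdqs-internal-scholarly_80
    done
    # on the LVS that marks it down:
    sudo journalctl -u pybal --since '-30min' | grep wdqs2026
    # and from the same LVS, try the probe target directly:
    curl -sk -o /dev/null -w '%{http_code}\n' http://wdqs2026.codfw.wmnet/readiness-probe
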
[20:32:13] sukhe: yes it's up on the secondary but down on the low-traffic...not sure why
[20:33:09] must just be failing probe
[20:33:09] Dec 04 20:32:40 lvs2013 pybal[1590965]: [wdqs-internal-scholarly_80 ProxyFetch] WARN: wdqs2026.codfw.wmnet (enabled/down/not pooled): Fetch failed (http://localhost/readiness-probe), 3.055 s
[20:33:28] but why from 2013 then and not 2014
[20:33:48] yeah that's what I don't get
[20:34:29] true
[20:34:33] and 200. that's what we want right?
[20:34:56] yup
[20:34:57] I mean 200 is all what we care about here anyway
[20:35:11] ran test queries on the host to be sure and it's responding fine
[20:35:20] sukhe@lvs2013:~$ curl -k http://10.192.11.11/readiness-probe
[20:35:20] curl: (7) Failed to connect to 10.192.11.11 port 80: No route to host
[20:35:25] that's a fun one
[20:35:49] sukhe@lvs2013:~$ ping 10.192.11.11
[20:35:49] PING 10.192.11.11 (10.192.11.11) 56(84) bytes of data.
[20:35:50] From 10.192.11.4 icmp_seq=1 Destination Host Unreachable
[20:35:52] was about to say it feels like some genuine network issue
[20:35:54] yep
[20:36:09] topranks: hello :D
[20:36:28] 06Traffic: ncmonitor should have a dry-run option - https://phabricator.wikimedia.org/T380287#10381339 (10BCornwall) 05In progress→03Resolved
[20:37:20] lack of arp more than likely
[20:37:36] yeah but how/why
[20:37:47] the interface is down
[20:37:51] cmooney@lvs2013:~$ ip route get fibmatch 10.192.11.11
[20:37:51] 10.192.11.0/24 dev vlan2029 proto kernel scope link src 10.192.11.4 linkdown
[20:38:19] https://www.irccloud.com/pastebin/lWIGB5xg/
[20:38:23] wow.
[20:38:28] why is a good question
[20:39:09] it's been reconfigured wrong I think??
[20:39:37] scrap that - I'm looking at old data in netbox-dev instance
[20:39:47] (i was working in that earlier had it open)
[20:39:52] ok
[20:40:17] stepping out for 5, sorry. nrb
[20:40:19] brb
[20:41:01] no wait hold on....
[20:41:10] everything should be on the primary interface
[20:44:10] there is a typo on this line in gerrit
[20:44:11] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/lvs/interfaces.yaml#288
[20:44:35] ^^ seems I got it wrong and (to shift blame away) sukhe didn't catch my deliberate error :(
[20:45:57] :P
[20:46:28] I gather it needs `s/np1/np0`?
[20:49:09] it's a different NIC port so the bit before that is different also
[20:49:23] oh right
[20:49:58] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1100540
[20:50:24] basically all the vlan ints use the host's single top-of-rack uplink in codfw now
[20:50:38] wow
[20:51:13] soooo
[20:51:17] Jan?!
[20:53:13] yeah
[20:53:24] I mean maybe nobody put anything behind an LVS on the new vlan until now
[20:53:30] (not sure what got you guys to notice this)
[20:54:14] well we added a new service and noticed lvs was not able to connect to it
[20:54:18] and then we went on from there
[20:54:52] ryankemper: I think in the meantime
[20:54:57] you can move ahead with lvs_setup to production
[20:55:02] and the puppet run A:dnsbox
[20:55:08] eqiad looks good so it's fine to do that IMO
[20:55:27] topranks: should we have alerts for situations like these I wonder?
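
For the record, this is roughly what the broken state looks like from the LVS itself — the first two commands are the ones used in this investigation, the sysfs read is an extra way to see the parent NIC has no link (interface names are the ones from this incident, and whether carrier reads cleanly depends on the admin state of the port):

    ip route get fibmatch 10.192.11.11      # route via vlan2029 is flagged "linkdown"
    ip -br link show vlan2029               # the vlan itself still reports UP
    cat /sys/class/net/eno12409np1/carrier  # parent NIC: 0 = no link
    # any vlan sub-interface not riding the single ToR uplink is suspect in codfw:
    ip -br link show | egrep '^vlan' | grep -v eno12399np0
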
[20:55:31] patch is merged now
[20:55:44] running agent
[20:56:17] couple min break then merging next patch
[20:56:37] it might need a reboot to do it properly
[20:56:49] or we can just fix it with a few 'ip' commands to avoid the disruption
[20:57:31] either is fine
[20:58:33] I think let's do a reboot
[20:58:42] so that it's consistent in that regard
[20:58:45] I can start draining
[20:59:11] ok cool
[20:59:16] btw this wasn't January
[20:59:34] sukhe: will the reboot impact the switch of service state to prod or shall I proceed
[20:59:47] ryankemper: yeah let's wait a bit just to be safe
[20:59:50] kk
[20:59:54] - vlan-raw-device eno12409np1
[20:59:54] + vlan-raw-device eno12399np0
[21:00:12] ok I am draining lvs2013
[21:01:52] ryankemper: actually this step will only affect the DNS hosets
[21:01:53] *hosts
[21:01:59] I think you should move ahead with the step and run agent there
[21:02:05] after changing to production that is
[21:04:39] ok
[21:09:20] ryankemper: https://puppetboard.wikimedia.org/report/dns1004.wikimedia.org/bf051821c60761008be28283d50220a605208a51
[21:09:23] looks good
[21:09:49] paging you are not doing today I think? (I guess with one host probably doesn't make sense)
[21:09:58] that means we can do the last step
[21:10:06] fwiw netbox confused me when I looked... I've opened a task for dc ops to fix it up
[21:10:06] https://phabricator.wikimedia.org/T381533
[21:10:09] and since lvs2014 is low-traffic, you should be fine to move ahead
[21:10:15] yes we are not making this service page
[21:10:34] cool puppet runs on remaining dns hosts are ongoing now so can do the final dns patch merge in a couple mins
[21:11:18] sukhe: it was a patch at the end of August we missed it on
[21:11:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056563
[21:11:29] ^^ should have changed here but wasn't :(
[21:11:47] it is also 100% on me for not noticing when the migration was done the link was down
[21:12:16] Not sure how I didn't spot that or check properly... sorry
[21:12:31] na, reviewer I also missed that so it's on me too
[21:12:44] topranks: question I have is
[21:12:52] yeah, though I'd say it's easy to miss one making the patch, and easy to miss as a reviewer
[21:13:06] when doing the migration it's pretty clear if a link is down, I really should have spotted that on the night
[21:13:12] till we have to live with the current LVS network setup
[21:13:27] how do we detect this and alert on it? because unless I am mistaken, this happened in the past too?
[21:14:29] I think this is the only one I messed up in this way.... but sure it's almost certain that we've made errors in that interfaces.yaml file before
[21:14:38] any I know of were generally spotted right away
[21:15:22] topranks: if you recall during magru, we had a link down as well that we didn't have any awareness and only discovered when ssh to that lvs was failing from bastion7001
[21:15:29] the issues I know about that were present for an extended time were when the switch-side config was not matching what was configured on the LVS (i.e. vlans missing)
[21:15:44] yes we had that in magru, but corrected quickly
[21:15:45] so I wonder how do we cover those cases
[21:16:15] I literally only discovered during the magru work that this scenario is possible... and tbh I find it mind-boggling
[21:16:21] and I think I have made a mistake in interfaces.yaml too
[21:16:39] i.e. that on Linux a sub-interface of a physical one will get an attribute "linkdown" associated if the parent physical is down
[21:17:03] But it won't actually bring the sub-int down, it'll consider it "UP" and put the IPs in the routing table etc
[21:17:09] which is like madness to me
[21:17:09] link down but still puts it in the routing table?
[21:17:16] yeah
[21:17:17] the parent link down
[21:17:39] but yeah, on any network platform that doesn't happen, if the port goes down the sub-int does too
[21:17:54] anyway that's only a complication, we can get a link down and a problem for the LVS regardless
[21:18:16] as it won't have L2 adjacency to the required vlan
[21:18:35] :}
[21:18:59] but the quirk means regular routed traffic breaks too, as it tries to use the down int (it should treat that as down and use its default route - I'll mention it to Linus over Christmas dinner)
[21:19:17] lol
[21:19:28] topranks: seems like you have a kernel patch in your future.
[21:19:53] ryankemper: let me know when the last patch is merged, we can try the dns records
[21:21:46] topranks: rebooting lvs2013
[21:22:04] sukhe: as to how we spot this
[21:22:18] Brian did some work on a potential Icinga check that would catch this kind of thing
[21:22:18] https://gitlab.wikimedia.org/repos/search-platform/sre/lvs_l2_checker/-/blob/main/lvs_l2_checker.py?ref_type=heads
[21:23:14] ohhhh yeah
[21:23:18] https://phabricator.wikimedia.org/T363702
[21:23:22] sukhe: authdns update done so we can check dns now
[21:24:41] sukhe: a check for an interface in a "down" state would be much easier I think though (this specific issue)
[21:24:59] yeah we can just alert on that. I will think about it later from at least the LVS side. I am sure we have some metrics around that
[21:25:02] I think the approach that was being taken in the above task would cover the scenario we had today / in magru last week though
[21:25:05] topranks: lvs2013 is back up
[21:25:13] I will wait for you to check things
[21:25:34] cmooney@lvs2013:~$ ip -br link show | egrep ^vlan | wc -l
[21:25:34] 36
[21:25:34] cmooney@lvs2013:~$ ip -br link show | egrep ^vlan | grep eno12399np0 | wc -l
[21:25:34] 36
[21:25:56] yeah it looks good
[21:26:30] https://www.irccloud.com/pastebin/nvA44li6/
[21:26:37] awesome
[21:26:49] ryankemper: DNS looks good, cleared caches for discovery records
[21:28:18] excellent proceeding to the final step of the rollout
[21:28:22] * ryankemper cracks open nearest IPA
[21:28:37] sukhe: thanks for all your help
[21:28:40] save one for topranks too
[21:29:41] haha thanks guys... Atlanta we can make good on that : )
[21:29:50] out of IPAs but I'm still stocked up on stouts
[21:29:53] btw sukhe I think this should be a help for detecting this
[21:29:57] but what do the irish know about stouts? /s
[21:30:03] perfect!
[21:30:21] sukhe: these stats are probably what you need
[21:30:22] https://w.wiki/CJMP
[21:30:37] yep perfect
[21:30:42] without a real example I'm not sure what the status will show as for the vlan int though
[21:30:45] Traffic will take care of the alerting on that
[21:30:48] physical will definitely show down
[21:31:02] topranks: yes and maybe there are other metrics or can be added worse case
[21:31:09] certainly better than us discovering it like this IMO
[21:31:19] till whatever time we have to live with this configuration and LVS :)
[21:31:28] yeah if the vlan "always shows up" (as the kernel seems to do), then you'd need to work out what physical the vlan was on, and check for that
[21:31:58] an alternate way may be to take all the physical links from the interfaces.yaml, and check node_network_carrier for those
[21:32:08] ryankemper: it looks good to me unless there is something else missing
[21:32:11] the other good way to see is if there is an IPv6 link-local address on the interface
[21:32:36] as that only gets added when the int gets a router-advertisement in, which means packets are going in and out ok
[21:33:11] topranks: ok, I will revisit all this later with a fresh-er mind and someone in Traffic will add you for the review when I am gone
[21:33:58] thanks for stepping in so quickly :)
[21:34:16] yup I think we're good here, on our end we'll need to do some more testing and iron out the plan for the cutover on the mediawiki side but hopefully we'll be ready for that starting next week
[21:34:27] ok
[21:34:32] np... probably best way to check if we downtime a backup lvs, manually shut one of the physical links, and see what that prometheus metric shows for the vlan
[21:34:34] practically that means expect us to tear down wdqs-internal from lvs early next week if we don't hit any big roadblocks
[21:34:38] if it changes very easy to make an alert on it
[21:35:30] topranks: yep
[21:35:57] ryankemper: please ping fabfu.r or bret.t from Traffic as I am out till January starting next week and they should help you
[21:36:32] but I think this is all resolved and I need to head out to clean the snow so see you later :)
[21:36:52] snow! wow nice
[21:37:03] (probably not for you lol)
[21:37:18] topranks: 10 cm
[21:37:48] could be worse :)
[21:56:18] back. Sounds like y'all succeeded
[21:58:36] thanks to s-ukhe and everyone else who helped!
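
Putting the two suggestions above together, roughly what a per-LVS check could look like — the physical interface names would come from interfaces.yaml (hard-coded here from this incident), and 9100 is the node_exporter port seen in the alerts at the top of this log:

    # carrier metric per physical uplink; 0 means the parent NIC has no link
    for dev in eno12399np0 eno12409np1; do
      val=$(curl -s localhost:9100/metrics | grep -F "node_network_carrier{device=\"${dev}\"}" | awk '{print $NF}')
      [ "${val}" = "0" ] && echo "no carrier on ${dev}"
    done
    # a vlan that is actually passing traffic will have picked up an IPv6
    # link-local address from router advertisements:
    ip -6 addr show dev vlan2029 scope link | grep -q inet6 || echo "vlan2029: no link-local address"
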