[09:38:51] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095106 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[09:45:09] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095118 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[09:52:27] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095141 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[10:14:29] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095259 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[10:24:58] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095281 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[10:29:56] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[10:41:30] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095368 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[11:01:21] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095415 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[12:01:03] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095590 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2292 to...
[12:01:51] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095592 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host...
[12:56:28] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10095765 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki...
[13:23:29] sukhe brett ryankemper looks like we accidentally deployed the wdqs services with `encryption: false` yesterday ;( https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml#3133
[13:23:52] which means we'll need to fix that and restart pybal again
[13:24:06] Sorry for the trouble ;(
[13:25:59] inflatador: my understanding was this was intentional (for some reason) since all other wdqs services also have it disabled?
[13:26:06] but yeah, no worries if you want to fix it
[13:27:47] sukhe yeah, the legacy cruft makes things confusing...there are public HTTP endpoints in that config that probably don't work anymore due to HTTPS redirects, plus an internal-only endpoint that is HTTP. We should really clean all that stuff ;(
[13:30:45] We've had T325602 open for awhile, probably should do it soon ;(
[13:30:46] T325602: Decide whether or not to keep wdqs-heavy-queries and wdqs-ssl PyBal pools - https://phabricator.wikimedia.org/T325602
[14:00:07] inflatador: sorry was in a meeting but yeah let us know what you prefer and no issues on doing restarts or anything
[14:59:13] Traffic, conftool, Epic: Create simple web view of requestctl status - https://phabricator.wikimedia.org/T371782#10096466 (Joe) Open→In progress a:Joe
[14:59:48] sukhe ACK, in meeting for next hr but should be ready to do it any time after that
[15:00:05] inflatador: same here, Traffic meeting, so happy to do after
[15:01:33] Traffic, conftool, Epic: Extract an api class for requestctl - https://phabricator.wikimedia.org/T373449 (Joe) NEW
[15:02:20] Traffic, conftool, Epic: Extract an api class for requestctl - https://phabricator.wikimedia.org/T373449#10096493 (Joe)
[15:06:07] Traffic, conftool, Epic: Extract an api class for requestctl - https://phabricator.wikimedia.org/T373449#10096506 (Joe) Open→In progress p:Triage→High
[15:20:25] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096550 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from kubernetes20...
[15:22:59] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096557 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik...
[15:43:23] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096619 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2293 to...
[15:44:01] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096620 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host...
[16:05:39] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10096726 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub...
[16:20:18] sukhe just an update, ryankemper and I are doing additional testing around the encryption theory
[16:31:01] sukhe are you aware of any logstash dashboards or other ways we could trace the traffic as it goes thru ATS to pybal?
[16:38:13] sukhe nm, think we got it
[16:44:45] inflatador: sorry was away for lunch
[16:55:21] sukhe np, looks like you saw our patch. We looked at the mappings and ATS is definitely mapping to an HTTPS endpoint that doesn't yet exist in pybal
[16:56:01] can test with `curl -v -IL https://wdqs-main.discovery.wmnet/sparql` from cumin...fast failure
[16:58:40] anyway, hit us up when you're done reviewing, we can restart pybal after that if that works for you
[16:59:07] run thru https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers that is
[16:59:10] sukhe: assuming https://gerrit.wikimedia.org/r/c/operations/puppet/+/1067383 looks good to you, can we just do an lvs rolling restart directly or do we need to move service.yaml back to lvs_setup (from production) first?
[17:00:02] looking
[17:01:11] checking the mapping
[17:04:42] ryankemper: I think a simple restart should be sufficient; disable Puppet on A:lvs and (A:eqiad or A:codfw) and then enable on one secondary and try it first
[17:04:54] we might have some cleanup to do but I *think* that should be it
[17:05:03] (we will worry about that later)
[17:05:32] sukhe: cool, I can get started in a couple minutes
[17:09:15] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10097092 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki...
[17:10:15] let me know where you merge first
[17:10:24] (which site for the secondary basically)
[17:12:08] sukhe: eqiad first, doing it rn
[17:12:49] netops, Infrastructure-Foundations: Apply egress Source Address Validation on the Wikimedia core routers - https://phabricator.wikimedia.org/T372158#10097098 (Southparkfan) >>! In T372158#10056948, @ayounsi wrote: >> However, in reality, it should be possible to reject all IP packets where the source IP...
[17:14:00] thanks for logging <3
[17:14:20] fyi
[17:14:20] sukhe@lvs1020:~$ curl localhost:9090/pools/wdqs-scholarly_443
[17:14:20] wdqs1024.eqiad.wmnet: enabled/partially up/not pooled
[17:14:56] yup, seeing that
[17:15:13] I think this might be related to the health check?
[17:15:19] sukhe guessing the health check needs to be changed to https too...one sec
[17:15:58] We have it set to `http://localhost/readiness-probe` right now. I was thinking because it says `localhost` that it's fine but clearly I was misinterpreting that :D
[17:16:09] sukhe: shall we try switching that to `https://localhost/readiness-probe` or do you think we're on the wrong track here
[17:16:15] no I don't think so
[17:17:47] ryankemper: can you depool these for now for a sec
[17:18:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1067388 sukhe ryankemper health check fix patch
[17:18:08] sukhe: sure, pybal is going to refuse to depool more than 1 host though (depool threshold of .3, 2 hosts in pool)
[17:18:18] oh just two sigh
[17:18:29] no worries, let's figure this out directly
[17:18:45] and if not then we will see
[17:20:00] * inflatador sees the pybal alerts
[17:20:25] I went ahead and ACKed them all
[17:21:01] true
[17:21:05] agreed re figuring it out directly (we could depool all of eqiad btw so it fails over to codfw, but this service isn't getting user traffic yet so there's not a reason to)
[17:21:07] this is what we want for the healthcheck right?
[17:21:27] as in, the string
[17:21:27] sukhe: i'm not following
[17:21:37] ah
[17:22:01] yeah we have the exact same check on `wdqs` so that should be fine
[17:22:11] sukhe yes, I thought it was some weird/generic thing but it does actually look like the service is configured to give a special answer at that URL
[17:22:57] inflatador: merge it please
[17:23:16] sukhe ACK, will do
[17:23:51] OK, puppet-merged
[17:23:57] restarting
[17:24:41] (ah i.nflatador told me it's still running so restarting in 30s)
[17:25:33] sukhe: seeing a lot of other services say down but not up now
[17:25:35] https://www.irccloud.com/pastebin/dAmRJXcT/
[17:25:45] sukhe: we should roll back, yes?
[17:25:53] ryankemper: did you restart it?
[17:25:57] Active: active (running) since Tue 2024-08-27 17:13:25 UTC; 12min ago
[17:26:49] sukhe: nevermind, accidentally ran status instead of restart
[17:26:51] restarted properly now
[17:26:55] looks good now
[17:27:01] sukhe@lvs1020:~$ curl localhost:9090/pools/wdqs-scholarly_443
[17:27:01] wdqs1024.eqiad.wmnet: enabled/up/pooled
[17:27:01] wdqs1023.eqiad.wmnet: enabled/up/pooled
[17:27:16] there are some other services but not related to us indeed :]
[17:27:40] ', "Services in IPVS but unknown to PyBal: set(['10.2.2.36:80', '10.2.2.33:80'])
[17:27:44] this is expected too
[17:27:58] I will clean it up
[17:28:27] don't apply on low-traffic or codfw, let's check everything here first
[17:28:30] doing it
[17:28:58] ACK, we will await your response
[17:30:10] fwiw realized I misread that status earlier too. Was thinking it was saying those non-wdqs hosts were in the "pooled but not up" state but it was actually saying the opposite iiuc (up but not pooled) `mw1376.eqiad.wmnet (enabled/partially up/not pooled) is up` vs `WARN: wdqs1022.eqiad.wmnet (enabled/partially up/pooled): Fetch failed (http://localhost/readiness-probe), 0.001 s`
[17:33:30] ok, I cleaned up the above
[17:33:47] ryankemper: inflatador: good to go for lvs1019
[17:33:59] we will need to do a similar clean up there for the eqiad IPs
[17:34:05] sukhe: ack. would you like us to run the `sudo ipvsadm ---delete-service --tcp-service 10.2.2.36:80` etc commands after the restart or do you want to
[17:34:13] ryankemper: go for it please
[17:34:23] sudo ipvsadm --delete-service --tcp-service 10.2.2.36:80
[17:34:26] sudo ipvsadm --delete-service --tcp-service 10.2.2.33:80
[17:34:42] so main and scholarly
[17:35:25] for codfw:
[17:35:29] sudo ipvsadm --delete-service --tcp-service 10.2.1.33:80
[17:35:32] sudo ipvsadm --delete-service --tcp-service 10.2.1.36:80
[17:35:34] but for later
[17:39:23] sukhe: might have a flag wrong?
[17:39:25] https://www.irccloud.com/pastebin/BwmPdtqz/
[17:39:33] it matches what u logged in ops tho so i'm confused
[17:39:57] ah nvm
[17:39:58] extra dash
[17:39:59] we're in https://meet.google.com/fde-tbpf-wqh if it's easier to follow
[17:40:02] that's what i get for not copypasting
[17:40:02] sorry, I put an - but see above
[17:40:44] s/for not copypasting/from not copypasting from the right place
[17:40:45] :D
[17:42:29] sukhe: okay, permission to proceed to codfw?
[17:43:59] ryankemper: checking, give me a second
[17:44:21] looks good!
[17:44:24] please do
[17:44:38] note the different svc IPs for codfw above, so basically 10.2.1.33 and 10.2.1.36
[17:44:48] got it
[19:29:55] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[19:34:55] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[19:36:28] yeah
[19:39:55] FIRING: [15x] VarnishHighThreadCount: Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[19:49:55] FIRING: [15x] VarnishHighThreadCount: Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[19:59:55] RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp3066:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[21:13:14] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098130 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes...
[21:15:08] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098132 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w...
[22:02:06] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098199 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik...
[22:57:13] netops, Infrastructure-Foundations: Publish, and maintain ASPA records for valid AS14907 upstreams - https://phabricator.wikimedia.org/T372161#10098371 (Southparkfan) >>! In T372161#10056965, @ayounsi wrote: > [...] >> However, the ASPA record is yet another duplicate of the transit_provider list in Hom...