[04:48:49] serviceops, Anti-Harassment, IP Info, SRE: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (Niharika)
[08:15:36] serviceops, Maps, SRE-swift-storage, Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (fgiunchedi) I can confirm that's indeed the Bullseye upgrade, good find @Jgiannelos !
[08:16:34] serviceops, Maps, SRE-swift-storage, Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (fgiunchedi) cc @elukey as I know he'll be using v4 signatures too with thanos-swift
[08:22:57] serviceops, Infrastructure-Foundations, netops, Kubernetes: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (ayounsi) p:Triage→High
[08:47:38] serviceops, Infrastructure-Foundations, SRE, netops, Kubernetes: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (JMeybohm) a:JMeybohm
[09:19:10] serviceops, Infrastructure-Foundations, SRE, netops, and 2 others: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (JMeybohm) This happened while I was running docker pull tests 2021-07-21 ~15:04Z and kubernetes1005 is one of the dedicated sessionstore nodes runnin...
[11:07:38] serviceops, Prod-Kubernetes, Kubernetes: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 (JMeybohm) p:Triage→High
[11:09:09] serviceops, Infrastructure-Foundations, SRE, netops, and 2 others: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (JMeybohm) K8s event logs (https://logstash.wikimedia.org/goto/b16700661b703799af5ac188db2d3f5c) are pretty clear on that I created a lot of disk pres...
[12:12:08] effie, jayme : is there a way to see somehow past IP associations?
I'm particularly interested in 10.192.64.127
[12:17:26] zpapierski: more like by accident I guess
[12:18:08] zpapierski: https://logstash.wikimedia.org/goto/860b7686200c3004a380b4b58392505f
[12:24:47] thx
[12:25:03] I think I got it, it was jobmanager
[12:25:13] and for some reason, port 6123 is unreachable there
[12:25:19] I don't know why
[12:26:37] https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/flink-session-cluster/templates/networkpolicy.yaml - it's defined here (.Values.main_app.config.jobmanager_rpc_port)
[13:33:46] serviceops, Infrastructure-Foundations, SRE, netops, and 2 others: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (JMeybohm) Open→Resolved Ok, really dumb situation! A bunch of (failing) sessionstore Pods are clogging all resources on kubernetes1005, leavi...
[13:38:28] serviceops, Maps, SRE-swift-storage, Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (Jgiannelos) Things look better on staging now after the last deployment.
[13:39:14] serviceops, Maps, SRE-swift-storage, Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (Jgiannelos) Open→Resolved a:Jgiannelos
[14:01:48] jayme, effie : are you able to tell me where we made a mistake in that networkpolicy above? or if there are any other reasons why that port is unreachable? (connection times out, btw)
[14:04:01] zpapierski: I can take a look in a few, sorry
[14:04:38] no worries, I'll appreciate any help, whenever you can provide it
[14:09:04] zpapierski: I am not sure myself :/
[14:17:14] zpapierski: so, what do you expect to work? Like tcp/6123 connections to jobmanager from where?
[14:17:35] from taskmanagers, mostly
[14:17:54] what do you mean by mostly?
[14:18:23] I see connection attempts from job manager itself as well
[14:18:44] but those are localhost, right?
[14:18:45] either those are connections on external interface, or something else is happening I don't understand
[14:18:49] nope
[14:19:36] that means jobmanager is talking to itself via its Pod-IP?
[14:20:01] looks like it - there's an underlying actor system that Flink uses, akka
[14:20:23] but that works in general or does it give you errors as well?
[14:21:18] nope, errors all around - which doesn't explain why it works with a single TM, but that's what I see in the logs
[14:22:31] maybe it's about time to have this tracked in a phab task. At least I kind of lost track of when/what works and what does not. Do you already have one?
[14:23:22] zpapierski: from what I understood, it worked ok with 1 task manager?
[14:24:05] not specifically for this issue, no - what I do is a part of T273098
[14:24:39] yep it worked with a single TM yesterday, which is weird and makes me doubtful about the root cause here
[14:25:04] I'll add all details I've already gathered, unfortunately tomorrow (I'm starting a series of meetings now)
[14:26:09] hmm, there's another IP here that also uses a port 6123
[14:26:46] zpapierski: I think we should also start looking at the software itself
[14:27:21] it might not be directly related to our k8s setup but rather we are missing something in the flink on k8s part
[14:27:47] true, I feel the same way
[14:36:13] It would be nice if you could tell how to reproduce the issue/errors you see in staging. That would already be of great help.
[14:37:20] staging just because it's staging ( :) ) and so that we have a single environment/cluster to talk about and not different ones with different IPs, settings etc.
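Note on reproducing the tcp/6123 symptom discussed above: a minimal sketch, assuming Python 3 is available wherever the check is run (for example inside a taskmanager container). `probe_tcp` is a hypothetical helper written for this note, not part of Flink or any Kubernetes tooling. It only separates "connection times out" from "connection refused"; a timeout often points at packets being dropped on the way (a network policy or firewall is one possible cause), while a refusal usually means nothing is listening on that port. Neither outcome is conclusive on its own.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt.

    Returns 'open', 'refused', 'timeout', or 'error: <detail>'.
    """
    try:
        # create_connection performs the full TCP handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within the timeout: packets were likely dropped somewhere.
        return "timeout"
    except ConnectionRefusedError:
        # The peer answered with a reset: reachable, but nothing is listening.
        return "refused"
    except OSError as exc:
        # Anything else (no route, name resolution failure, ...).
        return f"error: {exc}"


# Example call, using the address and port mentioned in the log:
#   print(probe_tcp("10.192.64.127", 6123))
```

Running the same call from a taskmanager and from the jobmanager itself would show whether both paths behave the same way.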
[14:42:13] serviceops, SRE, decommission-hardware, ops-eqiad, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Jelto) Open→Resolved
[14:42:23] serviceops, SRE, decommission-hardware, ops-eqiad, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Jelto) From my side everything is done. Thanks everyone. I'm going to close this ticket. Feel free to...
[14:42:57] serviceops, SRE, Thumbor, ops-eqiad, User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Jelto)
[14:49:49] serviceops, SRE, Thumbor, ops-eqiad, User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (Jelto)
[15:57:17] \o for T285355, the analytics team is working on setting up the new `an-web1001` host which will basically be the new `thorium`, and I had a question about https://gerrit.wikimedia.org/g/operations/deployment-charts/+/2cf9cb4cd5708721875637e46c3b24cb1983d249/helmfile.d/services/linkrecommendation/values.yaml#31
[15:57:57] this was added early March 2021 in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/667561 (see also dependent patch https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/667553) by service-ops
[15:58:15] looks like the goal was to avoid going through edge caches for various reasons, and just talk to the production instance directly
[15:59:32] anyway this is the usual "we need to change this hardcoded value when we do the production cutover to the new host" scenario, and was just wondering the best way to go about it.
ideally we could have a CNAME that would route to the desired host and then that way in the future if we switch hosts again it'll be just a CNAME change (if not though changing the value in hiera isn't the end of the world)
[15:59:58] first things first...I see that `helmfile.d/services/linkrecommendation/values.yaml` has it set to `http://thorium.eqiad.wmnet/published/datasets/one-off/research-mwaddlink/` whereas `charts/linkrecommendation/values.yaml` has `https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/`
[16:00:36] Are those actually the same thing or is there a subtle difference? In https://gerrit.wikimedia.org/g/operations/puppet/+/production/hieradata/common/profile/trafficserver/backend.yaml#9 we map `analytics.wikimedia.org` to `https://thorium.eqiad.wmnet:8443`, so perhaps `helmfile.d/services/linkrecommendation/values.yaml` needs it to be `http` and not `https` and therefore can't use the existing `analytics.wikimedia.org` cname?
[16:01:15] serviceops, SRE, decommission-hardware, ops-eqiad, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Cmjohnson) Resolved→Open a:Jelto→Cmjohnson re-opened and assigned to me to use this s...
[16:29:23] serviceops, MW-on-K8s, SRE: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (Legoktm) >>! In T288848#7287882, @JMeybohm wrote: > I'd assume that MW makes HTTP calls to the public endpoints of MW. Those will be blocked in k8s as we generally prohibit egr...
[16:49:17] serviceops, SRE, Performance-Team (Radar), User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (jijiki)
[17:04:41] serviceops, SRE, decommission-hardware, ops-eqiad, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Cmjohnson) Open→Resolved all are decommissioned and removed from the rack
[18:20:09] serviceops, Release Pipeline, Services: Provide a node 12 production image (based on bullseye?) - https://phabricator.wikimedia.org/T284346 (Legoktm) Open→Resolved a:Jdforrester-WMF nodejs12 images are now available, however I would strongly recommend coordinating with ServiceOps when rol...
[19:39:11] serviceops, MW-on-K8s, SRE: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (TK-999) For the record, to resolve the same issue during our effort to upgrade Fandom's MW-on-k8s deployment, we ended up creating an HttpRequestFactory service override to dyn...
[20:09:24] serviceops, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (Legoktm) First, I think the solution picked depends on whether we expect other services or extensions to want to use this GeoIP information in the future....
[21:18:23] serviceops, SRE: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus)
[21:18:34] serviceops, SRE: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus) p:Triage→Medium
[22:23:56] legoktm: Thanks for the bullseye/node12 merge!
[22:24:18] yw :)
[22:25:39] Naturally I immediately ran into a blubber bug that's haunted us for a while so can't switch to it immediately for function-evaluator, but it's there. ;-)
[22:28:23] :/
[22:28:38] And of course `python-pkgconfig` has been replaced.
[22:28:58] Isn't dependency management such fun?
[23:56:01] James_F: probably now python3-pkgconfig?