[07:17:58] serviceops, Infrastructure-Foundations, netops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545#9929335 (ayounsi) Strictly on the network side, there is no blocker one way or the other. I think I miss some context, what's the current low-tr...
[07:23:05] serviceops, WMDE-TechWish-Maintenance, Maps (Kartotherian), Patch-Needs-Improvement: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789#9929343 (awight)
[07:24:36] serviceops, MW-Interfaces-Team, Traffic: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9929351 (daniel) This is unblocked from our side - which team is going to take care of creating the routing rule? #traffic or #serviceops?
[07:24:48] serviceops: Migrate docker registry hosts to bullseye - https://phabricator.wikimedia.org/T332016#9929352 (MoritzMuehlenhoff) Why bullseye, this should be bookworm? docker-registry is packaged in Debian, so we can simply use bookworm and use the package from it. In fact, we are already using the bookworm pac...
[08:54:00] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9929574 (JMeybohm) >>! In T353464#9926653, @jijiki wrote: > Unless something else pops up, we shall be retiring the old hosts (aka the VMs) next...
[08:58:06] hello folks!
[08:58:48] I am building the new mcrouter image on bookworm, and one thing that popped out is that systemd-journal (systemd is a dep of mcrouter) gets user 999, the same that we set for mcrouter
[08:59:23] I noticed the comment about uid=999 and pod security policies, but I don't find where it is set
[09:00:01] does it need to be explicitly 999, or can we choose another one?
[09:01:49] looks to me like there's no specific dependency upon it being 999, just as long as it's numeric
[09:07:32] serviceops, docker-pkg, Release-Engineering-Team: Attach opencontainers image metadata to docker images - https://phabricator.wikimedia.org/T345070#9929594 (MoritzMuehlenhoff)
[09:10:39] hnowlan: o/ in the Dockerfile I see that there is a mention of pod security policy, maybe it is a special range
[09:11:46] I think that it doesn't need to be of a specific range, but that it needs to be an int
[09:11:52] could be wrong though, lemme check
[09:16:21] elukey: ah yeah, MustRunAsNonRoot just requires a numeric UID rather than a username
[09:16:28] serviceops, Infrastructure-Foundations, netops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545#9929623 (Vgutierrez) >>! In T368545#9929335, @ayounsi wrote: > I think I miss some context, what's the current low-traffic setup ? Usually servic...
[09:20:54] hnowlan: ah wait so we could allocate a random uid for mcrouter and use USER {{ "mcrouter" | uid }}, and it would be fine?
[09:22:08] serviceops: Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI - https://phabricator.wikimedia.org/T352245#9929647 (MoritzMuehlenhoff) >>! In T352245#9894693, @Scott_French wrote: > Thanks, @MoritzMuehlenhoff - you're absolutely right that we could decouple these. > > The TLS proxy will go away with...
[09:22:16] I think so (obviously a consistently random one though :P)
[09:22:52] a random one every time would make things more spicy!
[09:36:24] serviceops: Migrate docker registry hosts to bullseye - https://phabricator.wikimedia.org/T332016#9929673 (Clement_Goubert) If we jump right to bookworm, we need to copy the `python3-docker-report` package to bookworm. A migration plan would look like: # Run `httpbb /srv/deployment/httpbb-tests/docker-regis...
[09:53:54] serviceops, MW-on-K8s, TimedMediaHandler, Patch-For-Review, Video: Create a deployment for `shellbox-timedmedia` - https://phabricator.wikimedia.org/T357309#9929744 (hnowlan) In progress→Resolved a:kamila→hnowlan I'm sure there'll be some tweaks further down the road, but...
[10:39:56] ended up doing https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1049587 for mcrouter
[10:40:09] <_joe_> taking a look
[10:55:49] answered thanks!
[12:11:20] serviceops, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9930200 (Jclark-ctr) @akosiaris please update Site.pp file for this server
[12:27:12] serviceops, MW-on-K8s, Patch-For-Review: prometheus-apache-exporter in buster does not support -log.format json - https://phabricator.wikimedia.org/T283861#9930255 (Clement_Goubert) In progress→Resolved New version has been deployed.
[12:32:37] serviceops, LPL Technical Support, SRE, Wikimedia-Site-requests, Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893#9930305 (MaryMunyoki)
[12:32:53] serviceops, LPL Technical Support, SRE, Wikimedia-Site-requests, Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9930306 (MaryMunyoki)
[12:42:37] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9930347 (Jclark-ctr)
[12:43:01] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9930352 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye
[13:38:48] serviceops, MW-on-K8s, TimedMediaHandler, Video: Log filename in shellbox-video httpd - https://phabricator.wikimedia.org/T368619 (hnowlan) NEW
[13:46:13] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930546 (CDanis) Could be convinced otherwise, but I'm generally in favor of the MSS clamping option -- we know it works and the tradeoff...
[14:00:13] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930584 (Joe) I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation for backend services? Is there a compell...
[14:03:29] serviceops, Infrastructure-Foundations, netops, Traffic: weighted maglev viability for low-traffic services - https://phabricator.wikimedia.org/T368545#9930604 (Joe) It is pretty clear to me that the only way to have fair load balancing with `maglev` is if we do the consistent hashing using the r...
[14:08:42] serviceops, Patch-For-Review: Migrate docker registry hosts to bullseye - https://phabricator.wikimedia.org/T332016#9930632 (JMeybohm) >>! In T332016#9929672, @Clement_Goubert wrote: > @JMeybohm could you check the `httpbb` tests are still relevant and returning the expected results? Almost. I've upload...
[14:20:47] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930681 (akosiaris) T352956 is related (possibly a duplicate) and I 've mulling over it for a few months now. I think we need to have a l...
[14:31:24] serviceops, Content-Transform-Team-WIP, RESTBase, RESTBase Sunsetting, and 2 others: Enable PCS to send resource change events to handle URL purges - https://phabricator.wikimedia.org/T366819#9930713 (Jgiannelos) After this patch I was expecting staging to publish events on staging eventgate. htt...
[14:35:22] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930740 (Vgutierrez) >>! In T368544#9930584, @Joe wrote: > I'd go ahead and take a step back: why do we need to switch to IPIP encapsulat...
[14:45:26] effie: o/ about mcrouter - is there a way to test https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1049587 before hitting production?
[14:45:49] we shouldn't have a lot of surprises since others are using it, but the version jumps from the current one
[14:46:40] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930773 (Joe) >>! In T368544#9930740, @Vgutierrez wrote: >>>! In T368544#9930584, @Joe wrote: >> I'd go ahead and take a step back: why d...
[14:51:49] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9930798 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1359 to wikikube-worker1022 completed: - mw1359 (**PASS*...
[14:52:20] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930800 (BBlack) For more context: eventually our Katran-based Liberica balancer will replace pybal/LVS. The Katran one has to use IPIP,...
[14:53:19] serviceops, DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9930805 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye executed with error...
[14:57:10] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930822 (Vgutierrez) theoretically speaking we could keep low-traffic on liberica/IPVS (instead of liberica/Katran) to be able to get rid...
[14:58:10] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9930840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1365 to wikikube-worker1023 completed: - mw1365 (**PASS*...
[15:10:33] serviceops, Prod-Kubernetes, Patch-For-Review: PodSecurityPolicies will be deprecated with Kubernetes 1.21 - https://phabricator.wikimedia.org/T273507#9930894 (JMeybohm)
[15:15:56] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9930938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1404 to wikikube-worker1026 completed: - mw1404 (**PASS*...
[15:17:40] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930949 (cmooney) >>! In T368544#9930584, @Joe wrote: > I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation...
[15:19:53] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9930977 (Joe) >>! In T368544#9930822, @Vgutierrez wrote: > theoretically speaking we could keep low-traffic on liberica/IPVS (instead of...
[15:21:32] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931012 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1366 to wikikube-worker1024 completed: - mw1366 (**PASS*...
[15:26:08] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931059 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1022.eqiad.wmnet with OS bullseye
[15:27:00] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931071 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1023.eqiad.wmnet with OS bullseye
[15:27:50] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931075 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1024.eqiad.wmnet with OS bullseye
[15:28:49] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931079 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1026.eqiad.wmnet with OS bullseye
[15:29:13] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931082 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw1373 to wikikube-worker1025 completed: - mw1373 (**PASS*...
[15:29:56] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931085 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1025.eqiad.wmnet with OS bullseye
[15:35:38] elukey: I am off till monday, but we can make sure for this not to hit production without us being on stand by
[15:36:20] super sorry for the ping, let's talk about it next week!
[15:36:47] it is alright, no problem
[15:41:31] serviceops, Infrastructure-Foundations, netops, Traffic: IPIP encapsulation considerations for low-traffic services - https://phabricator.wikimedia.org/T368544#9931142 (Vgutierrez) >>! In T368544#9930977, @Joe wrote: > oh I agree 100% with this. My doubts were specifically for switching to katran...
[16:03:36] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931252 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1022.eqiad.wmnet with OS bullseye completed: - wikikube-work...
[16:05:38] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931256 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1024.eqiad.wmnet with OS bullseye completed: - wikikube-work...
[16:09:45] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931263 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1023.eqiad.wmnet with OS bullseye completed: - wikikube-work...
[16:12:07] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931274 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1025.eqiad.wmnet with OS bullseye completed: - wikikube-work...
[16:13:57] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9931281 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1026.eqiad.wmnet with OS bullseye completed: - wikikube-work...
[16:34:19] serviceops, Infrastructure-Foundations, SRE: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855#9931347 (Clement_Goubert) Resolved→Open p:Medium→High This issue is biting us again, the time between a pu...
[16:36:03] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T368639 (Clement_Goubert) NEW
[16:51:48] serviceops, MW-on-K8s, Observability-Logging: glogger produces invalid JSON when given input with non-printable characters - https://phabricator.wikimedia.org/T368640 (kamila) NEW
[16:51:58] serviceops, MW-on-K8s, Observability-Logging: glogger produces invalid JSON when given input with non-printable characters - https://phabricator.wikimedia.org/T368640#9931461 (kamila) p:Triage→Low
[16:54:37] serviceops, MW-on-K8s, Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935#9931467 (kamila) Open→Resolved Closing this and opening T368640, as what I'm seeing now is a tiny sub-problem of the original problem
[17:11:07] serviceops, Infrastructure-Foundations, SRE: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855#9931559 (MoritzMuehlenhoff) >>! In T354855#9931347, @Clement_Goubert wrote: > This issue is biting us again, the time betw...
[18:14:49] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9931860 (mforns) > @SGupta-WMF or @mforns - One additional request: if one o...
[18:24:10] serviceops, Prod-Kubernetes, Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9931914 (Scott_French) mw-debug and canaries were updated around 17:23 UTC (alas, I hit enter before adding the message on that `scap`...
[18:24:25] serviceops, Prod-Kubernetes, Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9931927 (Scott_French)
[19:14:28] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9932254 (Scott_French) Thanks, @mforns! Also, I see you hit retry on the fa...
[19:39:25] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9932366 (mforns) Thanks for the follow up @Scott_French! @SGupta-WMF, I trie...
[19:43:15] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9932401 (mforns) Also @SGupta-WMF I reviewed the swagger spec, and I found a...
[21:05:33] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9932643 (Scott_French) Thanks for giving that a try, @mforns ! Looking at `...
[22:22:38] serviceops, 3D, Commons, Regression: STL 3D models broken: "Sorry, the file Undefined cannot be displayed since it is not present on the current page." - https://phabricator.wikimedia.org/T368301#9932888 (TheDJ) logstash shows a big increase in thumbor errors for STL: `/opt/lib/python/site-packag...