[05:42:31] serviceops, Growth-Team, Growth-Team-Filtering, MW-on-K8s, Notifications: Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes) - https://phabricator.wikimedia.org/T223413#9943077 (Joe) Open→Resolved Whenever you find a new occurence of a bug aft...
[07:04:48] serviceops, MW-on-K8s, Observability-Logging: glogger produces invalid JSON when given input with non-printable characters - https://phabricator.wikimedia.org/T368640#9943204 (Joe) Open→Resolved This should be solved with the deployment in production.
[07:05:25] serviceops, MW-on-K8s: glogger crashes regularly in mw-on-k8s containers - https://phabricator.wikimedia.org/T363342#9943208 (Joe) In progress→Resolved This should be solved with this morning's release.
[08:33:22] serviceops, Prod-Kubernetes: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011 (JMeybohm) NEW
[08:35:56] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943442 (SGupta-WMF) Thank you @scott_french for detailed explanation , I am...
[08:43:48] effie: o/ when you have a moment for mcrouter ping me (anytime, no rush, even tomorrow)
[08:45:48] serviceops, Prod-Kubernetes: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9943462 (JMeybohm)
[08:49:14] elukey: yes dear
[08:54:38] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943487 (mforns) > The service is up and running in staging, and can be reac...
[09:33:02] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943578 (Sfaci) Great explanation @Scott_French!. I didn't know that. We'll...
[09:48:21] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9943649 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=59cc19c2-5b6b-4d0a-81ef-1bd409efc10c) set by jiji@cumin1002 for 2 days...
[09:52:37] !log preparing retirement of kubemaster200[1-2]
[10:54:09] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943809 (Clement_Goubert) >>! In T361835#9943486, @mforns wrote: >> The serv...
[11:11:40] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9943922 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: `kubemaster[2001-2002].codfw.wmnet` - kubema...
[11:22:14] serviceops, Prod-Kubernetes: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9943955 (ops-monitoring-bot) Host rebooted by jayme@cumin1002 with reason: None
[11:24:15] serviceops, Prod-Kubernetes, Patch-For-Review: PodSecurityPolicies will be deprecated with Kubernetes 1.21 - https://phabricator.wikimedia.org/T273507#9943956 (JMeybohm)
[11:54:29] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944063 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2307 to wikikube-worker2030 completed: - mw2307 (**PASS*...
[11:59:15] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944077 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2309 to wikikube-worker2031 completed: - mw2309 (**PASS*...
[12:04:55] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944093 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2365 to wikikube-worker2032 completed: - mw2365 (**PASS*...
[12:10:02] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944095 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2392 to wikikube-worker2033 completed: - mw2392 (**PASS*...
[12:17:42] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944107 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2393 to wikikube-worker2034 completed: - mw2393 (**PASS*...
[12:18:01] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944113 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2030.codfw.wmnet with OS bullseye
[12:18:24] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944115 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2031.codfw.wmnet with OS bullseye
[12:18:46] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944116 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2032.codfw.wmnet with OS bullseye
[12:19:05] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944120 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2033.codfw.wmnet with OS bullseye
[12:19:20] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944131 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2034.codfw.wmnet with OS bullseye
[12:21:56] serviceops, Content-Transform-Team-WIP, RESTBase, RESTBase Sunsetting, and 2 others: Enable PCS to send resource change events to handle URL purges - https://phabricator.wikimedia.org/T366819#9944138 (Jgiannelos) After a lot of debugging even running the same request using node i am still getting...
[12:41:23] serviceops, Prod-Kubernetes: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9944321 (JMeybohm) Hey #dc-ops #ops-eqiad, Could you please check on this node? It does not come up after reboot and mgmt (ssh and webinterface) is not reachable to me. The...
[12:41:45] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9944322 (JMeybohm)
[12:46:36] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9944359 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2aa77d1f-f420-475b-8769-b2c46d51c3fe) set by jiji@cumin1002 for 2 days...
[12:56:45] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944386 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2030.codfw.wmnet with OS bullseye completed: - wikikube-work...
[12:59:29] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944389 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2031.codfw.wmnet with OS bullseye completed: - wikikube-work...
[12:59:52] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944392 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2034.codfw.wmnet with OS bullseye completed: - wikikube-work...
[13:04:47] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944399 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2033.codfw.wmnet with OS bullseye completed: - wikikube-work...
[13:09:49] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9944407 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2032.codfw.wmnet with OS bullseye completed: - wikikube-work...
[13:21:34] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9944494 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: `kubemaster[1001-1002].eqiad.wmnet` - kubema...
[13:30:27] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9944546 (xcollazo) >>! In T361835#9943486, @mforns wrote: > ... >> The servi...
[13:39:32] serviceops, MW-on-K8s, TimedMediaHandler, Video: Create maintenance script to execute jobs provided in json format from standard input - https://phabricator.wikimedia.org/T369048 (Joe) NEW
[13:41:18] does anybody know the td-agent-bit package? (fluentd IIUC)
[13:55:41] claime: we're postponing today's switch maintenance for rack E1 until next Wed July 10th.
[13:55:45] hope that's ok
[13:55:55] topranks: sure
[13:56:17] thanks for the head's up
[14:12:29] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9944969 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=53db6080-f00f-4a86-ae49-cafba7047a9d) set by jiji@cumin1002 for 2 days...
[14:51:30] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9945246 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: `kubetcd[1004-1006].eqiad.wmnet` - kubetcd10...
[14:54:00] * volans wearing a clinic duty hat... what tag do you suggest for this task? T368945
[14:58:26] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q1): Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265#9945278 (Clement_Goubert) Done for `mw-misc` and `mw-wikifunctions`.
[15:06:10] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9945305 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: `kubetcd[2004-2006].codfw.wmnet` - kubetcd20...
[15:10:57] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464#9945309 (jijiki) In progress→Resolved This is done, please reopen should something went wrong. Tx @JMeybohm for the excellent docs
[15:35:43] <_joe_> elukey: I think it might be a crime of mine historically
[15:35:49] <_joe_> elukey: what do you need to know?
[15:36:55] _joe_ o/ I am reviewing buster images in the production-images repo, and the fluent-bit one installs that package that is only in buster-wikimedia
[15:37:04] I was wondering if we could clean up or not
[15:37:11] <_joe_> elukey: I think not
[15:37:19] <_joe_> let me check for a sec
[15:37:39] ah yes api-gateway sigh
[15:37:51] do you know where I can find the repo for the package? If any
[15:38:10] <_joe_> the api gateway uses it :/
[15:38:50] volans: #thumbor i guess
[15:39:21] claime: ack thx, I was wondering if it could be something related to MW and the fact that is a private wiki
[15:39:54] possible, I have just taken a quick look at the task
[15:40:09] that's the fluentbit that is used for metrics for the api-gateway?
[15:40:55] I think so
[15:43:02] serviceops, Infrastructure-Foundations, Security: Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366#9945476 (elukey)
[15:43:05] <_joe_> elukey: I *think* we just imported the package
[15:43:48] I am not even sure if it is currently used by someone tbh
[15:44:30] <_joe_> what metrics was it extracting?
[15:44:33] _joe_: I think it was jointly you and Petr
[15:44:45] <_joe_> cdanis: I just imported the package :)
[15:44:55] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, SRE: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9945483 (Clement_Goubert) Host is flapping, setting downtime until tomorrow
[15:45:02] serviceops, DC-Ops, ops-eqiad, Prod-Kubernetes, SRE: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9945484 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d5196ee-59a9-4e12-b2fc-c8c25de6ab16) set by cgoubert@cumin1002...
[15:45:05] <_joe_> yes Petr was doing some stuff with it I erased from memory
[15:45:07] <_joe_> :)
[15:45:11] <_joe_> elukey: https://packages.fluentbit.io/debian/
[15:46:00] https://phabricator.wikimedia.org/T175527 for some history
[15:46:32] although I think we managed to delete it afterwards and then it came back again, let me refresh my memory
[15:47:13] ah, there we are: https://phabricator.wikimedia.org/T251812
[15:47:21] it was resurrected in that task.
[15:47:49] so its upgrade to bookworm probably needs a separate task
[15:49:16] https://phabricator.wikimedia.org/T251812#6532946
[15:49:20] so there are some data somewhere
[15:49:29] but... who's using them?
[15:49:33] 2020
[15:49:34] lol
[15:50:36] <_joe_> I suggest we turn it off tbh :)
[15:50:47] also this could easily end up into the neverending theme "api gateway - future requirements and usages"
[15:50:50] that ^ has my vote too
[15:51:03] let's just turn it off
[15:51:26] <_joe_> we can turn it back on if someone needs it
[15:51:35] <_joe_> and is willing to maintain it
[15:51:38] yup
[15:51:56] <_joe_> because we're the current maintainers of the api gateway, so I guess we do have authority
[15:52:18] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9945514 (Scott_French) Thanks for the sample data, @xcollazo. Using the fir...
[15:52:28] yup, agreed
[15:53:27] <_joe_> the data was never moved to turnilo either
[15:54:28] I checked `kafkacat -b localhost:9092 -t codfw.api-gateway.request -C` (eqiad as well) on kafka jumbo, I see recent data but only from the same UA from localhost
[15:54:33] that I assume it is a health heck
[15:54:36] *check
[15:54:46] {"user-agent":"Apache-HttpClient/4.5.6 (Java/1.8.0_412)"},"client_ip":"127.0.0.1"}
[15:55:20] yes let's kill it
[15:56:26] \o/
[15:56:34] removing parts from the infrastructure is the best part
[15:56:45] !bashit ^
[15:57:39] <_joe_> is any analytics process consuming that data?
[15:57:51] <_joe_> if so we should give DE a heads up I think
[15:58:24] <_joe_> "on day X, we'll turn off production of events"
[15:59:03] +1, even if my understanding is that we never got into a real productionization, judging from the events in kafka and the task that Alex linked
[16:00:45] I 've already submitted the revert for the image in https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1051407
[16:00:56] I 'll comment on that task as well
[16:01:50] super thanks :)
[16:02:11] stepping afk, have a good rest of the day folks
[16:22:16] serviceops: deploy1003 implementation tracking - https://phabricator.wikimedia.org/T364417#9945737 (akosiaris) Apologies, I failed to anticipate that consequence, I 've merged a change to remove deploy1003 from the list of scap masters.
[16:38:38] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9945824 (xcollazo) @Scott_French : One odd thing I notice is that, even thou...
[17:01:08] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9945867 (Scott_French) Thanks for taking a look, @xcollazo. I'll defer to @m...
[17:04:36] Hi folks! statsd-exporter is not using its mapping configuration and exporting quantiles rather than histograms. Can someone assist me in reconfiguring statsd-exporter to use its configuration? https://phabricator.wikimedia.org/T369080
[18:05:14] <_joe_> cwhite: hah it's my fault.
[18:05:32] <_joe_> cwhite: question - are the data from mw-debug correctly formatted?
[18:11:09] <_joe_> cwhite: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1051428 and followups should fix the issue
[18:11:23] _joe_: afaict yes, but mwdebug is puppet-managed afaik
[18:11:36] <_joe_> no I mean the k8s deployment
[18:12:46] I'm not familiar enough with the k8s deployment to know how to check :(
[18:13:51] <_joe_> cwhite: I meant in prom itself
[18:14:06] <_joe_> you could look at the metrics and those from mw-debug should be correct
[18:14:39] <_joe_> sorry, it's dinner time here :) Gotta bail
[18:15:55] cwhite: it should be enough to look at the same metrics but with {kubernetes_namespace="mw-debug"}
[18:17:48] thanks cdanis. If that label should select something I'd expect `mediawiki_action_executeTiming_seconds_bucket{kubernetes_namespace="mw-debug"}` to have data
[18:18:53] if the metric is created lazily upon first execution you might need to make the right request to trigger it
[18:19:57] we have instances of this `_bucket` suffixed metric from mwdebug1002: `mediawiki_action_executeTiming_seconds_bucket{instance="mwdebug1002:9112",le="+Inf",action="view"}`
[18:24:27] I did a dummy edit on a wiki, with the WikimediaDebug browser extension on pointing at k8s-mwdebug, and that caused the summary metric to get some values in it: https://grafana.wikimedia.org/goto/1CwNZCwIR?orgId=1
[18:24:30] but not the buckets metric
[18:26:45] ... ah but I don't think that's what joe was asking because none of his patches are merged :)
[18:28:03] this is on the simpler end of changes to make on k8s cwhite if you wanted to do so yourself
[18:28:48] just fell down a 20 minute rabbit hole trying to understand how this could have worked previously (before moving to the centralized exporter) then came back and found _j.oe_ already has patches :)
[18:29:44] sure is something, how often "how did this *ever* work?!?" comes up mid-debugging
[18:35:51] with any sufficiently complex system, the answer tends to be that it did work, just _subtly_
[18:40:43] cdanis: that dashboard link contains quantiles, not the expected histograms
[18:41:04] so, no, it appears mw-debug is not configured to create histograms
[18:41:10] cwhite: yeah, indeed, because joe's patches haven't been deployed
[18:41:16] (and then mw-debug redeployed afterwards)
[18:42:48] that's the part I was saying you could do yourself if so inclined
[18:46:37] I feel I'm missing enough context that it'd be dangerous to try on my own. For context, I learned just a few minutes ago what the difference between "mwdebug" and "mw-debug".
[18:46:46] I'm willing to learn, though.
[19:36:56] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9946877 (mforns) @Scott_French Would it be possible for us to make a last ho...
[19:59:56] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9947026 (Scott_French) @mforns sure, that's no problem at all! Just let me k...
[21:20:46] serviceops, Kubernetes: sextant update should support a minimal change mode - https://phabricator.wikimedia.org/T369119 (CDanis) NEW
[22:30:39] serviceops, MW-on-K8s: Pipe stdin into one-off maintenance scripts on Kubernetes - https://phabricator.wikimedia.org/T368966#9947478 (RLazarus) Open→Resolved ` rzl@deploy1002:~$ echo 'https://office.wikimedia.org/wiki/User:RLazarus_(WMF)' | ./mwscript-k8s --attach -- purgeList.php ⏳ Starting...