[08:33:01] serviceops, Infrastructure-Foundations, Maps (Kartotherian), Patch-For-Review: Scale up Kartotherian on Wikikube and move live traffic to it - https://phabricator.wikimedia.org/T386926#10617246 (elukey) Over the weekend I noticed a steady increase of the Kartotherian pods' memory consumption. I c...
[09:31:01] serviceops, CirrusSearch, Data-Platform-SRE (2025.03.01 - 2025.03.21), Discovery-Search (2025.03.01 - 2025.03.21): Repartition [eqiad|codfw].cirrussearch.update_pipeline.update.v1 topics in kafka-main@[eqiad|codfw] - https://phabricator.wikimedia.org/T387863#10617486 (dcausse) In the interest of...
[09:32:34] serviceops, CirrusSearch, Data-Platform-SRE (2025.03.01 - 2025.03.21), Discovery-Search (2025.03.01 - 2025.03.21): Repartition [eqiad|codfw].cirrussearch.update_pipeline.update.v1 topics in kafka-main@[eqiad|codfw] - https://phabricator.wikimedia.org/T387863#10617493 (dcausse)
[09:51:43] serviceops, CirrusSearch, Data-Platform-SRE (2025.03.01 - 2025.03.21), Discovery-Search (2025.03.01 - 2025.03.21): Repartition [eqiad|codfw].cirrussearch.update_pipeline.update.v1 topics in kafka-main@[eqiad|codfw] - https://phabricator.wikimedia.org/T387863#10617602 (brouberol) I can take care o...
[09:59:30] o/ we're promoting the cirrus updater streams to "stable" and as part of that we're moving from using topics suffixed with "rc0" to ones with "v1", this should not cause extra usage on kafka-main (data is just going to flow to another topic). But for this we need to partition these "v1" topics the same way, if you don't have any objections to this could you give us the green lights on
[09:59:32] https://phabricator.wikimedia.org/T387863#10617602? Thanks!
[10:01:54] serviceops, CirrusSearch, Data-Platform-SRE (2025.03.01 - 2025.03.21), Discovery-Search (2025.03.01 - 2025.03.21): Repartition [eqiad|codfw].cirrussearch.update_pipeline.update.v1 topics in kafka-main@[eqiad|codfw] - https://phabricator.wikimedia.org/T387863#10617650 (dcausse)
[10:10:14] serviceops, CirrusSearch, Data-Platform-SRE (2025.03.01 - 2025.03.21), Discovery-Search (2025.03.01 - 2025.03.21): Repartition [eqiad|codfw].cirrussearch.update_pipeline.update.v1 topics in kafka-main@[eqiad|codfw] - https://phabricator.wikimedia.org/T387863#10617666 (JMeybohm) AIUI you will be m...
[10:12:03] serviceops, CirrusSearch, Data-Platform-SRE (2025.03.01 - 2025.03.21), Discovery-Search (2025.03.01 - 2025.03.21): Repartition [eqiad|codfw].cirrussearch.update_pipeline.update.v1 topics in kafka-main@[eqiad|codfw] - https://phabricator.wikimedia.org/T387863#10617675 (dcausse) >>! In T387863#1061...
[10:33:39] serviceops, Data-Engineering-Roadmap, Data-Platform-SRE, Dumps-Generation, and 3 others: Orchestrate dumps v1 from an airflow instance - https://phabricator.wikimedia.org/T388378 (brouberol) NEW
[11:19:00] serviceops, MW-on-K8s, Patch-For-Review: Remove outdated wikitech periodic jobs - https://phabricator.wikimedia.org/T388249#10617932 (Clement_Goubert) Open→Resolved
[11:30:24] serviceops, collaboration-services, Prod-Kubernetes, Kubernetes: Replace k8s-controller-sidecars with built in Sidecar containers on k8s 1.31 - https://phabricator.wikimedia.org/T386694#10617966 (JMeybohm) p:Triage→Medium
[11:33:12] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update kube-state-metrics for k8s 1.31 - https://phabricator.wikimedia.org/T388387 (JMeybohm) NEW
[11:33:15] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Update kube-state-metrics for k8s 1.31 - https://phabricator.wikimedia.org/T388387#10617999 (JMeybohm) p:Triage→Medium
[11:35:27] serviceops, MW-on-K8s: Make sure jobs are still defined on deployment-prep as systemd timers - https://phabricator.wikimedia.org/T385869#10618004 (Clement_Goubert) Open→Resolved Jobs will keep being defined on beta as systemd timers as long as the `interval` parameter is defined.
[11:36:10] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Ensure all required kubectl versions are installed on deploy hosts - https://phabricator.wikimedia.org/T388388 (JMeybohm) NEW
[11:37:01] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, and 2 others: Ensure all required kubectl versions are installed on deploy hosts - https://phabricator.wikimedia.org/T388388#10618021 (JMeybohm) p:Triage→High
[11:37:20] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: top-level config key environments must be defined before releases in helmfile.yaml - https://phabricator.wikimedia.org/T387836#10618023 (JMeybohm) Open→Resolved
[11:39:17] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Ensure the correct helm version is used for each cluster - https://phabricator.wikimedia.org/T388390 (JMeybohm) NEW
[11:39:38] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: Ensure the correct helm version is used for each cluster - https://phabricator.wikimedia.org/T388390#10618056 (JMeybohm) p:Triage→High
[11:58:08] serviceops: Migrate mw-cron to PHP 8.1 - https://phabricator.wikimedia.org/T387916#10618171 (Clement_Goubert)
[11:58:27] serviceops: Migrate mw-cron to PHP 8.1 - https://phabricator.wikimedia.org/T387916#10618172 (Clement_Goubert) a:Scott_French→Clement_Goubert
[12:02:21] serviceops, Release-Engineering-Team: deploy1003 reports helmfileAdminPendingChanges - https://phabricator.wikimedia.org/T387900#10618187 (JMeybohm) Open→Resolved a:JMeybohm I'll resolve this, since I just ran apply against all clusters
[12:04:08] serviceops, Patch-For-Review: Migrate mw-cron to PHP 8.1 - https://phabricator.wikimedia.org/T387916#10618194 (Clement_Goubert)
[12:16:07] serviceops, sre-alert-triage: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T388398 (LSobanski) NEW
[12:22:00] serviceops, MediaWiki-Engineering, Traffic, Upstream, Wikimedia-production-error: 503 error when edit large size pages on PHP 8.1 - https://phabricator.wikimedia.org/T385395#10618275 (jijiki) Open→Resolved a:jijiki We are marking this as resolved, if you reckon there is someth...
[12:31:15] serviceops, sre-alert-triage: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T388398#10618299 (JMeybohm) →Duplicate dup:T384450
[12:31:22] serviceops, collaboration-services, Prod-Kubernetes, Kubernetes, Patch-For-Review: Update wikikube-staging-codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T384450#10618301 (JMeybohm)
[12:45:07] serviceops, Citoid, Content-Transform-Team, Data-Engineering, and 9 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#10618386 (kostajh)
[12:48:27] serviceops: Improve readability of Switchover documentation - https://phabricator.wikimedia.org/T361113#10618442 (jijiki) Open→Resolved Before * https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&oldid=2114128 * https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter/Co...
[12:56:19] serviceops, ChangeProp, RESTBase: Error spike on RESTBase from changeprop requests - https://phabricator.wikimedia.org/T349324#10618480 (jijiki) Open→Declined Bluntly closing this, please reopen if you think there is something to be done.
[13:05:47] serviceops, Commons, Traffic, Wikimedia-Site-requests: Enforce upload rate limits for bots on commons - https://phabricator.wikimedia.org/T248177#10618520 (jijiki) Open→Resolved a:jijiki Closing this as many things have changed since then, ie we have various kinds of ratelimits in...
[13:11:15] serviceops, StructuredDataOnCommons, Wikidata, Wikidata-Query-Service: WCQS authentication(?) issue - https://phabricator.wikimedia.org/T332289#10618565 (jijiki) Open→Invalid Please reopen if there is something actionable for #serviceops, marking as invalid for now.
[13:11:55] serviceops: Api-gateway access log streaming erros in staging - https://phabricator.wikimedia.org/T283382#10618584 (jijiki) Open→Invalid
[13:26:59] serviceops, RESTBase Sunsetting, User-notice-archive: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints - https://phabricator.wikimedia.org/T374683#10618733 (HCoplin-WMF) @Legoktm leaving out an incident report was unintentional. I was frankly no...
[13:57:59] serviceops, MW-on-K8s, SRE, Python3-Porting: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764#10618827 (Reedy)
[15:04:49] serviceops, Page Content Service, RESTBase Sunsetting, Content-Transform-Team (Work In Progress), and 2 others: Add time jitter on TTL when invalidating caches on PCS - https://phabricator.wikimedia.org/T387472#10619234 (Jgiannelos) Open→Resolved
[15:04:54] serviceops, Page Content Service, Content-Transform-Team (Work In Progress), Patch-For-Review: Rollout more wikis after week 1 of testing with production traffic - https://phabricator.wikimedia.org/T387277#10619236 (Jgiannelos) Open→Resolved
[15:23:30] serviceops, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10619361 (Jhancock.wm)
[15:23:50] serviceops, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10619365 (Jhancock.wm) a:Kappakayala→Jhancock.wm
[15:26:30] serviceops, SRE, Kubernetes: Remove `.cluster.local.` suffix in PTR responses - https://phabricator.wikimedia.org/T376762#10619396 (joanna_borun)
[15:27:25] serviceops, SRE, Kubernetes: Remove `.cluster.local.` suffix in PTR responses - https://phabricator.wikimedia.org/T376762#10619404 (cmooney) p:Triage→Low
[15:36:08] serviceops, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, Kubernetes: top-level config key environments must be defined before releases in helmfile.yaml - https://phabricator.wikimedia.org/T387836#10619465 (Jelto)
[15:51:32] hnowlan: I drafted some docs for caching+PCS now that we migrate more traffic away from RB https://www.mediawiki.org/wiki/Page_Content_Service#Caching
[15:52:14] nemo-yiannis: nice, thank you!
[15:59:24] serviceops, Kubernetes: Add pod ip address blocks to staging-eqiad - https://phabricator.wikimedia.org/T386232#10619590 (akosiaris) >>! In T386232#10558448, @JMeybohm wrote: > Ah, I was pretty sure I was reading netbox wrong :) > 10.64.80.0/20 and 10.192.80.0/20 should be fine as well though... Apparent...
[16:01:10] serviceops, RESTBase, Patch-For-Review: Add support in REST gateway to simplify edge logic - https://phabricator.wikimedia.org/T344358#10619620 (jijiki) @hnowlan is this task still valid?
[16:11:25] hnowlan: deploying the changeprop patch now: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1125225
[16:12:28] thanks
[16:13:06] nemo-yiannis: looks like a scap sync is underway
[16:13:39] oh,ok i checked the deployments calendar and i didn't see anything
[16:14:05] i will deploy after
[16:16:59] serviceops: mw-api-ext unavailability 2024-05-22 18:30 UTC - https://phabricator.wikimedia.org/T365655#10619653 (jijiki) Open→Resolved a:jijiki Closing this, if you feel there is work to be done here, or we need to add more information/comments, please reopen:)
[16:23:36] serviceops, Kubernetes: Add pod ip address blocks to staging-eqiad - https://phabricator.wikimedia.org/T386232#10619698 (JMeybohm) Let's go with the /21's then which should give us ample room for staging
[16:28:29] hnowlan: is now ok ? i don't see anything happening but just to be on the safe side
[16:29:16] looks okay
[16:30:17] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10619748 (jijiki)
[16:30:18] ugh, version bump
[16:33:32] serviceops, Kubernetes: Add pod ip address blocks to staging-eqiad - https://phabricator.wikimedia.org/T386232#10619785 (JMeybohm)
[16:51:41] hnowlan: done
[16:52:13] looking
[16:57:07] serviceops, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10619941 (Jhancock.wm) @Clement_Goubert could you update the site.pp file to include the wikikube-workker servers? thank you!
[16:58:21] no surge in purges so far anyway
[16:59:18] heads up: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126109
[16:59:38] this was my oversight, i confused the restbase setup with our current state
[17:01:43] hm, we need to use a different var, revert won't work. We need both the old definition to keep restbase up-to-date and a new list with the new endpoints
[17:10:30] hnowlan: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1126110
[17:20:12] serviceops, Scap, Release-Engineering-Team (Radar): Set scap minimum python version to 3.7 - https://phabricator.wikimedia.org/T302086#10620079 (dancy) Stalled→Resolved a:dancy Scap has been 3.7+ since Wed Feb 22 13:20:10 2023.
[17:32:28] serviceops, DC-Ops, ops-codfw, SRE: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10620139 (Clement_Goubert) >>! In T384970#10619941, @Jhancock.wm wrote: > @Clement_Goubert could you update the site.pp file to include the wikiku...
[17:40:52] thanks hnowlan for the review should i go ahead and deploy ?
[17:42:11] sw.french is doing a deploy at the moment, maybe wait a minute or two
[17:42:19] busy day :)
[17:42:42] ok
[17:46:09] serviceops, Release-Engineering-Team: deploy1003 reports helmfileAdminPendingChanges - https://phabricator.wikimedia.org/T387900#10620207 (jijiki) This might be related to T381417?
[17:47:27] nemo-yiannis: I think you're okay to go
[17:47:41] 👍
[18:02:34] serviceops, Patch-For-Review: Migrate mw-cron to PHP 8.1 - https://phabricator.wikimedia.org/T387916#10620282 (Clement_Goubert) Open→Resolved `mw-cron` now uses PHP 8.1 base images. We will track specific job issues in their migration task.
[18:16:22] hnowlan: I just did some testing on plwiki thats already switchedover to PCS/REST-gateway and an edit on mediawiki shows up on mobile-html API after I refresh
[18:17:03] we might need some improvements for wikidata changes, but at least for now i stopped my scrappy workaround
[18:17:08] i will take a look tomorrow for the rest
[18:20:18] great, thank you!
[18:23:09] serviceops, Datacenter-Switchover, Patch-For-Review: Investigate burst of DBReadOnlyError during switchover test - https://phabricator.wikimedia.org/T387509#10620355 (hnowlan) Open→Resolved >>! In T387509#10602665, @Krinkle wrote: >>>! @hnowlan wrote in the task description: >> During the...
[18:30:03] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10620366 (Scott_French)
[19:14:22] serviceops, MW-on-K8s, Observability-Metrics, Patch-For-Review: Update Benthos chart and image for k8s deployments - https://phabricator.wikimedia.org/T385210#10620649 (kamila) Open→Resolved
[20:50:22] serviceops, Patch-For-Review: Migrate mw-script to PHP 8.1 - https://phabricator.wikimedia.org/T387917#10621044 (Scott_French) Open→In progress p:Triage→Medium Many thanks to Ahmon, the scap changes to support non-deploy helmfile releases are live. The three pending puppet patches to dec...
[21:04:00] serviceops, Patch-For-Review, PHP 8.1 support: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006#10621095 (Scott_French) I just realized there's a typo in my SAL log in T386006#10619559 - that should have read "in component/php81 from apt-staging" (oops).
[21:25:11] apropos of T388460, can anyone tell me what the envoys that sit between mediawiki & sessionstore (kask) log, and how to look at those?
[21:35:11] serviceops, Content-Transform-Team: Parsoid extension no longer loaded on parsoidtest1001 - https://phabricator.wikimedia.org/T388465 (ssastry) NEW
[21:37:10] serviceops, Content-Transform-Team: Parsoid extension no longer loaded on parsoidtest1001 - https://phabricator.wikimedia.org/T388465#10621208 (ssastry) Unrelated to this task, we could remove the 'scandium' reference from the config file - I'll submit a patch.
[22:00:33] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10621357 (jijiki)
[22:25:20] urandom: I don't think we have much envoy logging configured, but whatever there is, you can read it in the kubernetes app logs (https://logstash.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c) with a container name ending in `-tls-proxy`
[22:25:58] urandom: fwiw though, those errors in the task aren't from envoy per se
[22:26:39] what do you mean?
[22:27:14] "(curl error: 7) Couldn't connect to server" just isn't an error that envoy emits
[22:27:23] no no, that's from mediawiki
[22:27:30] that's the php curl lib
[22:27:52] cool okay :) just wanted to make sure
[22:27:54] serviceops, Content-Transform-Team: Parsoid extension no longer loaded on parsoidtest1001 - https://phabricator.wikimedia.org/T388465#10621433 (Scott_French) @ssastry - Can you recall when was the last time the `parse.php` script worked on parsoidtest1001? Specifically, RT-testing should be fine, since...
[22:28:38] it's from the RESTBagOStuff, which is configured to talk to an envoy listener that forward to Kask
[22:28:45] forwards
[22:29:31] nod
[22:29:49] I don't think we'll have enough per-request logging at Envoy to get the data you want, but I'd be happy to be wrong :/
[22:30:00] if timeseries data is enough to get what you need, there's https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1
[22:30:55] I'm not seeing any in sessionstore that correlates with what mediawiki is logging, so I'm hoping to fill in the gaps there somehow :)
[22:31:06] yeah for sure
[22:33:49] hrmm... yeah, I don't think telemetry is going to tell anything either
[22:34:02] serviceops, Content-Transform-Team: Parsoid extension no longer loaded on parsoidtest1001 - https://phabricator.wikimedia.org/T388465#10621451 (ssastry) As recently as T387608#10592938, and I am certain I ran other additional tests on Monday .. so, probably March 3 even.
[22:34:29] it's basically connection errors
[22:35:58] well, there are upstream errors (...or rather, there *aren't*), which is something
[22:40:50] serviceops, Content-Transform-Team: Parsoid extension no longer loaded on parsoidtest1001 - https://phabricator.wikimedia.org/T388465#10621476 (ssastry) > For the maintenance script use case, I'm not so sure. I see this [0], which suggests there's a special case for setting `SERVERGROUP` for that? Is it...
[22:55:27] serviceops, Content-Transform-Team: Parsoid extension no longer loaded on parsoidtest1001 - https://phabricator.wikimedia.org/T388465#10621487 (Scott_French) Thanks for the follow-up, @ssastry. The timing is really helpful. So, I think I have a theory: When https://gerrit.wikimedia.org/r/c/operations/m...