[03:09:44] FIRING: LiftWingServiceErrorRate: ...
[03:09:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ptwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[03:29:44] RESOLVED: LiftWingServiceErrorRate: ...
[03:29:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ptwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[04:55:38] Deploying rec-api in staging (ie https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131128)
[06:54:42] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10677093 (elukey)
[06:55:48] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10677098 (elukey) To keep archives happy - me and Tobias discovered that after a VLAN move a Pybal restart is needed to allow ipvs (on the lb hosts) to pick up...
[07:30:00] Good morning folks o/
[08:06:42] morning morniiing
[08:12:07] morning folks
[08:12:30] isaranto: o/ - as FYI me and Tobias have completed the migration of the codfw prod worker nodes to containerd
[08:12:48] httpbb tests are good afaics, and later on serviceops will repool codfw for services
[08:13:23] please keep it in mind, ideally we shouldn't see any trouble but better aware/safe than sorry :)
[08:13:40] (containerd on Bookworm, so we changed the Linux kernel too)
[08:19:09] awesome! thank you
[09:00:21] (CR) AikoChou: [C:+1] events: log events as JSON serialized output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130883 (https://phabricator.wikimedia.org/T389768) (owner: Kevin Bazira)
[09:00:30] morning!
[09:04:29] (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130883 (https://phabricator.wikimedia.org/T389768) (owner: Kevin Bazira)
[09:13:23] (Merged) jenkins-bot: events: log events as JSON serialized output [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130883 (https://phabricator.wikimedia.org/T389768) (owner: Kevin Bazira)
[09:49:44] klausman: o/ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131282
[09:49:59] Morning!
[10:04:54] elukey: LGTM@
[10:04:57] s/@/!/
[10:08:30] (PS1) Kevin Bazira: RRLA: update predictions field to array as expected by event schema [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131287 (https://phabricator.wikimedia.org/T326179)
[10:48:05] (PS5) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[12:20:06] (CR) Ilias Sarantopoulos: [C:+1] RRLA: update predictions field to array as expected by event schema [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131287 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[12:29:34] Fixing typo in the version, let's see if that fixes failure in staging for rec-api :)
[12:32:02] (CR) Kevin Bazira: [C:+2] "ευχαριστώ :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131287 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[12:39:55] rec-api - success on staging!
[12:41:06] \o/
[12:41:10] (Merged) jenkins-bot: RRLA: update predictions field to array as expected by event schema [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131287 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[12:59:25] (CR) Ilias Sarantopoulos: "Thanks for the work! I did a first pass." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[13:53:12] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team: Expose ORES topics in recent changes filters - https://phabricator.wikimedia.org/T245906#10678506 (Samwalton9-WMF)
[13:53:22] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10678508 (Samwalton9-WMF)
[13:55:48] I made the 2 changes we discussed earlier:
[13:55:48] api-gw change for edit-check - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131323
[13:56:08] increase limitranges/decrease #of maxreplicas for revision models https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131327
[15:31:57] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10679198 (isarantopoulos) Allowing anonymous requests to the edit check service on ml-staging caused a l...
[15:33:43] klausman: could you deploy this if I merge it? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131327
[15:33:49] codfw has been repooled
[15:33:55] nothing on fire so I'd say we are good :)
[15:34:05] I am going to reimage the codfw ctrl nodes, one at the time
[15:34:12] (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131282)
[15:38:08] great!
[15:38:20] ack!
[15:39:15] (CR) Ilias Sarantopoulos: [V:+2 C:+2] reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832 (owner: Ilias Sarantopoulos)
[15:39:24] (PS6) Ilias Sarantopoulos: reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832
[15:39:27] (CR) Ilias Sarantopoulos: [V:+2 C:+2] reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832 (owner: Ilias Sarantopoulos)
[15:49:26] Updating rec-api in production now..
[16:06:20] 'Error: UPGRADE FAILED: an error occurred while finding last successful release. original upgrade error: could not get information about the resource: Get "https://ml-ctrl.svc.codfw.wmnet:6443/apis/networking.istio.io/v1beta1/namespaces/recommendation-api-ng/virtualservices/recommendation-api-ng-main": dial tcp 10.2.1.39:6443: connect: connection refused: Kubernetes cluster unreachable: Get
[16:06:20] "https://ml-ctrl.svc.codfw.wmnet:6443/version": dial tcp 10.2.1.39:6443: connect: connection refused' -- anything wrong with codfw?
[16:06:30] elukey: ^
[16:08:09] oh. Seems it is in progress? elukey
[16:08:26] kart_: o/
[16:08:42] it is strange since we have two hosts pooled, I am working on one
[16:08:48] lemme see
[16:09:48] ah it is completing, maybe it got into a state that passes health checks but don't reply to queries
[16:11:12] kart_: can you retry?
[16:11:48] I explicitly depooled ml-serve-ctrl2001
[16:12:17] https://ml-ctrl.svc.codfw.wmnet:6443/version
[16:12:23] works fine and no hiccups
[16:12:23] sure.
[16:12:53] Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
[16:13:49] right, helm doesn't like us anymore
[16:13:51] checking
[16:16:55] kart_: so I see docker-registry.discovery.wmnet/wikimedia/research-recommendation-api:2025-03-25-091801-production deployed 10 mins ago
[16:17:01] does it look the right one?
[16:17:37] ah. Seems good, but why it failed? :/
[16:18:27] all pods seems OK
[16:18:43] (PS1) Ilias Sarantopoulos: edit-check: add docker compose file for local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100)
[16:18:45] where did you get the UPGRADE failed?
[16:18:52] after a helmfile apply?
[16:20:04] I see, this is the helm status
[16:20:04] 20 Tue Dec 17 14:15:58 2024 deployed python-webapp-0.0.9 Upgrade complete
[16:20:07] 21 Wed Mar 26 16:05:11 2025 pending-upgrade python-webapp-0.0.9 Preparing upgrade
[16:20:29] oh
[16:20:46] Yes. after helmfile apply
[16:21:26] While pods says they are running 28 minutes ago. Strange.
[16:22:31] so I think that since helm got interrupted, we should rollback manually (via helm) to the 20 version, and then re-deploy
[16:22:41] cc: klausman: --^
[16:22:44] lemme know your thoughts
[16:24:24] so basically `helm3 rollback -n recommendation-api-ng main 20`
[16:24:34] inmeeting, but feel free to go ahead
[16:24:35] (please do not execute until we decide :D)
[16:24:44] okok
[16:24:47] kart_: trying
[16:26:03] elukey: can you do that for me? :D
[16:26:14] yes yes I am doing it :)
[16:26:19] :)
[16:26:24] so the pods are getting back to their previous version
[16:26:52] kart_: all right, can you re-deploy?
[16:26:57] I see the diff now
[16:27:01] hopefully it will work
[16:27:33] checking diff.
[16:27:44] FIRING: LiftWingServiceErrorRate: ...
[16:27:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=kowiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:28:18] syncing..
[16:29:02] success!
[16:29:11] Thanks elukey
[16:31:22] kart_: nice!
[16:38:47] * isaranto afk
[16:42:44] RESOLVED: LiftWingServiceErrorRate: ...
[16:42:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=kowiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[19:17:07] Machine-Learning-Team, EditCheck: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10680383 (ppelberg)
[19:17:41] Machine-Learning-Team, EditCheck: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10680386 (ppelberg)
[23:11:14] Machine-Learning-Team, EditCheck: Retrain peacock detection model for production use - https://phabricator.wikimedia.org/T388211#10681150 (SSalgaonkar-WMF)