[00:03:11] (03PS2) 10RLazarus: deployment_server: Add a script for mass-deploying helmfile services [puppet] - 10https://gerrit.wikimedia.org/r/1188456 (https://phabricator.wikimedia.org/T380211) [00:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:04:38] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [00:04:46] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [00:05:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:08:05] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1188478 [00:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1188478 (owner: 10TrainBranchBot) [00:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:08:50] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: apply [00:08:59] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [00:09:14] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [00:11:50] !log 
rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [00:13:23] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [00:13:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [00:13:53] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [00:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:17:15] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:19:46] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [00:19:50] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [00:20:31] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply [00:20:40] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [00:20:52] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [00:21:00] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [00:21:11] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/push-notifications: apply [00:21:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [00:22:06] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [00:22:10] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [00:22:19] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/recommendation-api: apply [00:22:27] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [00:22:36] !log 
rzl@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply [00:22:52] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [00:23:08] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [00:23:13] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [00:23:31] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [00:23:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [00:23:53] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [00:23:57] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [00:24:06] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [00:24:10] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [00:24:21] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [00:24:51] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [00:28:05] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1188478 (owner: 10TrainBranchBot) [00:28:23] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [00:28:31] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [00:29:20] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [00:29:35] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [00:29:45] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [00:29:58] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [00:42:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold 
for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:52:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:57:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:08:04] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) [01:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [01:23:32] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.19 [core] (wmf/1.45.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1188486 (https://phabricator.wikimedia.org/T396380) (owner: 10TrainBranchBot) [01:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad 
in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:33:59] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:44:56] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183385 (10phaultfinder) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0200) [02:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:22:08] (03PS1) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:22:38] (03CR) 10CI reject: [V:04-1] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 (owner: 10Krinkle) [02:23:31] (03PS2) 10Krinkle: varnish: add 
support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:23:57] (03CR) 10CI reject: [V:04-1] varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 (owner: 10Krinkle) [02:24:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183413 (10phaultfinder) [02:26:01] (03PS3) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:28:54] (03PS4) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:30:06] (03PS5) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:31:04] (03PS6) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:31:23] (03PS7) 10Krinkle: varnish: add support for vtc_file_glob to docker_run.sh [puppet] - 10https://gerrit.wikimedia.org/r/1188491 [02:32:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11183419 (10Papaul) @cmooney we have the spare PEM on site. I need to get on a call with Juniper to troubleshooting this. Do you think Thursd... 
[02:34:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:34:08] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:35:17] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11183420 (10Papaul) [02:36:29] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#11183421 (10Papaul) 05Open→03Resolved a:03Papaul The BIO reader is installed now and working. so closing this task [02:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:37:15] (03PS1) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) [02:39:28] (03PS2) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 26 September 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188493 (https://phabricator.wikimedia.org/T404592) [02:43:58] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:43:58] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:48:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:50:10] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183429 (10phaultfinder) [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0300) [03:23:59] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:24:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183455 (10phaultfinder) [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0400) [04:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:04:18] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.16 (duration: 04m 08s) [04:05:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:13:59] FIRING: ProbeDown: Service install3004:8080 has failed probes (http_squid_ip6) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:24:55] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183464 (10phaultfinder) [04:48:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:53:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:02:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:53] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:58] 
FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:03:08] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:07:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:08:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:17:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:24:07] FIRING: HelmReleaseBadStatus: Helm release airflow-dev/file-export-test-instance on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-dev - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:27:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:30:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:32:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:32:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:33:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:34:00] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:37:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:38:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:46:39] (03PS1) 10Huei Tan: xLab: Update the PageVisit target wiki for MinT readers [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) [05:47:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [05:47:20] (03Restored) 10Huei Tan: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [05:47:29] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Tuesday, September 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [05:48:54] Hi, i have 2 patches for later backport, Kartik is not available, can you someone help with the deployment? [05:54:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T404626#11183555 (10phaultfinder) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0600). [06:13:59] FIRING: [14x] CertAlmostExpired: Certificate for service cloudsw1-b1-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:29:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:47:14] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, but this changes an existing sudo rule, so needs SRE IF meeting approval" [puppet] - 10https://gerrit.wikimedia.org/r/1188408 (https://phabricator.wikimedia.org/T404630) (owner: 10CDanis) [06:52:44] RESOLVED: [2x] 
RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:59:56] I can deploy these backports. [07:00:00] o/ [07:00:03] thanks [07:00:04] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250916T0700). nyaa~ [07:00:04] hueitan: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [07:03:14] (03Merged) 10jenkins-bot: xLab: Update the PageVisit target wiki for MinT readers [extensions/WikimediaEvents] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188509 (https://phabricator.wikimedia.org/T404420) (owner: 10Huei Tan) [07:03:40] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] [07:03:45] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [07:07:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:09:20] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 141 jobs 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:09:40] !log awight@deploy1003 awight, hueitan: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:09:45] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [07:11:39] hueitan: Please check on mwdebug [07:11:48] awight: thanks for the deployments! :] [07:12:33] awight checked, see it live now on mwdebug [07:12:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:12:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:12:45] hashar: My pleasure—Spider Pig has not let me down [07:12:51] hueitan: ty [07:12:58] !log awight@deploy1003 awight, hueitan: Continuing with sync [07:13:05] awight: yeah it is quite rad! Maybe one day we will have an equivalent to run Quibble from a web interface! 
:b [07:13:13] the bacula alert will get fixed soon [07:15:03] (03PS2) 10Slyngshede: Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 [07:15:09] (03CR) 10Slyngshede: [C:03+2] Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 (owner: 10Slyngshede) [07:18:15] (03Merged) 10jenkins-bot: Bump CAS container to 7.2.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151178 (owner: 10Slyngshede) [07:18:16] !log awight@deploy1003 Finished scap sync-world: Backport for [[gerrit:1188509|xLab: Update the PageVisit target wiki for MinT readers (T404420)]] (duration: 14m 35s) [07:18:20] T404420: Enable 13 wikis for MinT for Wiki Readers A/A test - https://phabricator.wikimedia.org/T404420 [07:18:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [07:18:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Arelion (2001:2035:0:a9a::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [07:18:45] Finished. On to the second patch... 
[07:18:59] FIRING: [2x] OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:19:21] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:21:04] (03Merged) 10jenkins-bot: XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS [extensions/MetricsPlatform] (wmf/1.45.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1188398 (owner: 10Huei Tan) [07:21:21] !log awight@deploy1003 Started scap sync-world: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] [07:21:25] FIRING: SystemdUnitFailed: logrotate.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:59] (03CR) 10Arnaudb: [C:03+2] mailman: add a local disk cache [puppet] - 10https://gerrit.wikimedia.org/r/1188320 (https://phabricator.wikimedia.org/T353891) (owner: 10Arnaudb) [07:25:47] hueitan: is this one testable? [07:25:55] let me check [07:26:09] maybe I kafkacat or... [07:26:43] hueitan: sorry, it's not quite ready to test yet [07:27:01] I was confusingly asking ahead of time [07:27:40] !log awight@deploy1003 hueitan, awight: Backport for [[gerrit:1188398|XLab\ResourceLoader\Hooks: Add stream to XLAB_STREAMS]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:27:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:58] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:28:02] PROBLEM - mailman3-web on lists1004 is CRITICAL: PROCS CRITICAL: 14 processes with UID = 33 (www-data), regex args /usr/bin/uwsgi https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:28:11] awight i see it live now [07:28:44] !log awight@deploy1003 hueitan, awight: Continuing with sync [07:28:47] hueitan: ack [07:28:55] Thank you, all good