[00:08:35] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1182965
[00:08:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1182965 (owner: TrainBranchBot)
[00:10:17] SRE, Infrastructure-Foundations, netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11130589 (Papaul) @ayounsi thank you I will get back on this next week since i am off tomorrow and Monday is a U.S holiday.
[00:13:21] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[00:13:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T402925)', diff saved to https://phabricator.wikimedia.org/P82072 and previous config saved to /var/cache/conftool/dbconfig/20250829-001328-ladsgroup.json
[00:13:34] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[00:19:35] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:27] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1182965 (owner: TrainBranchBot)
[00:41:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:48:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:00:43] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:12:30] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 47s)
[01:17:25] (PS10) Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - https://gerrit.wikimedia.org/r/1180969
[01:19:56] (PS11) Krinkle: varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - https://gerrit.wikimedia.org/r/1180969
[01:25:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T402925)', diff saved to https://phabricator.wikimedia.org/P82073 and previous config saved to /var/cache/conftool/dbconfig/20250829-012534-ladsgroup.json
[01:25:39] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:40:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P82074 and previous config saved to /var/cache/conftool/dbconfig/20250829-014041-ladsgroup.json
[01:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:55:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P82075 and previous config saved to /var/cache/conftool/dbconfig/20250829-015549-ladsgroup.json
[01:59:42] (PS12) Krinkle: varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595)
[02:09:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:10:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T402925)', diff saved to https://phabricator.wikimedia.org/P82076 and previous config saved to /var/cache/conftool/dbconfig/20250829-021056-ladsgroup.json
[02:11:12] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:11:13] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[02:11:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T402925)', diff saved to https://phabricator.wikimedia.org/P82077 and previous config saved to /var/cache/conftool/dbconfig/20250829-021120-ladsgroup.json
[02:26:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:46:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:51:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:23:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T402925)', diff saved to https://phabricator.wikimedia.org/P82078 and previous config saved to /var/cache/conftool/dbconfig/20250829-032304-ladsgroup.json
[03:23:10] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[03:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:38:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P82079 and previous config saved to /var/cache/conftool/dbconfig/20250829-033811-ladsgroup.json
[03:53:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P82080 and previous config saved to /var/cache/conftool/dbconfig/20250829-035319-ladsgroup.json
[04:02:25] (PS1) Ryan Kemper: wdqs: (step 3) shift service state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772)
[04:02:49] (PS1) Ryan Kemper: wdqs: (step 2) remove wdqs discovery dns records [dns] - https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772)
[04:02:51] (CR) CI reject: [V:-1] wdqs: (step 3) shift service state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772) (owner: Ryan Kemper)
[04:03:35] (CR) CI reject: [V:-1] wdqs: (step 2) remove wdqs discovery dns records [dns] - https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: Ryan Kemper)
[04:03:44] (PS2) Ryan Kemper: wdqs: (step 3) shift service state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1182975 (https://phabricator.wikimedia.org/T395772)
[04:08:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T402925)', diff saved to https://phabricator.wikimedia.org/P82081 and previous config saved to /var/cache/conftool/dbconfig/20250829-040826-ladsgroup.json
[04:08:33] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[04:08:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[04:08:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T402925)', diff saved to https://phabricator.wikimedia.org/P82082 and previous config saved to /var/cache/conftool/dbconfig/20250829-040849-ladsgroup.json
[04:15:25] (PS1) Ryan Kemper: wdqs: (step 4) remove from LBs and wdqs backends [puppet] - https://gerrit.wikimedia.org/r/1182977 (https://phabricator.wikimedia.org/T395772)
[04:15:27] (PS1) Ryan Kemper: wdqs: (steps 5,6) => final removal [puppet] - https://gerrit.wikimedia.org/r/1182978 (https://phabricator.wikimedia.org/T395772)
[04:20:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[04:49:35] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:55:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[04:59:23] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 06 Oct 2025 08:56:14 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:20:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T402925)', diff saved to https://phabricator.wikimedia.org/P82083 and previous config saved to /var/cache/conftool/dbconfig/20250829-052059-ladsgroup.json
[05:21:05] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[05:28:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:36:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P82084 and previous config saved to /var/cache/conftool/dbconfig/20250829-053606-ladsgroup.json
[05:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:51:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P82085 and previous config saved to /var/cache/conftool/dbconfig/20250829-055113-ladsgroup.json
[05:52:24] SRE, Infrastructure-Foundations, netops, Traffic: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877#11130890 (ayounsi) > the idea is that static routes should help save us in that situation That would only be the case for lvs1016, 1018, 1019 and 1020...
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250829T0600)
[06:06:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T402925)', diff saved to https://phabricator.wikimedia.org/P82086 and previous config saved to /var/cache/conftool/dbconfig/20250829-060621-ladsgroup.json
[06:06:27] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[06:06:36] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance
[06:06:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T402925)', diff saved to https://phabricator.wikimedia.org/P82087 and previous config saved to /var/cache/conftool/dbconfig/20250829-060644-ladsgroup.json
[06:13:37] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Security Update
[06:13:40] (CR) Ayounsi: Nokia: module for interface configuration (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[06:19:11] (PS15) Ayounsi: Nokia: module for interface configuration [homer/public] - https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[06:19:11] (PS9) Ayounsi: Nokia: module for network-instance configuration [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[06:19:11] (PS8) Ayounsi: Nokia: module for OSPF configuration [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney)
[06:30:46] (CR) Ayounsi: [C:+2] Nokia: Add initial Python files for nokia switch system config (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[06:31:00] (CR) Ayounsi: [C:+2] Nokia: module for interface configuration [homer/public] - https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[06:32:04] (Merged) jenkins-bot: Nokia: Add initial Python files for nokia switch system config [homer/public] - https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[06:32:20] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371#11130917 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff I think we can close this task. Any place where DSA keys are acti...
[06:32:23] (Merged) jenkins-bot: Nokia: module for interface configuration [homer/public] - https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[06:33:13] (CR) Ayounsi: [C:+2] Nokia: Add initial Python files for nokia switch system config (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[06:36:04] (CR) Ayounsi: Nokia: module for network-instance configuration (3 comments) [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[06:38:25] (PS1) Muehlenhoff: Remove LDAP access for lgaulia [puppet] - https://gerrit.wikimedia.org/r/1182993
[06:41:31] (CR) Muehlenhoff: [C:+2] Remove LDAP access for lgaulia [puppet] - https://gerrit.wikimedia.org/r/1182993 (owner: Muehlenhoff)
[06:43:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:48:54] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11130928 (phaultfinder)
[06:53:56] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11130929 (phaultfinder)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250829T0700)
[07:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:09:38] (PS1) Majavah: P:doc: Add backwards compat redirect for terraform-cloudvps [puppet] - https://gerrit.wikimedia.org/r/1182995 (https://phabricator.wikimedia.org/T403178)
[07:13:38] (PS1) Majavah: dynamicproxy: Add NEL headers [puppet] - https://gerrit.wikimedia.org/r/1182996 (https://phabricator.wikimedia.org/T403178)
[07:15:31] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11130970 (VRiley-WMF)
[07:15:50] (PS10) Ayounsi: Nokia: module for network-instance configuration [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[07:15:50] (PS9) Ayounsi: Nokia: module for OSPF configuration [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney)
[07:16:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T402925)', diff saved to https://phabricator.wikimedia.org/P82088 and previous config saved to /var/cache/conftool/dbconfig/20250829-071630-ladsgroup.json
[07:16:37] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[07:18:58] (CR) Ozge: [C:+1] ml-services: update revscoring production image [deployment-charts] - https://gerrit.wikimedia.org/r/1182770 (https://phabricator.wikimedia.org/T400350) (owner: Kevin Bazira)
[07:19:43] (PS2) Majavah: dynamicproxy: Add NEL headers [puppet] - https://gerrit.wikimedia.org/r/1182996 (https://phabricator.wikimedia.org/T400994)
[07:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:31:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P82089 and previous config saved to /var/cache/conftool/dbconfig/20250829-073138-ladsgroup.json
[07:37:58] (CR) Filippo Giunchedi: "I'm +1 on the alerts for projects we manage. For the rest, how would the alerts/task notification to users work? I'm asking because I worr" [alerts] - https://gerrit.wikimedia.org/r/1182900 (https://phabricator.wikimedia.org/T402932) (owner: David Caro)
[07:38:02] (CR) Tiziano Fogli: [C:+1] "Thank you" [puppet] - https://gerrit.wikimedia.org/r/1180642 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite)
[07:39:11] (CR) Kevin Bazira: [C:+2] ml-services: update revscoring production image [deployment-charts] - https://gerrit.wikimedia.org/r/1182770 (https://phabricator.wikimedia.org/T400350) (owner: Kevin Bazira)
[07:39:56] (CR) Tiziano Fogli: [C:+2] icinga/audit: add script to dump defined checks [software] - https://gerrit.wikimedia.org/r/1182571 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli)
[07:41:39] (Merged) jenkins-bot: ml-services: update revscoring production image [deployment-charts] - https://gerrit.wikimedia.org/r/1182770 (https://phabricator.wikimedia.org/T400350) (owner: Kevin Bazira)
[07:46:20] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:46:37] (CR) Arnaudb: [C:+2] Revert^3 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182825 (owner: Arnaudb)
[07:46:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P82090 and previous config saved to /var/cache/conftool/dbconfig/20250829-074645-ladsgroup.json
[07:48:54] (CR) Ayounsi: "Added a few inline comments, I can take care of tackling them if you're ok." [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[07:49:21] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:58:39] (CR) Cathal Mooney: Nokia: module for network-instance configuration (5 comments) [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:01:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T402925)', diff saved to https://phabricator.wikimedia.org/P82091 and previous config saved to /var/cache/conftool/dbconfig/20250829-080153-ladsgroup.json
[08:01:59] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[08:02:09] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance
[08:02:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T402925)', diff saved to https://phabricator.wikimedia.org/P82092 and previous config saved to /var/cache/conftool/dbconfig/20250829-080216-ladsgroup.json
[08:07:20] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11131038 (Jgiannelos) There actually is a problem. Here is a screenshot from a jupyter notebook when querying the service directly: {F65929908}
[08:09:24] (CR) Volans: "reply inline" [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:10:52] (CR) Ayounsi: Nokia: module for network-instance configuration (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:11:47] (PS11) Ayounsi: Nokia: module for network-instance configuration [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:11:47] (PS10) Ayounsi: Nokia: module for OSPF configuration [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney)
[08:15:01] (CR) Ayounsi: Nokia: module for OSPF configuration (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney)
[08:19:13] PROBLEM - Squid on install2004 is CRITICAL: connect to address 208.80.153.105 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy
[08:19:50] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on install2004.wikimedia.org with reason: being replaced by install2005
[08:20:45] (CR) Cathal Mooney: Nokia: module for OSPF configuration (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney)
[08:22:24] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[08:24:47] (CR) Ayounsi: Nokia: module for network-instance configuration (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:25:29] (PS1) Muehlenhoff: Blacklist jffs2 [puppet] - https://gerrit.wikimedia.org/r/1183072
[08:27:25] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[08:32:45] FIRING: Traffic bill over quota: Alert for device cr3-ulsfo.wikimedia.org - Traffic bill over quota Has improved - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[08:34:05] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11131072 (VRiley-WMF)
[08:36:23] (CR) Ayounsi: Nokia: module for network-instance configuration (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:36:49] (PS12) Ayounsi: Nokia: module for network-instance configuration [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:36:49] (PS11) Ayounsi: Nokia: module for OSPF configuration [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney)
[08:38:04] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[08:40:12] (CR) Ayounsi: [C:+2] Nokia: module for network-instance configuration [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:40:24] !log vriley@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:40:31] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[08:40:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:42:14] (Merged) jenkins-bot: Nokia: module for network-instance configuration [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:43:24] (PS1) Muehlenhoff: Add failoid[12]003 [puppet] - https://gerrit.wikimedia.org/r/1183073 (https://phabricator.wikimedia.org/T402406)
[08:44:07] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt maps1012 - vriley@cumin1003"
[08:44:09] SRE, SRE-Access-Requests: Update SSH key for Connie Chen - https://phabricator.wikimedia.org/T403242 (cchen) NEW
[08:44:11] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt maps1012 - vriley@cumin1003"
[08:44:11] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:44:14] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[08:47:26] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:47:41] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host maps1011
[08:48:05] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[08:48:54] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps1011
[08:49:01] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host maps1012
[08:49:28] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[08:49:35] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:15] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps1012
[08:50:52] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:51:10] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[08:51:30] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[08:52:30] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11131111 (VRiley-WMF)
[08:52:46] RESOLVED: Traffic bill over quota: Alert for device cr3-ulsfo.wikimedia.org - Traffic bill over quota Has improved - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[08:53:07] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[08:54:18] (CR) Muehlenhoff: [C:+2] Add failoid[12]003 [puppet] - https://gerrit.wikimedia.org/r/1183073 (https://phabricator.wikimedia.org/T402406) (owner: Muehlenhoff)
[08:58:52] (CR) Ayounsi: "change lgtm other than the current comment." [homer/public] - https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[08:59:31] ops-codfw, DC-Ops: PSU issue on es2055 - https://phabricator.wikimedia.org/T403243 (FCeratto-WMF) NEW
[09:00:18] SRE, Maps, Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11131131 (TheDJ) Do extensions set a referrer header ? As volunteer im supportive in spirit. i don't think when the policy was designed any of us had considered brow...
[09:02:10] vriley@cumin1003 provision (PID 51164) is awaiting input
[09:02:39] SRE, Maps, Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11131135 (TheDJ)
[09:04:03] SRE, Maps, Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11131140 (TheDJ) > The extension can also display maps from Wikibase instances other than Wikidata. In those cases I can still display the standard openstreetmap tile...
[09:08:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host failoid2003.codfw.wmnet
[09:08:51] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:11:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T402925)', diff saved to https://phabricator.wikimedia.org/P82093 and previous config saved to /var/cache/conftool/dbconfig/20250829-091108-ladsgroup.json
[09:11:15] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[09:11:34] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:13:14] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM failoid2003.codfw.wmnet - jmm@cumin2002"
[09:13:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM failoid2003.codfw.wmnet - jmm@cumin2002"
[09:13:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:13:20] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache failoid2003.codfw.wmnet on all recursors
[09:13:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) failoid2003.codfw.wmnet on all recursors
[09:13:55] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM failoid2003.codfw.wmnet - jmm@cumin2002"
[09:14:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM failoid2003.codfw.wmnet - jmm@cumin2002"
[09:15:14] (CR) Ayounsi: [C:+1] "logic lgtm, one suggestion inline, feel free to ignore it" [software/homer/deploy] - https://gerrit.wikimedia.org/r/1182796 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:15:23] (CR) Ayounsi: [C:+1] JunOS IBGP: adjust template to work with updated data from plugin [homer/public] - https://gerrit.wikimedia.org/r/1182797 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[09:16:01] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:17:03] jmm@cumin2002 makevm (PID 1888098) is awaiting input [09:19:48] (03PS8) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [09:20:14] (03CR) 10FNegri: [C:03+1] dynamicproxy: Add NEL headers [puppet] - 10https://gerrit.wikimedia.org/r/1182996 (https://phabricator.wikimedia.org/T400994) (owner: 10Majavah) [09:21:12] (03CR) 10Majavah: [C:03+2] dynamicproxy: Add NEL headers [puppet] - 10https://gerrit.wikimedia.org/r/1182996 (https://phabricator.wikimedia.org/T400994) (owner: 10Majavah) [09:22:57] (03CR) 10Ayounsi: WIP: use Homer to configure the network (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [09:24:11] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:24:13] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[09:24:53] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add failoid2003 - jmm@cumin2002" [09:24:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add failoid2003 - jmm@cumin2002" [09:25:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host failoid2003.codfw.wmnet with OS trixie [09:25:15] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:25:16] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move failoid to trixie - https://phabricator.wikimedia.org/T402406#11131157 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host failoid2003.codfw.wmnet with OS trixie [09:25:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host failoid2003.codfw.wmnet with OS trixie [09:25:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host failoid2003.codfw.wmnet [09:25:32] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move failoid to trixie - https://phabricator.wikimedia.org/T402406#11131158 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host failoid2003.codfw.wmnet with OS trixie executed with errors: - failoid2003 (**FAIL... 
[09:26:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P82094 and previous config saved to /var/cache/conftool/dbconfig/20250829-092615-ladsgroup.json [09:27:11] (03PS1) 10Fabfur: profile:cache: remove varnishkafka (webrequest) from cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) [09:27:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host failoid1003.eqiad.wmnet [09:27:17] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:29:34] (03CR) 10FNegri: [C:03+1] "Finally took the time to review this, apologies for the long delay!" [puppet] - 10https://gerrit.wikimedia.org/r/1153563 (owner: 10Majavah) [09:29:47] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [09:30:08] (03CR) 10Majavah: [C:03+2] maintain-dbusers: harvest: Do not create PAWS account on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/1153563 (owner: 10Majavah) [09:30:15] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[09:31:15] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM failoid1003.eqiad.wmnet - jmm@cumin2002" [09:31:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM failoid1003.eqiad.wmnet - jmm@cumin2002" [09:31:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:31:21] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache failoid1003.eqiad.wmnet on all recursors [09:31:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) failoid1003.eqiad.wmnet on all recursors [09:31:26] (03PS9) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [09:31:37] (03CR) 10FNegri: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [09:31:55] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM failoid1003.eqiad.wmnet - jmm@cumin2002" [09:32:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM failoid1003.eqiad.wmnet - jmm@cumin2002" [09:32:01] (03CR) 10Ayounsi: WIP: use Homer to configure the network (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [09:33:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host failoid1003.eqiad.wmnet with OS trixie [09:33:09] 06SRE, 06Infrastructure-Foundations: Move failoid to trixie - https://phabricator.wikimedia.org/T402406#11131195 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host failoid1003.eqiad.wmnet with OS trixie 
[09:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:52] (03PS3) 10Clément Goubert: rest-gateway: Introduce rest-gateway-ro [puppet] - 10https://gerrit.wikimedia.org/r/1182852 (https://phabricator.wikimedia.org/T400131) [09:36:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182798 (https://phabricator.wikimedia.org/T280532) (owner: 10Mszwarc) [09:37:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182692 (https://phabricator.wikimedia.org/T403148) (owner: 10Mszwarc) [09:37:31] (03PS3) 10Clément Goubert: wmnet: Introduce rest-gateway-ro [dns] - 10https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) [09:39:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11131205 (10VRiley-WMF) [09:39:48] (03PS4) 10Clément Goubert: wmnet: Introduce rest-gateway-ro [dns] - 10https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) [09:41:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P82095 and previous config saved to /var/cache/conftool/dbconfig/20250829-094123-ladsgroup.json [09:43:45] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[09:44:31] (03PS1) 10Clément Goubert: rest-gateway: Switch rest-gateway to A/P [puppet] - 10https://gerrit.wikimedia.org/r/1183084 (https://phabricator.wikimedia.org/T400131) [09:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:45:04] (03PS1) 10Clément Goubert: wmnet: Switch rest-gateway to metafo [dns] - 10https://gerrit.wikimedia.org/r/1183085 (https://phabricator.wikimedia.org/T402412) [09:47:27] (03CR) 10Clément Goubert: "No, thank you for providing a comprehensive answer to the question I've been asking myself since I posted this patch 😄" [dns] - 10https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [09:47:44] (03CR) 10Clément Goubert: "Done" [dns] - 10https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [09:48:13] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:49:00] (03CR) 10FNegri: [C:03+1] P:doc: Add backwards compat redirect for terraform-cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/1182995 (https://phabricator.wikimedia.org/T403178) (owner: 10Majavah) [09:49:36] (03CR) 10Majavah: [C:03+2] P:doc: Add backwards compat redirect for terraform-cloudvps [puppet] - 10https://gerrit.wikimedia.org/r/1182995 (https://phabricator.wikimedia.org/T403178) (owner: 10Majavah) [09:51:18] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:53:09] (03PS1) 10Clément Goubert: rest-gateway: Add rest-gateway-ro domain matchers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183086 (https://phabricator.wikimedia.org/T400131) [09:53:49] (03PS4) 10Clément Goubert: rest-gateway: Introduce rest-gateway-ro [puppet] - 10https://gerrit.wikimedia.org/r/1182852 (https://phabricator.wikimedia.org/T400131) [09:53:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on failoid1003.eqiad.wmnet with reason: host reimage [09:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:55:56] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:55:58] !log taavi@doc2003 ~ $ sudo rm -rf /srv/doc/cloud/cloud-vps/terraform-cloudvps/ # T403178 [09:56:01] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:05] T403178: Rename terraform-cloudvps repo to tofu-cloudvps - https://phabricator.wikimedia.org/T403178 [09:56:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T402925)', diff saved to https://phabricator.wikimedia.org/P82096 and previous config saved to /var/cache/conftool/dbconfig/20250829-095631-ladsgroup.json [09:56:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11131275 (10VRiley-WMF) [09:56:37] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - 
https://phabricator.wikimedia.org/T402925 [09:56:46] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance [09:56:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T402925)', diff saved to https://phabricator.wikimedia.org/P82097 and previous config saved to /var/cache/conftool/dbconfig/20250829-095653-ladsgroup.json [09:58:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on failoid1003.eqiad.wmnet with reason: host reimage [10:04:55] (03CR) 10Muehlenhoff: "I wasn't sure whether the binaries provided already support trixie, but I've just tested it in an nspawn container and they seem to work j" [puppet] - 10https://gerrit.wikimedia.org/r/1182835 (https://phabricator.wikimedia.org/T393437) (owner: 10Muehlenhoff) [10:11:47] (03Abandoned) 10Muehlenhoff: profile::toolforge::docker::image_builder: No longer use docker::baseimages [puppet] - 10https://gerrit.wikimedia.org/r/911331 (https://phabricator.wikimedia.org/T335282) (owner: 10Muehlenhoff) [10:13:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host failoid1003.eqiad.wmnet with OS trixie [10:13:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host failoid1003.eqiad.wmnet [10:13:46] 06SRE, 06Infrastructure-Foundations: Move failoid to trixie - https://phabricator.wikimedia.org/T402406#11131336 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host failoid1003.eqiad.wmnet with OS trixie completed: - failoid1003 (**PASS**) - Removed from Puppet and P... 
[10:27:11] (03CR) 10Vgutierrez: profile:cache: remove varnishkafka (webrequest) from cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [10:28:18] (03PS2) 10Muehlenhoff: Add repository sync definition for node 22 [puppet] - 10https://gerrit.wikimedia.org/r/1182835 (https://phabricator.wikimedia.org/T393437) [10:28:32] (03CR) 10Vgutierrez: profile:cache: remove varnishkafka (webrequest) from cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [10:33:06] (03CR) 10Fabfur: profile:cache: remove varnishkafka (webrequest) from cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [10:37:17] (03CR) 10Vgutierrez: profile:cache: remove varnishkafka (webrequest) from cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [10:40:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:45:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:52:12] (03CR) 10Jforrester: "That should be fine, Trixie as the base will be great. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1182835 (https://phabricator.wikimedia.org/T393437) (owner: 10Muehlenhoff) [10:53:53] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11131428 (10phaultfinder) [10:54:17] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:58:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host failoid2003.codfw.wmnet with OS trixie [10:58:54] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11131442 (10phaultfinder) [10:58:55] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host failoid2003.codfw.wmnet with OS trixie [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250829T0700) [11:00:05] jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250829T1100). Please do the needful. 
[11:02:58] (03CR) 10Urbanecm: [C:04-1] [Growth] enwiki: Deploy "Add a link" to 100% of users (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [11:03:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T402925)', diff saved to https://phabricator.wikimedia.org/P82104 and previous config saved to /var/cache/conftool/dbconfig/20250829-110304-ladsgroup.json [11:03:10] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:03:16] arnaudb@cumin1003 arnaudb: The backup on gitlab1004 is complete, ready to proceed with upgrade. [11:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:06:16] arnaudb@cumin1003 upgrade (PID 29024) is awaiting input [11:09:53] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11131449 (10Ladsgroup) These are refreshes so they should replace es1026-es1034. Feel free to rack these where the old ones are. 
[11:10:04] (03CR) 10Peter Fischer: [C:03+2] SUP: upgrade to flink 1.20.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182503 (https://phabricator.wikimedia.org/T398159) (owner: 10Peter Fischer) [11:11:44] (03Merged) 10jenkins-bot: SUP: upgrade to flink 1.20.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182503 (https://phabricator.wikimedia.org/T398159) (owner: 10Peter Fischer) [11:16:55] !log pfischer@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:17:50] (03CR) 10Ladsgroup: "Would you mind doing it one by one as I asked?" [puppet] - 10https://gerrit.wikimedia.org/r/1182592 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:18:08] !log pfischer@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:18:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P82105 and previous config saved to /var/cache/conftool/dbconfig/20250829-111812-ladsgroup.json [11:19:44] (03CR) 10Ladsgroup: [C:03+1] es2049.yaml, site.pp: Prepare es2049 to replace es2026 [puppet] - 10https://gerrit.wikimedia.org/r/1182593 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:25:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [11:26:45] (03PS1) 10Peter Fischer: SUP: upgrade to flink 1.20.1 (use latest, existing image) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183095 (https://phabricator.wikimedia.org/T398159) [11:26:59] (03CR) 10Peter Fischer: [C:03+2] SUP: upgrade to flink 1.20.1 (use latest, existing 
image) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183095 (https://phabricator.wikimedia.org/T398159) (owner: 10Peter Fischer) [11:28:39] (03Merged) 10jenkins-bot: SUP: upgrade to flink 1.20.1 (use latest, existing image) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183095 (https://phabricator.wikimedia.org/T398159) (owner: 10Peter Fischer) [11:29:30] !log pfischer@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:29:40] !log pfischer@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:32:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [11:33:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P82106 and previous config saved to /var/cache/conftool/dbconfig/20250829-113320-ladsgroup.json [11:35:45] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [11:38:08] (03PS13) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [11:41:22] (03PS2) 10Federico Ceratto: preseed.yaml: Remove es2049 [puppet] - 10https://gerrit.wikimedia.org/r/1182592 (https://phabricator.wikimedia.org/T402859) [11:41:23] (03PS2) 10Federico Ceratto: es2049.yaml, site.pp: Prepare es2049 to replace es2026 [puppet] - 10https://gerrit.wikimedia.org/r/1182593 (https://phabricator.wikimedia.org/T402859) [11:41:46] !log pfischer@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:42:07] !log pfischer@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:42:27] !log pfischer@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:42:33] !log pfischer@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:42:35] (03PS1) 10Muehlenhoff: Set install2004 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1183097 (https://phabricator.wikimedia.org/T396487) [11:42:43] (03CR) 10Federico Ceratto: "Ok, I also updated the preseed change to remove only one host." 
[puppet] - 10https://gerrit.wikimedia.org/r/1182592 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [11:42:48] (03PS2) 10Muehlenhoff: Set install2004 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1183097 (https://phabricator.wikimedia.org/T396487) [11:43:28] (03CR) 10Muehlenhoff: [C:03+2] Set install2004 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1183097 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [11:46:21] (03CR) 10Stevemunene: "I think we might need to also remove" [dns] - 10https://gerrit.wikimedia.org/r/1182976 (https://phabricator.wikimedia.org/T395772) (owner: 10Ryan Kemper) [11:48:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T402925)', diff saved to https://phabricator.wikimedia.org/P82107 and previous config saved to /var/cache/conftool/dbconfig/20250829-114827-ladsgroup.json [11:48:34] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:48:43] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2227.codfw.wmnet with reason: Maintenance [11:48:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T402925)', diff saved to https://phabricator.wikimedia.org/P82108 and previous config saved to /var/cache/conftool/dbconfig/20250829-114850-ladsgroup.json [11:49:20] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6793/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [11:49:37] !log pfischer@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:49:58] !log pfischer@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:52:45] RESOLVED: 
CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [11:55:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host failoid2003.codfw.wmnet with OS trixie [11:58:24] 06SRE, 06Traffic, 10API Platform (RESTBase Deprecation Roadmap): Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423#11131560 (10hashar) In deed, as part of T400119 that was done by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1181639 . The... [11:58:40] 06SRE, 06Traffic, 10API Platform (RESTBase Deprecation Roadmap): Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423#11131565 (10hashar) →14Duplicate dup:03T400119 [11:58:53] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11131562 (10hashar) [12:01:11] (03CR) 10Hnowlan: [C:03+1] rest-gateway: Add rest-gateway-ro domain matchers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183086 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [12:06:23] (03PS14) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [12:08:06] (03CR) 10Muehlenhoff: [C:03+2] Add repository sync definition for node 22 [puppet] - 10https://gerrit.wikimedia.org/r/1182835 (https://phabricator.wikimedia.org/T393437) (owner: 10Muehlenhoff) [12:09:01] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - 
https://phabricator.wikimedia.org/T396487#11131585 (10MoritzMuehlenhoff) [12:12:48] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on failoid2003.codfw.wmnet with reason: host reimage [12:18:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on failoid2003.codfw.wmnet with reason: host reimage [12:22:18] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:22:37] (03CR) 10Elukey: [C:03+1] Blacklist jffs2 [puppet] - 10https://gerrit.wikimedia.org/r/1183072 (owner: 10Muehlenhoff) [12:22:39] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:25:32] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd1004-10015 - https://phabricator.wikimedia.org/T402881#11131654 (10Jclark-ctr) a:03Jclark-ctr [12:28:09] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:28:25] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:28:36] (03PS2) 10Muehlenhoff: Provide Node 22 image, using thirdparty/node22 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182837 (https://phabricator.wikimedia.org/T393437) (owner: 10Jforrester) [12:29:14] (03CR) 10Ladsgroup: [C:03+1] preseed.yaml: Remove es2049 [puppet] - 10https://gerrit.wikimedia.org/r/1182592 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [12:29:52] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd1004-10015 - https://phabricator.wikimedia.org/T402881#11131690 (10Jclark-ctr) [12:30:45] (03PS1) 10Ayounsi: Nokia OSPF: different proposal [homer/public] - 10https://gerrit.wikimedia.org/r/1183099 [12:31:09] (03PS2) 10Ayounsi: Nokia OSPF: different proposal [homer/public] - 10https://gerrit.wikimedia.org/r/1183099 [12:32:32] (03CR) 
10CI reject: [V:04-1] Nokia OSPF: different proposal [homer/public] - 10https://gerrit.wikimedia.org/r/1183099 (owner: 10Ayounsi) [12:33:37] (03PS3) 10Ayounsi: Nokia OSPF: different proposal [homer/public] - 10https://gerrit.wikimedia.org/r/1183099 [12:33:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host failoid2003.codfw.wmnet with OS trixie [12:33:43] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:33:52] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:34:32] (03CR) 10Ayounsi: Nokia: module for OSPF configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1181132 (owner: 10Cathal Mooney) [12:35:59] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [12:36:53] (03CR) 10Vgutierrez: [C:03+1] P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [12:38:15] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:39:15] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 114904 bytes in 1.092 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:40:49] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:41:35] (03PS1) 10Muehlenhoff: Assign failoid role to failoid2003 [puppet] - 10https://gerrit.wikimedia.org/r/1183100 (https://phabricator.wikimedia.org/T402406) [12:41:37] (03PS1) 10Muehlenhoff: Failover 
failoid in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1183101 (https://phabricator.wikimedia.org/T402406) [12:41:44] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Security Update [12:48:54] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd1004-10015 - https://phabricator.wikimedia.org/T402881#11131743 (10Jclark-ctr) [12:49:11] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcephosd1004-10015 - https://phabricator.wikimedia.org/T402881#11131744 (10Jclark-ctr) 05Open→03Resolved [12:49:35] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:48] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11131751 (10elukey) @TheDJ thanks for the ping! I repooled eqiad after the test, I see some differences before and after, but I am not 100% sure why. [12:55:49] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:57:35] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11131759 (10elukey) It seems that Virtualization cannot be disabled in the BIOS processor settings, see the Web UI: {F65930889} This seems to follow something... 
[12:57:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T402925)', diff saved to https://phabricator.wikimedia.org/P82109 and previous config saved to /var/cache/conftool/dbconfig/20250829-125745-ladsgroup.json [12:57:52] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [12:59:25] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11131766 (10Jclark-ctr) @Jhancock.wm These servers are still failing to image. @elukey @MoritzMuehlenhoff , could the older maps* entries in preseed.yaml be causing the issue? 'ma... [13:05:31] (03CR) 10Federico Ceratto: [C:03+2] preseed.yaml: Remove es2049 [puppet] - 10https://gerrit.wikimedia.org/r/1182592 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [13:05:40] (03CR) 10Federico Ceratto: [C:03+2] es2049.yaml, site.pp: Prepare es2049 to replace es2026 [puppet] - 10https://gerrit.wikimedia.org/r/1182593 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [13:06:53] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:10:22] (03PS2) 10Reedy: InitialiseSettings-labs.php: Don't enable hCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183104 [13:12:28] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [13:12:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P82110 and previous config saved to /var/cache/conftool/dbconfig/20250829-131253-ladsgroup.json [13:13:04] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11131802 (10elukey) I have depooled codfw again, until we'll solve the issue that Yiannis 
highlighted. [13:13:34] (03CR) 10Reedy: [C:03+2] InitialiseSettings-labs.php: Don't enable hCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183104 (owner: 10Reedy) [13:14:37] (03Merged) 10jenkins-bot: InitialiseSettings-labs.php: Don't enable hCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183104 (owner: 10Reedy) [13:15:16] (03CR) 10Elukey: WIP - sre.hosts.provision: fix PXE settings for Dell iDRAC 10 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [13:16:41] (03PS1) 10Vgutierrez: hiera: Enable JA3N on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1183105 (https://phabricator.wikimedia.org/T400270) [13:17:06] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:17:30] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183105 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [13:18:00] (03PS1) 10Ayounsi: Nokia: /routing-policy [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 [13:18:54] (03CR) 10Ayounsi: [C:03+2] Nokia: Add initial Python files for nokia switch system config (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:18:58] (03PS1) 10Kosta Harlan: hCaptcha: Disable hCaptcha for API contexts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183109 (https://phabricator.wikimedia.org/T403263) [13:19:23] (03CR) 10CI reject: [V:04-1] Nokia: /routing-policy [homer/public] - 10https://gerrit.wikimedia.org/r/1183108 (owner: 10Ayounsi) [13:20:33] (03PS2) 10Vgutierrez: hiera: Enable JA3N on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1183105 (https://phabricator.wikimedia.org/T400270) [13:20:34] (03PS2) 10Ayounsi: Nokia: /routing-policy [homer/public] - 
10https://gerrit.wikimedia.org/r/1183108 [13:20:39] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183105 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [13:21:27] (03CR) 10CDanis: [C:03+1] hiera: Enable JA3N on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1183105 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [13:22:50] (03CR) 10Fabfur: [C:03+1] hiera: Enable JA3N on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1183105 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [13:23:15] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable JA3N on cp7009 [puppet] - 10https://gerrit.wikimedia.org/r/1183105 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [13:26:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11131825 (10MoritzMuehlenhoff) Can you please retry? I fixed a problem with the codfw install server setup earlier in the (Euro) day, so it might simply work now. 
[13:28:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P82112 and previous config saved to /var/cache/conftool/dbconfig/20250829-132800-ladsgroup.json [13:28:32] (03PS2) 10Fabfur: profile:cache: remove varnishkafka (webrequest) from cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) [13:30:03] (03CR) 10CI reject: [V:04-1] profile:cache: remove varnishkafka (webrequest) from cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [13:30:15] (03PS3) 10Fabfur: profile:cache: remove varnishkafka (webrequest) from cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) [13:30:52] (03CR) 10Fabfur: profile:cache: remove varnishkafka (webrequest) from cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [13:30:55] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [13:32:00] (03CR) 10CI reject: [V:04-1] profile:cache: remove varnishkafka (webrequest) from cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [13:32:53] (03PS1) 10Kosta Harlan: hCaptcha: Provide label/help in authmanagerinfo API calls [extensions/ConfirmEdit] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183112 (https://phabricator.wikimedia.org/T403253) [13:33:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183112 (https://phabricator.wikimedia.org/T403253) (owner: 10Kosta Harlan) 
[13:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:36:48] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:37:28] (03CR) 10Vgutierrez: profile:cache: remove varnishkafka (webrequest) from cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [13:38:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:39:40] (03PS4) 10Fabfur: profile:cache: remove varnishkafka (webrequest) from cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) [13:39:51] (03CR) 10Fabfur: profile:cache: remove varnishkafka (webrequest) from cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [13:39:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [13:43:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T402925)', diff 
saved to https://phabricator.wikimedia.org/P82113 and previous config saved to /var/cache/conftool/dbconfig/20250829-134308-ladsgroup.json [13:43:14] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [13:43:24] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2239.codfw.wmnet with reason: Maintenance [13:43:58] !log restarting blazegraph wdqs on wdqs1022 (stuck) [13:43:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:44:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:46:13] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11131899 (10elukey) I found the following in the codfw logs: ` {"name":"kartotherian","hostname":"kartotherian-main-7b9966894b-xq7jh","pid":1,"level":"WARN","levelPath":"warn","reque... 
[13:47:00] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [13:49:17] RESOLVED: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:54:16] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11131901 (10TheDJ) Autopositioning, is where it takes coordinates from the OSM object and focuses around the object, instead of any coordinates given directly. At least if im not mis... [13:56:36] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11131904 (10CKoerner_WMF) @joe Could I ask for a two week exemption for diff.wikimedia.org until we have our next sprint with our devs? Right now folks can... [14:07:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:08:29] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11131947 (10elukey) Another very relevant log: ` {"name":"kartotherian","hostname":"kartotherian-main-7b9966894b-xq7jh","pid":1,"level":"WARN","levelPath":"warn","msg":"No results fr... [14:11:57] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11131955 (10MoritzMuehlenhoff) >>! 
In T400637#11131765, @Jclark-ctr wrote: > @Jhancock.wm These servers are still failing to image. @elukey @MoritzMuehlenhoff , could the older maps*... [14:13:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:17:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:17:53] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11131984 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. [14:17:57] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11131988 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. [14:18:02] 10ops-magru: Unresponsive management for cp7005.mgmt:22 - https://phabricator.wikimedia.org/T402128#11131992 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. 
[14:18:09] 10ops-magru: Unresponsive management for lvs7001.mgmt:22 - https://phabricator.wikimedia.org/T402127#11131996 (10RobH) 05Open→03Resolved a:03RobH [14:18:11] 10ops-magru: Unresponsive management for cp7008.mgmt:22 - https://phabricator.wikimedia.org/T402126#11132001 (10RobH) 05Open→03Resolved a:03RobH [14:18:16] 10ops-magru: Unresponsive management for cp7009.mgmt:22 - https://phabricator.wikimedia.org/T402125#11132003 (10RobH) 05Open→03Resolved a:03RobH [14:18:19] 10ops-magru: Unresponsive management for lvs7003.mgmt:22 - https://phabricator.wikimedia.org/T402124#11132005 (10RobH) 05Open→03Resolved a:03RobH [14:18:24] 10ops-magru: Unresponsive management for lvs7002.mgmt:22 - https://phabricator.wikimedia.org/T402123#11132007 (10RobH) 05Open→03Resolved a:03RobH [14:18:28] 10ops-magru: Unresponsive management for ganeti7003.mgmt:22 - https://phabricator.wikimedia.org/T402122#11132009 (10RobH) 05Open→03Resolved a:03RobH [14:18:34] 10ops-magru: Unresponsive management for dns7002.mgmt:22 - https://phabricator.wikimedia.org/T402121#11132011 (10RobH) 05Open→03Resolved a:03RobH [14:18:41] 10ops-magru: Unresponsive management for cp7002.mgmt:22 - https://phabricator.wikimedia.org/T402120#11132013 (10RobH) 05Open→03Resolved a:03RobH [14:18:47] 10ops-magru: Unresponsive management for ganeti7001.mgmt:22 - https://phabricator.wikimedia.org/T402119#11132016 (10RobH) 05Open→03Resolved a:03RobH [14:18:52] 10ops-magru: Unresponsive management for cp7004.mgmt:22 - https://phabricator.wikimedia.org/T402118#11132018 (10RobH) 05Open→03Resolved a:03RobH [14:18:56] 10ops-magru: Unresponsive management for cp7013.mgmt:22 - https://phabricator.wikimedia.org/T402117#11132020 (10RobH) 05Open→03Resolved a:03RobH [14:19:01] 10ops-magru: Unresponsive management for cp7011.mgmt:22 - https://phabricator.wikimedia.org/T402116#11132022 (10RobH) 05Open→03Resolved a:03RobH [14:19:07] 10ops-magru: Unresponsive management for ganeti7004.mgmt:22 - 
https://phabricator.wikimedia.org/T402115#11132024 (10RobH) 05Open→03Resolved a:03RobH [14:19:11] 10ops-magru: Unresponsive management for cp7007.mgmt:22 - https://phabricator.wikimedia.org/T402114#11132026 (10RobH) 05Open→03Resolved a:03RobH [14:19:17] 10ops-magru: Unresponsive management for cp7001.mgmt:22 - https://phabricator.wikimedia.org/T402113#11132038 (10RobH) 05Open→03Resolved a:03RobH [14:19:19] 10ops-magru: Unresponsive management for cp7012.mgmt:22 - https://phabricator.wikimedia.org/T402112#11132040 (10RobH) 05Open→03Resolved a:03RobH [14:19:26] (03PS1) 10Vgutierrez: haproxy: Temporary UA policy exemption for Automattic [puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) [14:20:04] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:20:08] 10ops-magru: Power Supply - PS Redundancy - issue on ganeti7001:9290 - https://phabricator.wikimedia.org/T399525#11132045 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. [14:20:10] 10ops-magru: Power Supply - Status - issue on dns7002:9290 - https://phabricator.wikimedia.org/T399549#11132049 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. [14:20:14] 10ops-magru: Power Supply - Status - issue on cp7010:9290 - https://phabricator.wikimedia.org/T402096#11132053 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. [14:20:15] 10ops-magru: Unresponsive management for dns7001.mgmt:22 - https://phabricator.wikimedia.org/T402105#11132057 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. 
[14:20:20] 10ops-magru: Unresponsive management for cp7015.mgmt:22 - https://phabricator.wikimedia.org/T402107#11132065 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. [14:20:26] 10ops-magru: Unresponsive management for cp7014.mgmt:22 - https://phabricator.wikimedia.org/T402108#11132069 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. [14:20:30] 10ops-magru: Unresponsive management for cp7003.mgmt:22 - https://phabricator.wikimedia.org/T402109#11132074 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. [14:20:32] 10ops-magru: Unresponsive management for ganeti7002.mgmt:22 - https://phabricator.wikimedia.org/T402111#11132078 (10RobH) 05Open→03Resolved a:03RobH This was due to power work in the site and expected, resolving. [14:20:35] 10ops-magru: Unresponsive management for cp7010.mgmt:22 - https://phabricator.wikimedia.org/T402110#11132081 (10RobH) 05Open→03Resolved a:03RobH [14:20:55] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11132084 (10Jhancock.wm) @Clement_Goubert i think their might be a mismatch regarding the site.pp on this one. I am not sure exactly what it is. we had a similar issue in T400195. Could... 
[14:21:05] (03CR) 10CDanis: [C:03+1] haproxy: Temporary UA policy exemption for Automattic [puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:21:58] (03PS3) 10BryanDavis: wmcs: Update URL in comment in maintain_dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/1172434 (https://phabricator.wikimedia.org/T400421) [14:22:21] (03CR) 10CDanis: [C:03+1] haproxy: Temporary UA policy exemption for Automattic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:22:38] (03CR) 10CDanis: haproxy: Temporary UA policy exemption for Automattic [puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:22:45] (03CR) 10Ssingh: haproxy: Temporary UA policy exemption for Automattic (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:23:03] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm [14:23:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132086 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host maps2011.codfw.wmnet with OS bookworm [14:25:37] (03CR) 10Majavah: [C:03+2] wmcs: Update URL in comment in maintain_dbusers.py [puppet] - 10https://gerrit.wikimedia.org/r/1172434 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis) [14:29:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11132111 (10ssingh) >>! In T392851#11131759, @elukey wrote: > It seems that Virtualization cannot be disabled in the BIOS processor settings, see the Web UI: >... 
[14:30:31] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host maps2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:32:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11132121 (10elukey) @wiki_willy hi! We stumbled upon an issue in the Dell BIOS configs, namely it doesn't seem possible in this configuration to disable the Pro... [14:36:04] (03PS2) 10Vgutierrez: haproxy: Temporary UA policy exemption for Automattic [puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) [14:36:14] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host maps2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:36:18] (03CR) 10Vgutierrez: haproxy: Temporary UA policy exemption for Automattic (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:38:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132140 (10Jclark-ctr) Mine would not even reach Debian installer [14:39:07] (03CR) 10Ssingh: [C:03+1] "Thanks for the quick fix!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:39:36] (03PS1) 10Clément Goubert: site.pp: Fix insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1183124 (https://phabricator.wikimedia.org/T400485) [14:40:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host maps2012.codfw.wmnet with OS bookworm [14:40:59] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host maps2012.codfw.wmnet with OS bookworm [14:41:21] (03CR) 10Clément Goubert: [C:03+2] site.pp: Fix insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1183124 (https://phabricator.wikimedia.org/T400485) (owner: 10Clément Goubert) [14:41:22] (03CR) 10Scott French: [C:03+1] site.pp: Fix insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/1183124 (https://phabricator.wikimedia.org/T400485) (owner: 10Clément Goubert) [14:42:05] (03CR) 10Vgutierrez: [C:03+2] haproxy: Temporary UA policy exemption for Automattic [puppet] - 10https://gerrit.wikimedia.org/r/1183118 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:43:08] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11132177 (10Clement_Goubert) >>! In T400485#11132084, @Jhancock.wm wrote: > @Clement_Goubert i think their might be a mismatch regarding the site.pp on this one. I... 
[14:46:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:46:43] (03CR) 10Elukey: opensearch-operator: Add chart for review (2/3) (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:46:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132219 (10Jclark-ctr) 2012 looks to be working now [14:48:25] 06SRE, 10DNS, 06Traffic, 10wikimediafoundation.org, 07IPv6: wikimediafoundation.org does not support IPv6 - https://phabricator.wikimedia.org/T403269#11132222 (10ssingh) `wikimediafoundation.org`'s `A` record is set in our zone files, while `techblog` is a `CNAME` to `techblog-wikimedia-org.go-vip.net.`... [14:50:47] (03PS1) 10D3r1ck01: session: Enable MultiBackendSessionStore on `group0` wikis only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) [14:52:48] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:26] jhancock@cumin1003 reimage (PID 89112) is awaiting input [14:56:11] (03PS1) 10Bking: refinery: parameterize systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/1183136 (https://phabricator.wikimedia.org/T401116) [14:56:27] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183136 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking) [14:58:19] (03CR) 10Scott French: [C:03+1] rest-gateway: Introduce rest-gateway-ro [puppet] - 10https://gerrit.wikimedia.org/r/1182852 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [14:58:22] (03CR) 10Scott French: [C:03+1] "Yeah, this might be a good 
time to add a subsection on unusual changes like this to [0]. More generally, these docs could use some helpful" [dns] - 10https://gerrit.wikimedia.org/r/1182853 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [14:58:24] (03CR) 10Scott French: [C:03+1] rest-gateway: Switch rest-gateway to A/P [puppet] - 10https://gerrit.wikimedia.org/r/1183084 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [14:58:26] (03CR) 10Scott French: [C:03+1] wmnet: Switch rest-gateway to metafo [dns] - 10https://gerrit.wikimedia.org/r/1183085 (https://phabricator.wikimedia.org/T402412) (owner: 10Clément Goubert) [14:58:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273 (10phaultfinder) 03NEW [14:59:22] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2012.codfw.wmnet with reason: host reimage [14:59:39] (03CR) 10Bking: opensearch-operator: Add chart for review (2/3) (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:00:34] (03PS2) 10Tiziano Fogli: monitoring services: add migration task T315866 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183130 (https://phabricator.wikimedia.org/T395443) [15:00:47] (03PS2) 10Tiziano Fogli: monitoring services: add migration task T370157 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183129 (https://phabricator.wikimedia.org/T395443) [15:01:04] (03PS2) 10Tiziano Fogli: pdb_resource_exporter: add check_prometheus tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1183131 (https://phabricator.wikimedia.org/T395442) [15:01:12] (03PS1) 10Tiziano Fogli: icinga/audit: fix Check_prometheus query [software] - 10https://gerrit.wikimedia.org/r/1183133 (https://phabricator.wikimedia.org/T395443) [15:01:14] (03CR) 10Tiziano Fogli: [C:03+2] icinga/audit: fix Check_prometheus query [software] - 
10https://gerrit.wikimedia.org/r/1183133 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [15:01:25] (03CR) 10Bking: opensearch-operator: Add chart for review (2/3) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:01:34] (03PS1) 10Tiziano Fogli: check_prometheus: add migration task param [puppet] - 10https://gerrit.wikimedia.org/r/1183126 (https://phabricator.wikimedia.org/T395443) [15:01:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183136 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking) [15:01:46] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T309012 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183128 (https://phabricator.wikimedia.org/T395443) [15:01:58] (03PS1) 10Tiziano Fogli: monitoring services: add migration task T370153 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183127 (https://phabricator.wikimedia.org/T395443) [15:03:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2012.codfw.wmnet with reason: host reimage [15:03:57] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275 (10phaultfinder) 03NEW [15:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - 
https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:06:40] (03CR) 10Xcollazo: [C:03+1] refinery: parameterize systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/1183136 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking) [15:08:15] (03CR) 10Bking: [C:03+2] refinery: parameterize systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/1183136 (https://phabricator.wikimedia.org/T401116) (owner: 10Bking) [15:08:19] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11132283 (10Vgutierrez) >>! In T400119#11131904, @CKoerner_WMF wrote: > @joe Could I ask for a two week exemption for diff.wikimedia.org until we have our... [15:09:44] (03CR) 10Elukey: "I like it! One nit and we should be good to go!" [puppet] - 10https://gerrit.wikimedia.org/r/1182886 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron) [15:11:19] (03CR) 10Elukey: [C:03+1] "Adding also @mvolz@wikimedia.org for awareness. 
We are going to test a new Pyrra config for Citoid, and this will clear past data from the" [puppet] - 10https://gerrit.wikimedia.org/r/1182898 (https://phabricator.wikimedia.org/T400073) (owner: 10Herron) [15:12:01] (03CR) 10Samuel (WMF): [C:03+1] hCaptcha: Disable hCaptcha for API contexts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183109 (https://phabricator.wikimedia.org/T403263) (owner: 10Kosta Harlan) [15:17:36] (03PS3) 10Jforrester: Provide Node 22 image, using thirdparty/node22, based on Trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182837 (https://phabricator.wikimedia.org/T393437) [15:18:20] (03CR) 10Jforrester: "PS3: Removed the `ln -s /bin/node /bin/nodejs` step (already provided upstream); explicitly list the versions created as of current built " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182837 (https://phabricator.wikimedia.org/T393437) (owner: 10Jforrester) [15:18:31] (03CR) 10Jforrester: [C:03+1] "Built successfully locally. This is good to go from my end." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182837 (https://phabricator.wikimedia.org/T393437) (owner: 10Jforrester) [15:19:37] (03PS8) 10Herron: profile::pyrra::filesystem::slo: add new slo define [puppet] - 10https://gerrit.wikimedia.org/r/1182886 (https://phabricator.wikimedia.org/T349521) [15:19:59] (03CR) 10Herron: profile::pyrra::filesystem::slo: add new slo define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182886 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron) [15:21:14] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:21:16] (03CR) 10Herron: [C:03+1] monitoring: disable icinga disk space check [puppet] - 10https://gerrit.wikimedia.org/r/1180642 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [15:21:26] (03CR) 10Herron: [C:03+1] monitoring services: add migration task T315866 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183130 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [15:21:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:21:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2012.codfw.wmnet with OS bookworm [15:21:38] (03CR) 10Herron: [C:03+1] monitoring services: add migration task T370157 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183129 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [15:21:43] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132365 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host maps2012.codfw.wmnet with OS bookworm completed: - maps2012 (**PASS**) - Rem... 
[15:22:02] (03CR) 10Herron: [C:03+1] monitoring services: add migration task T309012 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183128 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [15:22:15] (03CR) 10Herron: [C:03+1] monitoring services: add migration task T370153 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183127 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [15:22:32] (03PS1) 10Muehlenhoff: Line-wrap Homer diffs [puppet] - 10https://gerrit.wikimedia.org/r/1183140 [15:22:50] (03PS2) 10Muehlenhoff: Line-wrap Homer diffs [puppet] - 10https://gerrit.wikimedia.org/r/1183140 [15:22:53] (03CR) 10Herron: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1183126 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [15:23:40] (03CR) 10Elukey: [C:03+1] profile::pyrra::filesystem::slo: add new slo define [puppet] - 10https://gerrit.wikimedia.org/r/1182886 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron) [15:24:10] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host maps2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:24:48] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host maps2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:27:05] (03CR) 10Herron: [C:03+1] pdb_resource_exporter: add check_prometheus tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1183131 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [15:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:30:34] (03CR) 10Muehlenhoff: [C:03+2] Provide 
Node 22 image, using thirdparty/node22, based on Trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182837 (https://phabricator.wikimedia.org/T393437) (owner: 10Jforrester) [15:30:35] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Provide Node 22 image, using thirdparty/node22, based on Trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182837 (https://phabricator.wikimedia.org/T393437) (owner: 10Jforrester) [15:34:21] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11132405 (10ssingh) Thanks for the feedback, @TheDJ. It's been a while since I wrote a Firefox extension so I may be mistaken if a referrer header can be set (that is,... [15:35:27] (03PS1) 10Jforrester: tables-catalogue: List wikifunctionsclient_usage [puppet] - 10https://gerrit.wikimedia.org/r/1183141 [15:35:44] (03CR) 10CI reject: [V:04-1] tables-catalogue: List wikifunctionsclient_usage [puppet] - 10https://gerrit.wikimedia.org/r/1183141 (owner: 10Jforrester) [15:35:45] (03Abandoned) 10Jforrester: tables-catalogue: List wikifunctionsclient_usage [puppet] - 10https://gerrit.wikimedia.org/r/1183141 (owner: 10Jforrester) [15:36:02] (03PS1) 10Jforrester: tables-catalog: Add table for (deprecated) ShortUrl extension [puppet] - 10https://gerrit.wikimedia.org/r/1183142 (https://phabricator.wikimedia.org/T399302) [15:38:52] (03CR) 10CI reject: [V:04-1] tables-catalog: Add table for (deprecated) ShortUrl extension [puppet] - 10https://gerrit.wikimedia.org/r/1183142 (https://phabricator.wikimedia.org/T399302) (owner: 10Jforrester) [15:40:15] (03PS1) 10Daimona Eaytoy: Configure high-risk countries for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183144 (https://phabricator.wikimedia.org/T402353) [15:40:54] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11132420 (10Jgreen) 
a:05Jgreen→03Jhancock.wm [15:41:27] (03PS2) 10Jforrester: tables-catalog: Add table for (deprecated) ShortUrl extension [puppet] - 10https://gerrit.wikimedia.org/r/1183142 (https://phabricator.wikimedia.org/T399302) [15:41:33] (03CR) 10CI reject: [V:04-1] Configure high-risk countries for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183144 (https://phabricator.wikimedia.org/T402353) (owner: 10Daimona Eaytoy) [15:42:41] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:43:24] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:44:30] (03CR) 10CI reject: [V:04-1] tables-catalog: Add table for (deprecated) ShortUrl extension [puppet] - 10https://gerrit.wikimedia.org/r/1183142 (https://phabricator.wikimedia.org/T399302) (owner: 10Jforrester) [15:44:44] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host maps2013.codfw.wmnet with OS bookworm [15:44:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132455 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host maps2013.codfw.wmnet with OS bookworm [15:45:01] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host maps2014.codfw.wmnet with OS bookworm [15:45:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132457 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host maps2014.codfw.wmnet with OS bookworm [15:45:29] (03PS3) 10Jforrester: tables-catalog: Add table for (deprecated) ShortUrl extension [puppet] - 
10https://gerrit.wikimedia.org/r/1183142 (https://phabricator.wikimedia.org/T399302) [15:47:17] (03PS2) 10Daimona Eaytoy: Configure high-risk countries for CampaignEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183144 (https://phabricator.wikimedia.org/T402353) [15:50:05] 06SRE, 06Traffic: Setting up Wikimedia Trust and Safety Help Center with Zendesk product: Seeking Guidance on host mapping - https://phabricator.wikimedia.org/T400952#11132472 (10ssingh) Hi @JAbrams: Sorry for the lack of a response. This and a bunch of other related tickets were not addressed and we are makin... [15:54:14] just for the record: I’m running a maintenance script (CentralAuth:FixRenameUserLocalLogs, for T398177) which turned out to be more inefficient than expected [15:54:15] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [15:54:16] if it seems to be causing any performance issues, feel absolutely free to stop it (it’s the mw-script.eqiad.vl951qet k8s job) – it’s a dry-run so should be safe to kill in any way [15:54:36] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host maps2011.codfw.wmnet with OS bookworm [15:54:36] (it will probably not finish before I get back to my desk next week) [15:54:45] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132480 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host maps2011.codfw.wmnet with OS bookworm executed with errors: - maps2011 (**FA... [15:57:13] (the script mainly reads the logging table, and maybe actor/comment/ other stuff that gets joined to it, so that would be the place to look for slow queries. 
but so far logstash looks fine to me) [16:01:00] 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11132490 (10Andrew) [16:02:55] !log jhancock@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2013.codfw.wmnet with reason: host reimage [16:03:28] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm [16:03:36] !log jhancock@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2014.codfw.wmnet with reason: host reimage [16:03:38] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132509 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host maps2011.codfw.wmnet with OS bookworm [16:05:33] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11132514 (10taavi) [16:05:35] (03CR) 10BCornwall: [V:03+2 C:03+2] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:05:57] (03CR) 10BCornwall: [V:03+2 C:03+1] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:08:50] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2013.codfw.wmnet with reason: host reimage [16:12:48] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2014.codfw.wmnet with reason: host reimage [16:13:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:08] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11132525 (10Krinkle) [16:16:14] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking), 07User-notice: RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11132527 (10Krinkle) [16:16:26] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11132528 (10elukey) I tried to test all the db connections for maps-test hosts, and all the postgres replicas (so all the hosts except maps-test2001 that is the master node) have thei... [16:20:14] (03PS13) 10Krinkle: varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 [16:22:10] !log jhancock@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [16:22:57] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11132534 (10Rohitverma9625) Hello everyone, I'm a contributor to the Wikimedia Commons Android App. The Commons app recently started facing some API issues... [16:24:12] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [16:24:21] (03CR) 10Dreamy Jazz: [C:03+1] "Logic makes sense to me, though I've not tested it locally to be sure." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183109 (https://phabricator.wikimedia.org/T403263) (owner: 10Kosta Harlan) [16:26:28] !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [16:26:29] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2013.codfw.wmnet with OS bookworm [16:26:31] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11132540 (10CDanis) >>! In T400119#11132534, @Rohitverma9625 wrote: > Hello everyone, I'm a contributor to the Wikimedia Commons Android App. The Commons a... [16:26:39] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132541 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host maps2013.codfw.wmnet with OS bookworm completed: - maps2013 (**PASS**) - R... 
[16:27:07] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps2011.codfw.wmnet with reason: host reimage [16:30:18] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [16:31:19] !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [16:31:20] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2014.codfw.wmnet with OS bookworm [16:31:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132558 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host maps2014.codfw.wmnet with OS bookworm completed: - maps2014 (**PASS**) - R... [16:35:59] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11132563 (10Rohitverma9625) Hi @CDanis, thanks for the quick response. We are facing this issue for some URLs as other parts just fetch data correctly. One... 
[16:37:39] (03PS3) 10Bernard Wang: Remove deprecated search config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182875 (https://phabricator.wikimedia.org/T402208) [16:42:39] (03CR) 10Jdlrobson: Remove deprecated search config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182875 (https://phabricator.wikimedia.org/T402208) (owner: 10Bernard Wang) [16:44:47] !log jhancock@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [16:45:23] !log jhancock@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1002" [16:45:24] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps2011.codfw.wmnet with OS bookworm [16:45:39] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132640 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host maps2011.codfw.wmnet with OS bookworm completed: - maps2011 (**PASS**) - R... [16:45:49] (03CR) 10Cwhite: [C:03+2] monitoring: disable icinga disk space check [puppet] - 10https://gerrit.wikimedia.org/r/1180642 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [16:46:50] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11132642 (10Rohitverma9625) This is one more URL that fails with same error: <-- 403 https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Beautiful_C... 
[16:46:52] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132643 (10Jhancock.wm) [16:47:57] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132646 (10Jhancock.wm) 05Open→03Resolved @MoritzMuehlenhoff looks like we got everything cleared up. all yours! [16:49:26] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host deploy2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:49:29] (03CR) 10Ladsgroup: [C:04-1] "This table exists only in a subset of wikis: https://phabricator.wikimedia.org/P77827$8" [puppet] - 10https://gerrit.wikimedia.org/r/1183142 (https://phabricator.wikimedia.org/T399302) (owner: 10Jforrester) [16:49:40] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11132651 (10CKoerner_WMF) >>! In T400119#11132283, @Vgutierrez wrote: >>>! In T400119#11131904, @CKoerner_WMF wrote: >> @joe Could I ask for a two week exe... 
[16:50:22] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host deploy2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:51:53] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host deploy2003.codfw.wmnet with OS bookworm [16:52:05] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11132654 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm [16:52:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11132655 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1002 for host deploy2003.codfw.wmnet with OS bookworm executed with errors: - deploy2003 (**... [17:04:11] (03PS14) 10Krinkle: varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 [17:05:42] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11132702 (10CDanis) Thank you, that helps a lot. I notice that there's two different HTTP clients used inside the app -- and only one of them sets User-Ag... [17:06:05] (03PS1) 10Jdlrobson: Restore ext.visualEditor.track module [extensions/VisualEditor] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183154 (https://phabricator.wikimedia.org/T403127) [17:10:46] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11132723 (10Rohitverma9625) > I notice that there's two different HTTP clients used inside the app -- and only one of them sets User-Agent on outbound requ... [17:14:32] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you." 
[puppet] - 10https://gerrit.wikimedia.org/r/1183131 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [17:14:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11132747 (10RobH) I'll open a ticket with support about this next week and followup with both support and our account team. [17:17:50] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1183126 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [17:18:32] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1183127 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [17:18:52] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1183128 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [17:19:05] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1183129 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [17:19:18] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1183130 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [17:22:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11132767 (10wiki_willy) Thanks @RobH. Our account team has changed quite a bit, but you can follow up with Hossam and Dawn after creating the support ticket >... [17:29:52] 06SRE, 06Traffic, 10Wikidata, 10Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959#11132802 (10CDanis) Hi @Lydia_Pintscher , SRE can make some exception here. It seems warranted given the statu... 
[17:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:53:43] (03PS1) 10CDanis: Exempt query.wikidata.org from U-A policy [puppet] - 10https://gerrit.wikimedia.org/r/1183161 (https://phabricator.wikimedia.org/T400119) [17:55:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11132908 (10MoritzMuehlenhoff) Thanks! [18:04:36] (03CR) 10Scott French: [C:03+1] Exempt query.wikidata.org from U-A policy [puppet] - 10https://gerrit.wikimedia.org/r/1183161 (https://phabricator.wikimedia.org/T400119) (owner: 10CDanis) [18:07:40] (03PS2) 10CDanis: Exempt query.wikidata.org from U-A policy [puppet] - 10https://gerrit.wikimedia.org/r/1183161 (https://phabricator.wikimedia.org/T400119) [18:14:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:20:39] (03CR) 10Scott French: [C:03+1] Exempt query.wikidata.org from U-A policy [puppet] - 10https://gerrit.wikimedia.org/r/1183161 (https://phabricator.wikimedia.org/T400119) (owner: 10CDanis) [18:46:16] (03CR) 10Eevans: [C:03+2] fix cookbook names in example text [cookbooks] - 10https://gerrit.wikimedia.org/r/1165161 (owner: 10Eevans) [18:49:21] (03PS3) 10CDanis: Exempt 
query.wikidata.org from U-A policy [puppet] - 10https://gerrit.wikimedia.org/r/1183161 (https://phabricator.wikimedia.org/T400119) [18:49:32] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183161 (https://phabricator.wikimedia.org/T400119) (owner: 10CDanis) [18:53:08] (03Merged) 10jenkins-bot: fix cookbook names in example text [cookbooks] - 10https://gerrit.wikimedia.org/r/1165161 (owner: 10Eevans) [18:57:39] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298 (10Urbanecm) 03NEW [18:58:43] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11133016 (10Urbanecm) #sre: Would you mind confirming the appropriate IPs or ranges that would need to be allowlisted to include analytics cluster? #wikimedia_enterp... [18:59:38] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11133018 (10Urbanecm) This task was created based on a discussion with @prabhat. 
[19:03:55] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11133022 (10phaultfinder) [19:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:08:55] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11133028 (10phaultfinder) [19:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:47:46] (03CR) 10CDanis: [C:03+1] webperf: Move NavtimingStaleBeacon alert from per-dc to global [alerts] - 10https://gerrit.wikimedia.org/r/1105087 (owner: 10Krinkle) [19:48:16] (03PS4) 10CDanis: haproxy: Exempt query.wikidata.org from U-A policy [puppet] - 10https://gerrit.wikimedia.org/r/1183161 (https://phabricator.wikimedia.org/T400119) [19:48:21] (03CR) 10CDanis: [C:03+2] haproxy: Exempt query.wikidata.org from U-A policy [puppet] - 10https://gerrit.wikimedia.org/r/1183161 (https://phabricator.wikimedia.org/T400119) (owner: 10CDanis) [19:48:24] (03PS1) 10Cwhite: profile: remove disk space check [puppet] - 
https://gerrit.wikimedia.org/r/1183174 (https://phabricator.wikimedia.org/T332764)
[19:49:00] (CR) CDanis: [C:+2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1183161 (https://phabricator.wikimedia.org/T400119) (owner: CDanis)
[19:51:22] (CR) Cwhite: [C:+2] profile: remove disk space check [puppet] - https://gerrit.wikimedia.org/r/1183174 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite)
[19:51:37] cwhite: okay to merge yours?
[19:51:46] cdanis: yes please
[19:51:58] 👍
[19:53:04] done
[19:55:03] thanks!
[20:10:43] SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11133146 (tfmorris) Note that returning a plain text error message to a JSON API request is going to make it very likely to get swallowed by a JSON parse...
[20:14:35] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:28:04] SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11133180 (Tgr) >>! In T400119#11133146, @tfmorris wrote: > Note that returning a plain text error message to a JSON API request is going to make it very...
[21:12:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:15:35] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:17:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:19:50] (CR) Dr0ptp4kt: "Question for @aotto@wikimedia.org and @kherron@wikimedia.org ." [puppet] - https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: Vgutierrez)
[21:23:38] vriley@cumin1003 provision (PID 128296) is awaiting input
[21:24:13] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:26:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:30:25] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host maps1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:33:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:34:27] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:37:32] vriley@cumin1003 provision (PID 130973) is awaiting input
[21:39:40] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:40:22] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:42:13] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:42:45] !log vriley@cumin1003 START - Cookbook sre.dns.netbox
[21:44:28] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:47:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:48:58] vriley@cumin1003 netbox (PID 131234) is awaiting input
[21:50:10] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt maps1013 - vriley@cumin1003"
[21:50:15] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt maps1013 - vriley@cumin1003"
[21:50:15] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:53:06] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host maps1013
[21:54:11] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps1013
[21:55:19] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host maps1013.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:03:58] vriley@cumin1003 provision (PID 133656) is awaiting input
[22:04:08] (CR) Cwhite: [C:+1] pdb_resource_exporter: add check_prometheus tasks query [puppet] - https://gerrit.wikimedia.org/r/1183131 (https://phabricator.wikimedia.org/T395442) (owner: Tiziano Fogli)
[22:05:10] (CR) Cwhite: [C:+1] check_prometheus: add migration task param [puppet] - https://gerrit.wikimedia.org/r/1183126 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli)
[22:05:27] (CR) Cwhite: [C:+1] monitoring services: add migration task T370153 to instances [puppet] - https://gerrit.wikimedia.org/r/1183127 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli)
[22:05:43] (CR) Cwhite: [C:+1] monitoring services: add migration task T309012 to instances [puppet] - https://gerrit.wikimedia.org/r/1183128 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli)
[22:06:08] (CR) Cwhite: [C:+1] monitoring services: add migration task T370157 to instances [puppet] - 
https://gerrit.wikimedia.org/r/1183129 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli)
[22:06:31] (CR) Cwhite: [C:+1] monitoring services: add migration task T315866 to instances [puppet] - https://gerrit.wikimedia.org/r/1183130 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli)
[22:14:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:18:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:19:30] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11133390 (VRiley-WMF) es1049 - rack A5, U06 es1050 - rack B6, U30 es1051 - rack D1, U11 es1052 - rack D3, U08 es1053 - rack D6, U09 es1054 - rack A1, U09 es1055 - rack A3, U10...
[22:20:25] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11133391 (VRiley-WMF)
[22:23:59] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11133397 (VRiley-WMF)
[22:32:08] vriley@cumin1003 provision (PID 133656) is awaiting input
[22:35:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:41:26] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[22:41:44] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[22:41:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T402925)', diff saved to https://phabricator.wikimedia.org/P82119 and previous config saved to /var/cache/conftool/dbconfig/20250829-224151-ladsgroup.json
[22:41:57] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[22:42:40] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on es2049.codfw.wmnet with reason: Being provisioned
[22:45:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:48:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T402925)', diff saved to https://phabricator.wikimedia.org/P82120 and previous config saved to /var/cache/conftool/dbconfig/20250829-224802-ladsgroup.json
[22:48:13] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[23:03:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P82121 and previous config saved to /var/cache/conftool/dbconfig/20250829-230309-ladsgroup.json
[23:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[23:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:08:55] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11133438 (phaultfinder)
[23:12:20] (CR) Bartosz Dziewoński: [C:+1] session: Enable MultiBackendSessionStore on `group0` wikis only [mediawiki-config] - https://gerrit.wikimedia.org/r/1183132 (https://phabricator.wikimedia.org/T402808) (owner: D3r1ck01)
[23:13:51] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11133441 (phaultfinder)
[23:18:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to 
https://phabricator.wikimedia.org/P82122 and previous config saved to /var/cache/conftool/dbconfig/20250829-231817-ladsgroup.json
[23:19:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:33:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T402925)', diff saved to https://phabricator.wikimedia.org/P82123 and previous config saved to /var/cache/conftool/dbconfig/20250829-233324-ladsgroup.json
[23:33:31] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[23:33:41] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[23:33:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1170 (T402925)', diff saved to https://phabricator.wikimedia.org/P82124 and previous config saved to /var/cache/conftool/dbconfig/20250829-233348-ladsgroup.json
[23:38:21] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1183204
[23:38:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1183204 (owner: TrainBranchBot)
[23:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to 
eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:53:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1183204 (owner: TrainBranchBot)