[00:08:19] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1192993
[00:08:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1192993 (owner: TrainBranchBot)
[00:13:18] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/Translate] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192992 (https://phabricator.wikimedia.org/T402967) (owner: MusikAnimal)
[00:14:40] (Merged) jenkins-bot: Increase timeout for MessageIndex lock [extensions/Translate] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192992 (https://phabricator.wikimedia.org/T402967) (owner: MusikAnimal)
[00:15:45] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1192992|Increase timeout for MessageIndex lock (T402967)]]
[00:15:52] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[00:20:07] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 4 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[00:22:17] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1192992|Increase timeout for MessageIndex lock (T402967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:22:23] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[00:22:49] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[00:23:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[00:27:58] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1192993 (owner: TrainBranchBot)
[00:28:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:29:15] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192992|Increase timeout for MessageIndex lock (T402967)]] (duration: 13m 30s)
[00:29:22] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[00:36:26] (PS1) MusikAnimal: WishStore: don't use virtual domain when querying for actor ID [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192999 (https://phabricator.wikimedia.org/T402967)
[00:36:45] FIRING: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:43:13] ops-eqiad, SRE, DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11235436 (VRiley-WMF) Shipped back the PDU with packing slip{F66718095}
[00:43:21] ops-eqiad, SRE, DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11235437 (VRiley-WMF) Open→Resolved
[00:44:21] ops-eqiad, SRE, DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11235440 (VRiley-WMF) All fibers for ssw1-d8-eqiad have been updated with CableIDs in Netbox. Also plugged them into the switch in D8 and will connect the other sides soon. Will also need to...
[00:48:42] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192999 (https://phabricator.wikimedia.org/T402967) (owner: MusikAnimal)
[00:50:15] (Merged) jenkins-bot: WishStore: don't use virtual domain when querying for actor ID [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1192999 (https://phabricator.wikimedia.org/T402967) (owner: MusikAnimal)
[00:50:53] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1192999|WishStore: don't use virtual domain when querying for actor ID (T402967)]]
[00:51:00] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[00:53:13] (CR) Scardenasmolinar: [C:+1] Enable AutoModerator on Italian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1192921 (https://phabricator.wikimedia.org/T405152) (owner: Kgraessle)
[00:56:45] RESOLVED: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:57:20] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1192999|WishStore: don't use virtual domain when querying for actor ID (T402967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:57:27] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[00:57:45] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[01:02:07] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192999|WishStore: don't use virtual domain when querying for actor ID (T402967)]] (duration: 11m 14s)
[01:02:40] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:09:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192326 (https://phabricator.wikimedia.org/T405999) (owner: EggRoll97)
[01:13:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:16:18] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 38s)
[01:34:34] (PS1) MusikAnimal: FocusAreaStore: use virtual DB connection when counting wishes [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193002 (https://phabricator.wikimedia.org/T402967)
[01:36:25] FIRING: [28x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:44:52] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:46:25] FIRING: [28x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:47:44] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193002 (https://phabricator.wikimedia.org/T402967) (owner: MusikAnimal)
[01:49:22] (Merged) jenkins-bot: FocusAreaStore: use virtual DB connection when counting wishes [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193002 (https://phabricator.wikimedia.org/T402967) (owner: MusikAnimal)
[01:49:59] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1193002|FocusAreaStore: use virtual DB connection when counting wishes (T402967)]]
[01:50:05] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[01:56:34] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1193002|FocusAreaStore: use virtual DB connection when counting wishes (T402967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[01:56:41] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[01:57:26] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[02:01:55] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:02:24] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193002|FocusAreaStore: use virtual DB connection when counting wishes (T402967)]] (duration: 12m 25s)
[02:02:30] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[02:04:45] (PS1) MusikAnimal: Enable debug logging for CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1193003 (https://phabricator.wikimedia.org/T402967)
[02:06:41] (PS2) MusikAnimal: Enable debug logging for CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1193003 (https://phabricator.wikimedia.org/T402967)
[02:08:26] (CR) Samwilson: [C:+1] Enable debug logging for CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1193003 (https://phabricator.wikimedia.org/T402967) (owner: MusikAnimal)
[02:12:33] (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1193003 (https://phabricator.wikimedia.org/T402967) (owner: MusikAnimal)
[02:13:16] (Merged) jenkins-bot: Enable debug logging for CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1193003 (https://phabricator.wikimedia.org/T402967) (owner: MusikAnimal)
[02:13:50] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1193003|Enable debug logging for CommunityRequests (T402967)]]
[02:13:57] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[02:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[02:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:20:47] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1193003|Enable debug logging for CommunityRequests (T402967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:20:54] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[02:22:46] !log musikanimal@deploy2002 musikanimal: Continuing with sync
[02:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:27:37] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193003|Enable debug logging for CommunityRequests (T402967)]] (duration: 13m 47s)
[02:27:45] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[02:55:38] !log musikanimal@deploy2002 mwscript-k8s job started: extensions/CommunityRequests/maintenance/migrateFromGadget.php --wiki=metawiki --status-csv=wishes-status-migration.csv --wishes # T402967
[02:55:45] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[02:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:59:03] FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:01:59] (CR) Clare Ming: EventStreamConfig and stream registration for watchlist click tracking (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) (owner: Samtar)
[03:02:13] (CR) Clare Ming: [C:+1] EventStreamConfig and stream registration for watchlist click tracking [mediawiki-config] - https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) (owner: Samtar)
[03:06:41] (PS2) Herron: vopsbot: switch rotation for 247 oncall [puppet] - https://gerrit.wikimedia.org/r/1192957
[03:08:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[03:09:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[03:13:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[03:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:43:51] !log musikanimal@deploy2002 mwscript-k8s job started: extensions/CommunityRequests/maintenance/migrateFromGadget.php --wiki=metawiki --status-csv=wishes-status-migration.csv --wishes # T402967
[03:43:58] T402967: Deploy CommunityRequests extension to prod - https://phabricator.wikimedia.org/T402967
[03:46:10] SRE, Traffic-Icebox, MobileFrontend (Tracking): QA features on the new mobile URLs - https://phabricator.wikimedia.org/T403638#11235597 (Koavf) This evidently causes some issues for editors trying to get around Mainland China censorship: https://en.wiktionary.org/wiki/Wiktionary:Feedback#What_happene...
[04:08:49] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (gitlab1004), Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:26:06] (PS1) Awight: Nasty fix for main ref change in main+details [extensions/Cite] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193009 (https://phabricator.wikimedia.org/T406002)
[04:36:25] FIRING: [28x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:41:25] FIRING: [28x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:58:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:09:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:17:33] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:18:23] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30039 bytes in 0.212 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:25:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[05:27:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[05:35:55] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[05:35:55] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[05:39:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:44:52] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:55:39] (CR) Thiemo Kreuz (WMDE): [C:+1] Nasty fix for main ref change in main+details [extensions/Cite] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193009 (https://phabricator.wikimedia.org/T406002) (owner: Awight)
[05:55:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[05:58:26] (PS3) Krinkle: Disable wmgUseMdotRouting on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1192279 (https://phabricator.wikimedia.org/T403510)
[05:58:27] (PS2) Krinkle: Disable wmgUseMdotRouting on id, fr, de, es, ru, and ja.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1192653 (https://phabricator.wikimedia.org/T403510)
[05:58:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T0600)
[06:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T0600).
[06:01:32] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192279 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[06:01:32] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192653 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[06:01:55] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2035:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2035 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:02:22] (Merged) jenkins-bot: Disable wmgUseMdotRouting on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1192279 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[06:02:25] (Merged) jenkins-bot: Disable wmgUseMdotRouting on id, fr, de, es, ru, and ja.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1192653 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[06:02:31] PROBLEM - Ensure traffic_manager is running for instance backend on cp4043 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[06:03:01] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1192279|Disable wmgUseMdotRouting on Commons (T403510)]], [[gerrit:1192653|Disable wmgUseMdotRouting on id, fr, de, es, ru, and ja.wikipedia (T403510)]]
[06:03:07] T403510: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510
[06:03:31] RECOVERY - Ensure traffic_manager is running for instance backend on cp4043 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[06:05:55] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[06:05:55] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[06:09:55] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1192279|Disable wmgUseMdotRouting on Commons (T403510)]], [[gerrit:1192653|Disable wmgUseMdotRouting on id, fr, de, es, ru, and ja.wikipedia (T403510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[06:10:01] T403510: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510
[06:11:03] !incidents
[06:11:04] 6820 (UNACKED) wmf - metamonitoring - thanos - notified - vip is now DOWN
[06:11:04] 6819 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Another test page
[06:11:04] 6818 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Test page
[06:11:04] 6815 (RESOLVED) [25x] ProbeDown sre (ip4 probes/service eqiad)
[06:11:05] 6817 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[06:11:05] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN
[06:11:05] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[06:11:05] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[06:11:05] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[06:11:06] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[06:11:06] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[06:11:17] !ack 6820
[06:11:18] 6820 (ACKED) wmf - metamonitoring - thanos - notified - vip is now DOWN
[06:12:57] * Emperor tries to find anything about this metamonitoring
[06:13:46] I think that might be related to the pa.ge yesterday during the wikikube upgrade and the downtime just ended, let's see
[06:14:03] https://metamonitoring.wikimedia.org/thanos/deadmanswitchnotified says No instances with outdated timestamp have been detected.
[06:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[06:15:27] so there is an active silence for the DeadManSwitch alert, which expires in 9 days
[06:17:17] * Emperor finds https://wikitech.wikimedia.org/wiki/Prometheus#Troubleshooting
[06:18:17] !incidents
[06:18:17] 6820 (ACKED) wmf - metamonitoring - thanos - notified - vip is now DOWN
[06:18:17] 6819 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Another test page
[06:18:18] 6818 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Test page
[06:18:18] 6815 (RESOLVED) [25x] ProbeDown sre (ip4 probes/service eqiad)
[06:18:18] 6817 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[06:18:18] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN
[06:18:19] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[06:18:19] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[06:18:19] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[06:18:20] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[06:18:20] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[06:19:34] there is a stacktrace on alert1002 in the log of
[06:19:34] journalctl -u metamonitoring_public_endpoint.service
[06:20:30] jelto: from when?
[06:20:45] I see it stopping and starting around 06:19:45
[06:21:01] exactly at the time of the page
[06:21:01] Oct 02 06:09:41 alert1002 gunicorn[3038713]: Traceback (most recent call last):
[06:21:14] and it resolved, also tappof is looking in -observability
[06:21:40] !log krinkle@deploy2002 krinkle: Continuing with sync
[06:21:41] ah, right, got it
[06:22:23] !incidents
[06:22:23] 6820 (RESOLVED) wmf - metamonitoring - thanos - notified - vip is now DOWN
[06:22:24] 6819 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Another test page
[06:22:24] 6818 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Test page
[06:22:24] 6815 (RESOLVED) [25x] ProbeDown sre (ip4 probes/service eqiad)
[06:22:24] 6817 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[06:22:24] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN
[06:22:25] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[06:22:25] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[06:22:26] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[06:22:26] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[06:22:26] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[06:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:23:58] I'll continue with my breakfast and poke a bit in -observability, it might be related to the silence 0fcc6f25-3881-4073-a0c5-cdec04290212 which silences the DeadManSwitch
[06:24:13] Emperor: everything alright?
[06:24:31] Krinkle: alert resolved, I gather tappof is looking at the deadmanswitch
[06:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:25:24] k
[06:25:39] ugh, this is rather before my alarm normally goes off, I'm going to lie down again for a smidge unless something else fires.
[06:25:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[06:26:01] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192279|Disable wmgUseMdotRouting on Commons (T403510)]], [[gerrit:1192653|Disable wmgUseMdotRouting on id, fr, de, es, ru, and ja.wikipedia (T403510)]] (duration: 23m 01s)
[06:26:08] T403510: [Main Rollout] Enable unified mobile routing on remaining wikis - https://phabricator.wikimedia.org/T403510
[06:26:53] (CR) Joal: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1192576 (https://phabricator.wikimedia.org/T405783) (owner: CDanis)
[06:26:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[06:33:51] Yes sure, I'm around
[06:35:00] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11235768 (Krinkle)
[06:35:55] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[06:35:55] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[06:42:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:47:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:54:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[06:55:26] (PS1) Arnaudb: gerrit: reduce parallel mod_qos connections [puppet] - https://gerrit.wikimedia.org/r/1193013
[06:57:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[06:58:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:59:03] FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T0700)
[07:00:05] EggRoll97: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:03] (PS1) Elukey: charts: add separate requests/limits for tegola's cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/1193015 (https://phabricator.wikimedia.org/T381565)
[07:03:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:04:01] !incidents
[07:04:02] 6821 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Yet another test page
[07:04:02] 6820 (RESOLVED) wmf - metamonitoring - thanos - notified - vip is now DOWN
[07:04:02] 6819 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Another test page
[07:04:02] 6818 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Test page
[07:04:02] 6815 (RESOLVED) [25x] ProbeDown sre (ip4 probes/service eqiad)
[07:04:03] 6817 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[07:04:03] 6814 (RESOLVED) wmf - metamonitoring - prometheus - notified - vip is now DOWN
[07:04:03] 6816 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[07:04:04] 6813 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[07:04:05] 6810 (RESOLVED) ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw) [07:04:05] 6812 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [07:04:05] 6811 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [07:05:33] (03PS5) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 [07:05:55] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:05:55] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:06:54] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2048.codfw.wmnet'] [07:07:09] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2048.codfw.wmnet'] [07:07:23] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:11:10] (03PS2) 10Elukey: charts: add separate requests/limits for tegola's cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193015 (https://phabricator.wikimedia.org/T381565) [07:11:29] (03PS1) 10Hashar: Add a banner for a Gerrit switch over maintenance [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1193017 (https://phabricator.wikimedia.org/T387833) [07:12:20] (03CR) 10CI reject: [V:04-1] Add a banner for a Gerrit switch over maintenance [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1193017 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [07:12:31] sorry im a bit late into the deployment window, are any of the deployers for the 
window still around [07:16:52] (03PS2) 10Hashar: Add a banner for a Gerrit switch over maintenance [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1193017 (https://phabricator.wikimedia.org/T387833) [07:19:04] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:19:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1192815 (owner: 10Slyngshede) [07:21:50] EggRoll97: hi yes I am around [07:21:54] jouncebot: now [07:21:55] For the next 0 hour(s) and 38 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T0700) [07:23:17] (03PS6) 10Elukey: WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 [07:26:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:27:10] (03CR) 10Slyngshede: [C:03+2] data.yaml: offboarding bvershbow [puppet] - 10https://gerrit.wikimedia.org/r/1192815 (owner: 10Slyngshede) [07:27:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:28:20] hashar: EggRoll97: I have a wmf.21 backport I would also do now, please ping me when you're done. [07:28:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192326 (https://phabricator.wikimedia.org/T405999) (owner: 10EggRoll97) [07:28:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [07:28:59] awight: sure! 
please add it to the calendar at https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T0700 :) [07:29:14] (03Merged) 10jenkins-bot: Add abusefilter-modify-restricted to enwiki EFM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192326 (https://phabricator.wikimedia.org/T405999) (owner: 10EggRoll97) [07:29:50] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1192326|Add abusefilter-modify-restricted to enwiki EFM (T405999)]] [07:29:56] T405999: Add the abusefilter-modify-restricted right to enwiki EFMs - https://phabricator.wikimedia.org/T405999 [07:29:57] (03PS1) 10Awight: UX changes for reference context item [extensions/Cite] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193022 (https://phabricator.wikimedia.org/T404690) [07:30:37] (03CR) 10CI reject: [V:04-1] WIP: sre.hardware.upgrade-firmware: add support for IDRAC 10 [cookbooks] - 10https://gerrit.wikimedia.org/r/1192898 (owner: 10Elukey) [07:30:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Cite] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193009 (https://phabricator.wikimedia.org/T406002) (owner: 10Awight) [07:30:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Cite] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193022 (https://phabricator.wikimedia.org/T404690) (owner: 10Awight) [07:31:23] (03CR) 10Gehel: team-sre: cdn: ignore wdqs-main.discovery.wmnet in ATSBackendErrorsHigh (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141) (owner: 10Ssingh) [07:31:51] [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'mgwiktionary.wikifunctionsclient_usage' doesn't exist Function: 
MediaWiki\Extension\WikiLambda\WikifunctionsClientStore::deleteWikifunctionsUsage Query: DELETE FROM `wikifunction [07:31:53] fun times [07:31:59] I guess I should have looked at the logs earlier [07:32:22] looks like WikiLambda code uses `wikifunctionsclient_usage` while the table has not been deployed on the database systems [07:32:23] fun [07:32:40] jmm@cumin2002 drain-node (PID 954237) is awaiting input [07:32:45] that is https://phabricator.wikimedia.org/T406185 [07:32:50] how fun :) [07:33:16] (03PS1) 10Elukey: cpufrequtils: add support for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1193023 (https://phabricator.wikimedia.org/T405891) [07:34:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [07:35:40] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1193023 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [07:36:23] !log hashar@deploy2002 eggroll97, hashar: Backport for [[gerrit:1192326|Add abusefilter-modify-restricted to enwiki EFM (T405999)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:36:23] (03CR) 10Arnaudb: [C:03+1] "good idea, thanks!"
[software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1193017 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [07:36:30] T405999: Add the abusefilter-modify-restricted right to enwiki EFMs - https://phabricator.wikimedia.org/T405999 [07:36:55] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:36:55] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:36:57] EggRoll97: the patch is on the test server. Are you able to check it? [07:36:59] (03CR) 10Elukey: cpufrequtils: add support for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1193023 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [07:37:18] (with the https://wikitech.wikimedia.org/wiki/WikimediaDebug extension) [07:37:23] hashar: Yes, doing so now [07:37:27] awesome! [07:37:41] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:37:55] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] UX changes for reference context item [extensions/Cite] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193022 (https://phabricator.wikimedia.org/T404690) (owner: 10Awight) [07:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:40:27] (03CR) 10Cappybaraa: "Jenkins bot has verified the patch, please review it and give +2 for ready to merge." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [07:40:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [07:40:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [07:41:01] hashar: lgtm [07:41:04] !log hashar@deploy2002 eggroll97, hashar: Continuing with sync [07:41:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet [07:41:11] thank you for the check [07:41:25] FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:41:32] hashar: mine can go together—and it's also fine to wait for the afternoon window if we run out of time. [07:41:39] EggRoll97: and I guess once the config change has been fully deployed, you can mark the task resolved 🎉 [07:41:50] awight: it is fine, we can do your patch now [07:41:50] Also, I'm happy to self-deploy if you have other fires to put out. [07:41:55] yeah [07:42:09] I wanna sort out the issue with the missing DB table [07:42:24] so you get time to self deploy [07:42:25] +1 :) [07:42:29] hashar: ah great!
[07:42:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet [07:43:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:43:30] hmm [07:44:01] I need to revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1172048 [07:45:28] hashar: kk I've used "interrupt job" so my backport doesn't take over [07:45:30] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192326|Add abusefilter-modify-restricted to enwiki EFM (T405999)]] (duration: 15m 40s) [07:45:36] T405999: Add the abusefilter-modify-restricted right to enwiki EFMs - https://phabricator.wikimedia.org/T405999 [07:45:44] oh [07:46:17] awight: yeah sorry you could have continued your backport [07:46:22] but hmm let me finish the revert [07:46:27] and we can process both at the same time [07:46:39] ack [07:46:43] (03PS1) 10Hashar: Revert "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193024 [07:47:05] (03PS2) 10Hashar: Revert "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193024 (https://phabricator.wikimedia.org/T406185) [07:48:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [07:49:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet [07:49:25] awight: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1193024 :) [07:49:39] want me to deploy both or are you fine deploying your backport AND this config change? 
[07:50:02] (03PS2) 10Arnaudb: gerrit: reduce mod_qos parallel connections [puppet] - 10https://gerrit.wikimedia.org/r/1193013 [07:51:23] hashar: I can do both now, ty! [07:51:44] awesome thanks [07:51:47] I am around anyway [07:52:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193024 (https://phabricator.wikimedia.org/T406185) (owner: 10Hashar) [07:52:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [extensions/Cite] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193022 (https://phabricator.wikimedia.org/T404690) (owner: 10Awight) [07:52:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [extensions/Cite] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193009 (https://phabricator.wikimedia.org/T406002) (owner: 10Awight) [07:52:32] I *love* that patches can go together now, I don't know if this is new but I plan to use it regularly. 
[07:52:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet [07:52:56] that is handy indeed [07:53:15] so that if you are in the middle of the week, with two versions running concurrently and you need a patch to be applied to both branches [07:53:20] you can `scap backport A B` [07:53:24] and have them both deployed concurrently [07:53:31] (03Merged) 10jenkins-bot: Revert "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193024 (https://phabricator.wikimedia.org/T406185) (owner: 10Hashar) [07:54:02] I am going to grab a coffee [07:54:15] (03Merged) 10jenkins-bot: UX changes for reference context item [extensions/Cite] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193022 (https://phabricator.wikimedia.org/T404690) (owner: 10Awight) [07:54:30] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [07:55:33] (03CR) 10EggRoll97: [C:03+1] diqwiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [07:55:56] (03CR) 10A smart kitten: diqwiki: Add namespace aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [07:56:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [07:57:17] jmm@cumin2002 drain-node (PID 964409) is awaiting input [07:58:47] hashar: Your coffee break is brought to you today by SpiderPig \o/ [07:59:08] thank you SpiderPig! 
[07:59:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [07:59:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T0800) [08:00:28] (03PS1) 10Elukey: redfish: allow HTTP 204 responses in poll_task [software/spicerack] - 10https://gerrit.wikimedia.org/r/1193046 (https://phabricator.wikimedia.org/T392851) [08:00:50] (03CR) 10Federico Ceratto: [C:03+2] site.pp: Configure es2051 mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/1192927 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [08:01:54] awight: the good thing with SpiderPig, is that I can watch your progress bar :] [08:02:12] and the logs [08:02:41] !log root@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2035.codfw.wmnet [08:04:16] yesss! 
Side request that I started whining about elsewhere btw: I'd like to have access to the docker build logs as well [08:05:06] I think it is redirected to a file [08:05:20] !log root@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host wikikube-worker2035.codfw.wmnet [08:05:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet [08:05:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet [08:05:55] and end up in my ~/scap-image-build-and-push-log [08:06:01] (03Merged) 10jenkins-bot: Nasty fix for main ref change in main+details [extensions/Cite] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193009 (https://phabricator.wikimedia.org/T406002) (owner: 10Awight) [08:06:15] but you can surely file a feature request to have the log relayed by scap to its console output [08:06:19] or maybe it is only in verbose mode [08:06:26] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1193023 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [08:06:37] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1193024|Revert "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator" (T406185 T397401 T401682)]], [[gerrit:1193022|UX changes for reference context item (T404690)]], [[gerrit:1193009|Nasty fix for main ref change in main+details (T406002)]] [08:06:52] T406185: Wikimedia\Rdbms\DBQueryError: Error 1146: Table '[wiktionary].wikifunctionsclient_usage' doesn't exist - https://phabricator.wikimedia.org/T406185 [08:06:52] T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. 
- https://phabricator.wikimedia.org/T397401 [08:06:53] T401682: Wikimania in-person request: Enable Wikifunctions client mode on the Wikimedia Incubator - https://phabricator.wikimedia.org/T401682 [08:06:54] T404690: Cosmetic UX changes for context item - https://phabricator.wikimedia.org/T404690 [08:06:54] T406002: Regression: Cannot change main content of main+details - https://phabricator.wikimedia.org/T406002 [08:06:55] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:06:55] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:07:22] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11235959 (10JMeybohm) I'm able to ssh into the node but it does not look good: ` jayme@wikikube-worker2035:~$ ls ls: reading directory '.': Input/output error jayme@wikikube-worker2035:~$ ` I've... 
[08:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:10:08] (03CR) 10Muehlenhoff: osm_master: Create /etc/wikimedia directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: 10Ahmon Dancy) [08:10:47] !log failover Ganeti master in ulsfo to ganeti4008 [08:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:56] (03CR) 10Brouberol: Define airflow-wikidata PG cluster and airflow instance (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:11:49] (03CR) 10Brouberol: "Also, the patch title mentions PG but there's no associated helmfile for PG here." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:12:01] (03CR) 10Brouberol: [C:03+1] airflow-wikidata: define ATS mapping rules and cache settings [puppet] - 10https://gerrit.wikimedia.org/r/1191578 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:13:31] PROBLEM - ganeti-wconfd running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:14:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:14:42] hashar: yes exactly, the default seems to be to log to 08:03 < awight> yesss! 
Side request that I started whining about elsewhere btw: I'd like to have access to the docker build logs as well [08:14:47] ooof sorry [08:15:03] hehe you can't copy and paste out of spiderpig output either <_< [08:15:33] anyway /var/lib/spiderpig/scap-image-build-and-push-log, +1 I'll look through the code and maybe suggest a small improvement [08:16:20] 08:07:57 K8s images build/push output redirected to /var/lib/spiderpig/scap-image-build-and-push-log [08:16:21] yeah [08:16:35] (03CR) 10Elukey: [C:03+2] cpufrequtils: add support for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1193023 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [08:16:51] then most of the time it is a fast step [08:16:59] !log brouberol@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-druid1007.eqiad.wmnet with reason: Hosts are being decomissioned [08:18:46] hashar: My interest fwiw is to see which images are built, so they can be inspected and used to eg. reproduce bugs. But I also discovered that it's nontrivial to run a "restricted" image, for good reasons—maybe my entire wish is bogus, we could discuss on a task. [08:19:20] (03CR) 10Brouberol: [C:04-1] "I don't think this is necessary. This would prevent _any_ pod from running with less than 1G of memory. I think it would be better to set " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192987 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [08:21:45] (03CR) 10Cathal Mooney: [C:03+1] "Was ooo until today but just +1 to confirm the network-side is set to accept anything in this larger subnet already." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1191647 (https://phabricator.wikimedia.org/T375845) (owner: 10Jelto) [08:22:24] I'm really stepping on train time here /o\ [08:26:25] FIRING: [30x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:27:01] it is ok [08:27:12] awkward "Waiting 300 seconds for swift" [08:27:24] yeah well [08:27:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:27:49] :-) it happens. But nicer if we could get a push notification back from the indexer or whatever [08:27:49] it is because at some point we had hit an issue with the docker images not being consistent [08:27:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:28:00] cause the docker-registry is backed up by Swift [08:28:11] FIRING: [2x] SystemdUnitFailed: cpupower.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:28:18] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11235985 (10cmooney) [08:28:18] Ahmon Dancy wrote some patches for the upstream system, but I don't think that got 
packaged/released/deployed etc [08:28:30] so the workaround went to use `sleep(300)` :-\ [08:28:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:31:29] my guess is your patch triggered a full rebuild of the l10n cache [08:32:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:32:53] (and the push to the registry took 10 minutes) [08:33:06] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1193013 (owner: 10Arnaudb) [08:33:11] FIRING: [2x] SystemdUnitFailed: cpupower.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:48] RESOLVED: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:34:11] !log awight@deploy2002 awight, hashar: Backport for [[gerrit:1193024|Revert "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator" (T406185 T397401 T401682)]], [[gerrit:1193022|UX changes for reference context item (T404690)]], [[gerrit:1193009|Nasty fix for main ref change in main+details (T406002)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verif [08:34:11] ied there.
[08:34:19] (03CR) 10Hashar: [C:03+2] Add a banner for a Gerrit switch over maintenance [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1193017 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [08:34:25] T406185: Wikimedia\Rdbms\DBQueryError: Error 1146: Table '[wiktionary].wikifunctionsclient_usage' doesn't exist - https://phabricator.wikimedia.org/T406185 [08:34:26] T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401 [08:34:27] T401682: Wikimania in-person request: Enable Wikifunctions client mode on the Wikimedia Incubator - https://phabricator.wikimedia.org/T401682 [08:34:27] T404690: Cosmetic UX changes for context item - https://phabricator.wikimedia.org/T404690 [08:34:28] T406002: Regression: Cannot change main content of main+details - https://phabricator.wikimedia.org/T406002 [08:34:35] (03PS5) 10Cappybaraa: Namespaces.php: Change Portal Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) [08:35:04] (03Merged) 10jenkins-bot: Add a banner for a Gerrit switch over maintenance [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1193017 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [08:35:34] !log hashar@deploy2002 Started deploy [gerrit/gerrit@3ef5714]: Add a banner for a Gerrit switch over maintenance - T387833 [08:35:35] !log hashar@deploy2002 deploy aborted: Add a banner for a Gerrit switch over maintenance - T387833 (duration: 00m 00s) [08:35:40] !log hashar@deploy2002 Started deploy [gerrit/gerrit@3ef5714]: Add a banner for a Gerrit switch over maintenance - T387833 [08:35:40] T387833: Gerrit failover process - https://phabricator.wikimedia.org/T387833 [08:35:52] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@3ef5714]: 
Add a banner for a Gerrit switch over maintenance - T387833 (duration: 00m 12s) [08:35:55] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:35:55] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:37:07] PROBLEM - SSH on ml-serve1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:37:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:38:33] (03PS4) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [08:38:57] RECOVERY - SSH on ml-serve1012 is OK: SSH OK - OpenSSH_10.0p2 Debian-7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:39:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet [08:39:26] (03PS4) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) [08:41:13] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:41:25] FIRING: [30x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:42:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [08:42:25] (03PS5) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) [08:42:58] awight: are you testing the change? It is asking for confirmation now [08:43:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:43:56] !log awight@deploy2002 awight, hashar: Continuing with sync [08:44:02] (03PS5) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [08:44:36] (03CR) 10Hashar: "That looks good, thank you for the amendment Daniel!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [08:45:17] (03CR) 10Majavah: osm_master: Create /etc/wikimedia directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192566 (https://phabricator.wikimedia.org/T381565) (owner: 10Ahmon Dancy) [08:45:22] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:45:59] (03PS6) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) [08:48:00] (03CR) 10Muehlenhoff: "We can also delete modules/profile/templates/maps/grants-files.sql.erb, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1192877 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:48:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [08:48:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet [08:49:45] (03PS6) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [08:51:25] (03PS1) 10DCausse: cirrus: stop copying ores weighted_tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193052 (https://phabricator.wikimedia.org/T389053) [08:52:02] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:54:38] (03PS7) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) [08:54:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 
processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:55:30] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193024|Revert "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator" (T406185 T397401 T401682)]], [[gerrit:1193022|UX changes for reference context item (T404690)]], [[gerrit:1193009|Nasty fix for main ref change in main+details (T406002)]] (duration: 48m 54s) [08:55:45] T406185: Wikimedia\Rdbms\DBQueryError: Error 1146: Table '[wiktionary].wikifunctionsclient_usage' doesn't exist - https://phabricator.wikimedia.org/T406185 [08:55:45] T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401 [08:55:46] T401682: Wikimania in-person request: Enable Wikifunctions client mode on the Wikimedia Incubator - https://phabricator.wikimedia.org/T401682 [08:55:47] T404690: Cosmetic UX changes for context item - https://phabricator.wikimedia.org/T404690 [08:55:48] T406002: Regression: Cannot change main content of main+details - https://phabricator.wikimedia.org/T406002 [08:56:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [08:56:46] hashar: All done with backports [08:56:56] (03PS1) 10Elukey: cpufrequtils: improve cpupower's config [puppet] - 10https://gerrit.wikimedia.org/r/1193053 (https://phabricator.wikimedia.org/T405891) [08:57:11] cool ! 
[08:57:22] (03CR) 10CI reject: [V:04-1] cpufrequtils: improve cpupower's config [puppet] - 10https://gerrit.wikimedia.org/r/1193053 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [08:57:29] (03PS7) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [08:57:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [08:59:06] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1193053 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [08:59:41] (03PS2) 10Elukey: cpufrequtils: improve cpupower's config [puppet] - 10https://gerrit.wikimedia.org/r/1193053 (https://phabricator.wikimedia.org/T405891) [08:59:56] Good luck with the next steps [09:00:11] (03PS8) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) [09:00:13] jmm@cumin2002 drain-node (PID 995622) is awaiting input [09:00:47] (03PS1) 10Arnaudb: gerrit: typo on --systemd arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1193051 [09:00:48] (03CR) 10Arnaudb: "--systemd was passed with a whitespace" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193051 (owner: 10Arnaudb) [09:01:39] (03PS2) 10Arnaudb: gerrit: typo on --systemd arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1193051 (https://phabricator.wikimedia.org/T387833) [09:01:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [09:01:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
es2051.codfw.wmnet with reason: Setting up new ES host [09:02:04] (03PS8) 10Slyngshede: P:cache::haproxy copy private repo data [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) [09:03:33] I will now run the train [09:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:04:33] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193054 (https://phabricator.wikimedia.org/T405677) [09:04:36] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193054 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [09:04:40] bold! [09:05:35] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193054 (https://phabricator.wikimedia.org/T405677) (owner: 10TrainBranchBot) [09:05:46] (03CR) 10Arnaudb: [C:03+2] gerrit: reduce mod_qos parallel connections [puppet] - 10https://gerrit.wikimedia.org/r/1193013 (owner: 10Arnaudb) [09:06:55] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:06:55] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:08:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:09:25] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [09:09:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet [09:09:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet [09:10:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet [09:10:59] RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [09:13:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:16:48] (03CR) 10Muehlenhoff: [C:03+1] "(the old file will be cleaned out when we decom the old nodes)" [puppet] - 10https://gerrit.wikimedia.org/r/1192877 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:16:57] jmm@cumin2002 drain-node (PID 1003123) is awaiting input [09:17:41] (03PS1) 10Federico Ceratto: es2051.yaml: Prepare host for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1193055 (https://phabricator.wikimedia.org/T402859) [09:17:45] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.21 refs T405677 [09:17:51] T405677: 1.45.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T405677 [09:18:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet [09:19:07] (03CR) 10Elukey: [C:03+2] profile::maps::osm_master: refactor postgres grants [puppet] - 10https://gerrit.wikimedia.org/r/1192877 (https://phabricator.wikimedia.org/T381565) (owner: 
10Elukey) [09:19:39] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [09:19:58] this is me --^ [09:21:13] (03CR) 10Jelto: [C:03+1] gerrit: typo on --systemd arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1193051 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:24:07] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [09:24:50] FIRING: KubernetesCalicoDown: ml-serve1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1012.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:26:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [09:26:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet [09:27:17] (03CR) 10Brouberol: Add some WMF specific network policies to the spark-operator chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:27:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:28:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:32:03] train rolled and the logs look quiet [09:32:59] RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [09:34:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - 
https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [09:34:50] RESOLVED: KubernetesCalicoDown: ml-serve1012.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1012.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:35:02] (03CR) 10A smart kitten: Namespaces.php: Change Portal Namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [09:35:55] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:35:55] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:38:19] (03CR) 10Brouberol: Add some WMF specific network policies to the spark-operator chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:38:23] (03CR) 10Jelto: [C:03+2] gitlab runners: Allow new buildkit-syntax-forwarder gateway [puppet] - 10https://gerrit.wikimedia.org/r/1191486 (https://phabricator.wikimedia.org/T405651) (owner: 10Dduvall) [09:39:51] (03PS1) 10Muehlenhoff: Bump the version of the Linux 6.12 backport [puppet] - 10https://gerrit.wikimedia.org/r/1193065 (https://phabricator.wikimedia.org/T405361) [09:42:13] (03PS1) 10D3r1ck01: session: Lookup authenticated store first before anon store [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193067 
(https://phabricator.wikimedia.org/T402808) [09:43:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [09:43:29] (03CR) 10Btullis: [C:03+1] "Great! Many thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1193065 (https://phabricator.wikimedia.org/T405361) (owner: 10Muehlenhoff) [09:43:30] (03CR) 10Arnaudb: [C:03+2] gerrit: typo on --systemd arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1193051 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:44:23] I am away for lunch [09:44:30] hashar: great job! [09:44:52] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:59] (03CR) 10Brouberol: [C:03+1] "Thanks Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1193065 (https://phabricator.wikimedia.org/T405361) (owner: 10Muehlenhoff) [09:46:32] (03PS9) 10Btullis: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) [09:47:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [09:48:12] (03CR) 10Muehlenhoff: [C:03+2] Bump the version of the Linux 6.12 backport [puppet] - 10https://gerrit.wikimedia.org/r/1193065 (https://phabricator.wikimedia.org/T405361) (owner: 10Muehlenhoff) [09:48:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193067 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [09:49:21] (03CR) 10Brouberol: [C:03+1] Add some WMF specific network policies to the spark-operator chart (033 comments) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:50:15] (03CR) 10Btullis: Add some WMF specific network policies to the spark-operator chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [09:51:20] (03Merged) 10jenkins-bot: gerrit: typo on --systemd arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1193051 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:52:23] (03PS1) 10D3r1ck01: session: Lookup authenticated store first before anon store [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1193069 (https://phabricator.wikimedia.org/T402808) [09:53:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1193069 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [09:54:23] I can't edit en:WP:AIV [09:54:27] [6de5f88d-7f16-450e-af70-beb0066eebd6] 2025-10-02 09:45:03: Fatal exception of type "Wikimedia\Rdbms\DBQueryDisconnectedError" [09:54:33] Rollback works, Twinkle works [09:54:35] editing doesn't [09:54:39] other pages work [09:54:54] AbuseFilter-related issue? Can't see recent changes that could have caused it [09:55:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [09:55:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [09:55:55] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:56:35] (03CR) 10Slyngshede: "No, it would need to trigger a Haproxy reload." 
[puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [09:57:28] (03PS2) 10Samtar: EventStreamConfig and stream registration for watchlist click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) [09:58:09] PROBLEM - Druid overlord on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:58:25] (03CR) 10Samtar: "(Removed values not required)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [09:58:55] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:59:25] !log failover Ganeti master in eqsin to ganeti5007 [09:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1000) [10:01:47] ^ I guess that's related and it's already known. 
Else, https://phabricator.wikimedia.org/T406208 [10:01:51] PROBLEM - ganeti-wconfd running on ganeti5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:02:19] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1222.eqiad.wmnet with reason: Maintenance [10:02:41] !log installing OpenSSL security updates on trixie/bookworm [10:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:56] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:05:56] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:06:10] RECOVERY - Druid overlord on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:09:43] (03CR) 10Btullis: [C:03+2] Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [10:11:21] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1225.eqiad.wmnet with reason: Maintenance [10:12:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [10:13:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [10:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - 
https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [10:15:04] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1239.eqiad.wmnet with reason: Maintenance [10:17:09] (03Merged) 10jenkins-bot: Add some WMF specific network policies to the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192919 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [10:20:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [10:20:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [10:20:45] (03PS1) 10Zabe: Revert "RevisionStore: Find identical revisions without using rev_sha1" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193073 [10:20:57] jouncebot: nowandnext [10:20:58] For the next 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1000) [10:20:58] In 1 hour(s) and 39 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1200) [10:21:06] (03CR) 10Zabe: [C:03+2] Revert "RevisionStore: Find identical revisions without using rev_sha1" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193073 (owner: 10Zabe) [10:23:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193073 (owner: 10Zabe) [10:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:24:48] (03CR) 10CI reject: [V:04-1] Revert 
"RevisionStore: Find identical revisions without using rev_sha1" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193073 (owner: 10Zabe) [10:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:24:54] (03CR) 10Lucas Werkmeister (WMDE): "(one of the gate-and-submit jobs failed so I aborted the other ones)" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193073 (owner: 10Zabe) [10:25:10] (03CR) 10Lucas Werkmeister (WMDE): "(probably just try again)" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193073 (owner: 10Zabe) [10:25:22] (03CR) 10Zabe: [C:03+2] Revert "RevisionStore: Find identical revisions without using rev_sha1" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193073 (owner: 10Zabe) [10:25:56] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:26:56] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:27:44] 06SRE, 10LDAP-Access-Requests: Grant Access to wmde and nda for Maria Lechner WMDE - https://phabricator.wikimedia.org/T406106#11236400 (10WMDE-leszek) not that my opinion matters but I support @DZahn's point > This specific type of request has become standard for WMDE staff afaict. A custom form for it would... 
[10:29:26] (03CR) 10Clément Goubert: [C:03+1] charts: add separate requests/limits for tegola's cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193015 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:29:43] jouncebot: nowandnext [10:29:43] For the next 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1000) [10:29:43] In 1 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1200) [10:32:45] Mvolz: zabe is about to deploy [10:32:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet [10:33:06] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:34:15] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1192573 (https://phabricator.wikimedia.org/T401647) (owner: 10Ahmon Dancy) [10:34:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet [10:35:08] (03CR) 10Ladsgroup: [C:03+1] es2051.yaml: Prepare host for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1193055 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:35:56] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:35:56] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:38:44] Lucas_WMDE: thanks! i last minute slotted in citoid for the empty window but I can do it after zabe is done (there should be enough time i think?) 
lmk when you're done [10:39:17] will do [10:39:48] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1192573 (https://phabricator.wikimedia.org/T401647) (owner: 10Ahmon Dancy) [10:40:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet [10:40:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet [10:41:14] (03Merged) 10jenkins-bot: Revert "RevisionStore: Find identical revisions without using rev_sha1" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193073 (owner: 10Zabe) [10:41:43] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1193073|Revert "RevisionStore: Find identical revisions without using rev_sha1"]] [10:43:31] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:47:37] !log zabe@deploy2002 zabe: Backport for [[gerrit:1193073|Revert "RevisionStore: Find identical revisions without using rev_sha1"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:48:08] (03PS1) 10Clément Goubert: preseed: Set UEFI preseed for wikikube-ctrl2006 [puppet] - 10https://gerrit.wikimedia.org/r/1193075 (https://phabricator.wikimedia.org/T400661) [10:48:25] !log zabe@deploy2002 zabe: Continuing with sync [10:48:30] (03CR) 10Federico Ceratto: [C:03+2] es2051.yaml: Prepare host for es1 [puppet] - 10https://gerrit.wikimedia.org/r/1193055 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:48:36] !log failover Ganeti master in drmrs01 to ganeti6001 [10:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:40] PROBLEM - ganeti-wconfd running on ganeti6003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:50:59] (03CR) 10Cathal Mooney: lvs1018: remove L2 sub-interface config for row E/F vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191109 (https://phabricator.wikimedia.org/T405499) (owner: 10Cathal Mooney) [10:51:44] (03CR) 10Cathal Mooney: lvs1018: remove L2 sub-interface config for row E/F vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191109 (https://phabricator.wikimedia.org/T405499) (owner: 10Cathal Mooney) [10:52:49] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193073|Revert "RevisionStore: Find identical revisions without using rev_sha1"]] (duration: 11m 06s) [10:53:38] Mvolz: over to you [10:53:48] FIRING: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:55:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11236473 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium a:05Kappakayala→03Clement_Goubert `wikikube-ctrl1001` is waiting for decom/derack and 
can proba... [10:55:56] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:57:18] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling restart_daemons on A:ncredir [10:57:56] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [10:59:00] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:59:08] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:59:48] (03PS2) 10Clément Goubert: Handle transform/wikitext/to/lint(.*) requests routed to the gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189938 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [10:59:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet [11:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1100). [11:00:20] zabe ty! [11:02:18] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [11:02:36] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:02:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [11:04:56] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [11:05:11] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[11:06:34] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192299 (owner: 10PipelineBot) [11:07:14] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:08:17] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192299 (owner: 10PipelineBot) [11:08:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling restart_daemons on A:ncredir [11:11:04] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:12:18] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:12:37] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:13:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:14:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti6003.drmrs.wmnet [11:14:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti6003.drmrs.wmnet [11:14:22] (03PS1) 10Btullis: Allow the spark-operator webhook to contact the k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193079 (https://phabricator.wikimedia.org/T405490) [11:14:30] (03CR) 10CI reject: [V:04-1] Allow the spark-operator webhook to contact the k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193079 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [11:14:44] PROBLEM - ganeti-wconfd running on 
ganeti6003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:15:06] (03PS2) 10Btullis: Allow the spark-operator webhook to contact the k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193079 (https://phabricator.wikimedia.org/T405490) [11:16:03] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:17:27] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:18:20] !log installing postgresql security updates on netboxdb nodes [11:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:16] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:19:44] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:20:50] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:21:15] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:21:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11236523 (10Jclark-ctr) Thanks @jcrespo for the assistance [11:25:07] PROBLEM - Druid historical on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:26:07] PROBLEM - Druid middlemanager on druid1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:26:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet [11:28:21] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [11:29:55] (03CR) 10Btullis: [C:03+2] Allow the spark-operator webhook to contact the k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193079 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [11:32:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [11:32:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet [11:33:30] (03PS1) 10Jcrespo: backup: Split repos dedicated storage out of the production jobs [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) [11:33:59] (03CR) 10CI reject: [V:04-1] backup: Split repos dedicated storage out of the production jobs [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [11:35:03] !log failover Ganeti master in drmrs02 to ganeti6002 [11:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:07] RECOVERY - Druid historical on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:36:07] RECOVERY - Druid middlemanager on druid1008 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:36:10] FIRING: BFDdown: BFD session down between cr1-esams and 2a02:ec80:300:fe09::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:37:15] (03Merged) 10jenkins-bot: Allow the spark-operator webhook to contact the k8s API [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1193079 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [11:37:41] PROBLEM - ganeti-wconfd running on ganeti6004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:38:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11236548 (10cmooney) >>! In T404959#11229706, @VRiley-WMF wrote: > Hey @cmooney is there a good time to schedual this move? Hey @VRil... [11:39:41] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:40:22] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:41:10] RESOLVED: BFDdown: BFD session down between cr1-esams and 2a02:ec80:300:fe09::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:41:33] (03PS1) 10Arnaudb: gerrit: switchover from gerrit1003 to gerrit2003 [dns] - 10https://gerrit.wikimedia.org/r/1193082 (https://phabricator.wikimedia.org/T387833) [11:43:56] (03PS2) 10Stevemunene: Define airflow-wikidata PG cluster and airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) [11:44:15] (03PS2) 10Jcrespo: backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) [11:44:15] (03PS1) 10Jcrespo: backup: Setup backup1012 & backup2012 as new repo dedicated storages 
[puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) [11:44:17] (03PS1) 10Jcrespo: bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) [11:44:56] (03CR) 10CI reject: [V:04-1] backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [11:45:12] (03CR) 10CI reject: [V:04-1] bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [11:45:17] !log stevemunene@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on druid[1007-1008].eqiad.wmnet with reason: Decommissioning druid_public hosts [11:46:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet [11:47:35] (03PS1) 10Btullis: Use a single networkpolicy for k8s-api to the spark-operator webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193085 (https://phabricator.wikimedia.org/T405490) [11:47:46] (03PS2) 10Btullis: Use a single networkpolicy for k8s-api to the spark-operator webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193085 (https://phabricator.wikimedia.org/T405490) [11:49:56] jmm@cumin2002 drain-node (PID 1080805) is awaiting input [11:54:01] (03PS2) 10Jcrespo: bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) [11:54:36] (03PS3) 10Stevemunene: Define airflow-wikidata airflow instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) [11:54:36] (03CR) 10CI reject: [V:04-1] bacula: Setup new job config to be able to use repo storages [puppet] - 
10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [11:55:54] jmm@cumin2002 drain-node (PID 1080805) is awaiting input [11:55:58] (03PS3) 10Jcrespo: backups: Migrate Gerrit and GitLab backups to new storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) [11:58:01] (03PS3) 10Jcrespo: bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) [11:58:30] (03CR) 10Stevemunene: [C:03+1] Use a single networkpolicy for k8s-api to the spark-operator webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193085 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [11:58:36] (03CR) 10CI reject: [V:04-1] bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1200) [12:02:38] (03CR) 10Stevemunene: Define airflow-wikidata airflow instance (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190975 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [12:03:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [12:04:06] (03CR) 10Clément Goubert: "Updated with the new regex format, should be good to go" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189938 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [12:06:14] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone_es of es2028.codfw.wmnet onto es2051.codfw.wmnet [12:06:18] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2028 - Depool es2028.codfw.wmnet to then clone it to es2051.codfw.wmnet - fceratto@cumin1002 [12:06:38] !log fceratto@cumin1002 
END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2028 - Depool es2028.codfw.wmnet to then clone it to es2051.codfw.wmnet - fceratto@cumin1002 [12:09:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [12:09:38] fceratto@cumin1002 clone_es (PID 248426) is awaiting input [12:09:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet [12:10:18] !log failover Ganeti master in codfw to ganeti2048 [12:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:54] (03CR) 10JMeybohm: [C:03+1] charts: add separate requests/limits for tegola's cronjob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193015 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [12:12:28] PROBLEM - ganeti-wconfd running on ganeti2032 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:13:53] (03PS2) 10Jcrespo: backup: Setup backup1012 & backup2012 as new repo dedicated storages [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) [12:14:51] (03PS3) 10Jcrespo: backup: Setup backup1012 & backup2012 as new repo dedicated storages [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) [12:14:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet [12:18:00] jmm@cumin2002 drain-node (PID 1094609) is awaiting input [12:20:10] (03CR) 10Jcrespo: [C:04-1] "This is not yet ready to be merged (needs setup of bacula first), but please start giving me some feedback in case it could become a block" [puppet] - 10https://gerrit.wikimedia.org/r/1193081 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [12:20:22] ml-etcd2001 will go down for a Ganeti reboot [12:21:52] (03PS4) 10Jcrespo: 
bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) [12:21:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet [12:22:27] (03CR) 10CI reject: [V:04-1] bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [12:22:40] (03CR) 10Btullis: [C:03+2] Use a single networkpolicy for k8s-api to the spark-operator webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193085 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [12:23:35] (03PS5) 10Jcrespo: bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) [12:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:23:55] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:24:38] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1193089 (owner: 10L10n-bot) [12:25:27] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.47 ms [12:26:22] (03PS1) 10DCausse: cirrus: test completion with default sort on simplewiki [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193091 (https://phabricator.wikimedia.org/T404858) [12:26:23] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [12:26:24] (03PS1) 10DCausse: cirrus: test completion with default sort on simplewiki [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193092 (https://phabricator.wikimedia.org/T404858) [12:26:27] (03PS1) 10DCausse: cirrus: test completion with default sort on simplewiki [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193093 (https://phabricator.wikimedia.org/T404858) [12:27:01] (03CR) 10Muehlenhoff: [C:03+2] Add maps1012 to maps1014 as replicas [puppet] - 10https://gerrit.wikimedia.org/r/1191241 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:28:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet [12:28:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet [12:29:25] RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [12:29:30] (03PS6) 10Jcrespo: bacula: Setup new job config to be able to use repo storages [puppet] - 10https://gerrit.wikimedia.org/r/1193084 (https://phabricator.wikimedia.org/T403946) [12:29:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804#11236633 (10Ladsgroup) sdb is missing altogether: ` ladsgroup@dbproxy1024:~$ lsblk -a NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS loop0 7:0 0 0B 0 loop loop1... [12:30:13] (03Merged) 10jenkins-bot: Use a single networkpolicy for k8s-api to the spark-operator webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193085 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [12:30:49] (03CR) 10Jon Harald Søby: "Looking good now!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [12:31:47] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:32:29] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:32:30] (03CR) 10Jon Harald Søby: Namespaces.php: Change Portal Namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [12:32:44] (03CR) 10Jon Harald Søby: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) (owner: 10Cappybaraa) [12:33:11] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:35:57] (03PS2) 10DDesouza: Update and deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410) [12:39:16] (03PS1) 10Kosta Harlan: WikimediaEventsAuthPreserveQueryParamsExperiments: Define hCaptcha A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193096 (https://phabricator.wikimedia.org/T405239) [12:39:25] FIRING: SystemdUnitFailed: postgresql@15-main.service on maps1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:39:58] (03PS3) 10DDesouza: Update reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410) [12:41:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [12:41:40] FIRING: [29x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:25] RESOLVED: SystemdUnitFailed: postgresql@15-main.service on maps1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:55] FIRING: SystemdUnitFailed: postgresql@15-main.service on maps1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:21] hashar: Argh, sorry, didn't run the DB creation step, yes. Bloody mwscript-k8s [12:46:25] FIRING: [30x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:37] * James_F will run again with non-k8s mwscript so it actually works. [12:47:18] Oh, but of course I can't because you've reverted the dblist. [12:47:19] Eurgh. [12:47:50] James_F: hi! [12:48:03] hashar: A ping would have solved this in 5 seconds. :-P [12:48:05] i confess I haven't think twice and just went to revert the change :b [12:48:15] Indeed, I can see that. [12:48:24] wasn't it like 4am for you? :] [12:48:35] anyway, the change can be remade anytime [12:48:43] Yes, but I was around. 
[12:48:48] the thing is I have no idea how db tables are supposed to be created [12:48:57] It's in the runbook. [12:49:34] anyway we can redo it [12:49:47] (PS1) Jforrester: Revert^2 "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator" [mediawiki-config] - https://gerrit.wikimedia.org/r/1193098 [12:51:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192857 (https://phabricator.wikimedia.org/T399631) (owner: Gergő Tisza) [12:52:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2046.codfw.wmnet [12:53:28] (CR) Elukey: [C:+2] charts: add separate requests/limits for tegola's cronjob [deployment-charts] - https://gerrit.wikimedia.org/r/1193015 (https://phabricator.wikimedia.org/T381565) (owner: Elukey) [12:54:07] hashar: OK, DBs created. Shall we deploy? [12:54:44] Oh, bah, the backport window is very soon. [12:54:58] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1193098 (owner: Jforrester) [12:55:29] oh, and it looks pretty full too [12:55:53] Yeah, but we can slipstream a bunch of them together. [12:56:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2046.codfw.wmnet [12:57:57] !log jhancock@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-worker2035'] [12:58:04] fceratto@cumin1002 clone_es (PID 248426) is awaiting input [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1300). [13:00:05] mfossati, xSavitar, danisztls, tgr, and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:14] o/ [13:00:15] Still here. :-) [13:00:23] Lucas_WMDE: Are you deploying? [13:00:39] I can [13:00:44] o/ [13:00:45] though I guess there’ll be some level of self-service too :) [13:00:51] * Lucas_WMDE looks at the calendar [13:01:07] Lucas_WMDE, you can do mine today :) [13:01:32] I don’t see mfossati yet [13:01:39] o/ [13:01:44] let’s start with danisztls and run the gate-and-submit for xSavitar’s backports in the meantime [13:01:56] James_F, hi o/ [13:01:58] the changes by tgr_ and James_F both look a bit scary and I think I’d rather separate those ^^ [13:02:09] Lucas_WMDE: Ack. [13:02:30] danisztls: want to self-service your deployment? 
[13:02:33] otherwise I can do it [13:02:37] I can self-service [13:02:41] ok, go ahead :) [13:02:42] the two session changes (the config change and the backports) should preferably go out at different times so we can tell which one caused a change in log volume etc [13:02:49] ack [13:03:05] (CR) TrainBranchBot: [C:+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410) (owner: DDesouza) [13:03:14] I’ll +2 the backports as soon as scap has started running properly for the config change [13:03:20] they do different kinds of things and it's unlikely they would cause similar issues, but just in case [13:04:07] (Merged) jenkins-bot: Update reader foundational survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1192635 (https://phabricator.wikimedia.org/T405410) (owner: DDesouza) [13:04:30] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1192635|Update reader foundational survey on enwiki (T405410)]] [13:04:34] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [13:04:40] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.45.0-wmf.21) - https://gerrit.wikimedia.org/r/1193067 (https://phabricator.wikimedia.org/T402808) (owner: D3r1ck01) [13:04:44] (CR) Lucas Werkmeister (WMDE): [C:+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.45.0-wmf.20) - https://gerrit.wikimedia.org/r/1193069 (https://phabricator.wikimedia.org/T402808) (owner: D3r1ck01) [13:05:13] (PS1) Muehlenhoff: Also Add maps1012-maps1014 as maps nodes [puppet] - https://gerrit.wikimedia.org/r/1193107 (https://phabricator.wikimedia.org/T381565) [13:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:08:46] (03PS1) 10Ladsgroup: db-production: Enable shuffle sharding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193108 (https://phabricator.wikimedia.org/T405087) [13:10:16] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [13:10:39] !log dani@deploy2002 dani: Backport for [[gerrit:1192635|Update reader foundational survey on enwiki (T405410)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:10:42] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [13:11:35] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox [13:11:56] !log dani@deploy2002 dani: Continuing with sync [13:13:00] (03CR) 10Muehlenhoff: [C:03+2] Also Add maps1012-maps1014 as maps nodes [puppet] - 10https://gerrit.wikimedia.org/r/1193107 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:13:44] (03PS3) 10Bking: dse-k8s-eqiad: bump up minimum pod resources for opensearch-test ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192987 (https://phabricator.wikimedia.org/T397246) [13:15:09] (03PS1) 10DDesouza: Deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193110 (https://phabricator.wikimedia.org/T405410) [13:16:24] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192635|Update reader foundational survey on enwiki (T405410)]] (duration: 11m 54s) [13:16:30] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [13:17:01] * Lucas_WMDE takes over for the backports [13:17:26] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-ntp rolling restart_daemons on A:dnsbox 
[13:17:41] > branch 'wmf/1.45.0-wmf.20' not found in any deployed wikiversion. Deployed wikiversions: ['1.45.0-wmf.21'] [13:17:42] huh [13:17:47] yes, continue with backport [13:17:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193067 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:17:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1193069 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:18:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193110 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [13:19:38] (03PS1) 10JMeybohm: sre.k8s: Handle errors in kubectl_version() [cookbooks] - 10https://gerrit.wikimedia.org/r/1193111 (https://phabricator.wikimedia.org/T406200) [13:20:37] (03Merged) 10jenkins-bot: session: Lookup authenticated store first before anon store [core] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193067 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:20:43] (03Merged) 10jenkins-bot: session: Lookup authenticated store first before anon store [core] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1193069 (https://phabricator.wikimedia.org/T402808) (owner: 10D3r1ck01) [13:21:07] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1193067|session: Lookup authenticated store first before anon store (T402808)]], [[gerrit:1193069|session: Lookup authenticated store first before anon store (T402808)]] [13:21:11] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - 
https://phabricator.wikimedia.org/T402808 [13:23:50] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [13:24:01] (03PS5) 10Tiziano Fogli: metamonitoring: replace Gunicorn with uWSGI [puppet] - 10https://gerrit.wikimedia.org/r/1193109 (https://phabricator.wikimedia.org/T397003) [13:24:02] (03CR) 10Tiziano Fogli: "Prometheus/Thanos metamonitoring notifications will remain silenced until this patch is merged to avoid false positives." [puppet] - 10https://gerrit.wikimedia.org/r/1193109 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [13:24:16] (03PS3) 10Ssingh: team-sre/cdn: ignore (wdqs-main|wdqs-scholarly|wcqs).discovery.wmnet in ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141) [13:24:40] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:53] (03CR) 10Ssingh: team-sre/cdn: ignore (wdqs-main|wdqs-scholarly|wcqs).discovery.wmnet in ATSBackendErrorsHigh (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1192940 (https://phabricator.wikimedia.org/T406141) (owner: 10Ssingh) [13:27:14] !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde: Backport for [[gerrit:1193067|session: Lookup authenticated store first before anon store (T402808)]], [[gerrit:1193069|session: Lookup authenticated store first before anon store (T402808)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:27:18] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:27:22] xSavitar: anything to test? 
[13:27:41] Lucas_WMDE, let me just test that login works normally but the real thing is for logstash logs to drop [13:27:45] Give me 1 min [13:28:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11236873 (10ssingh) @BCornwall from Traffic will be working on this, thanks! [13:28:56] ok [13:29:32] Just tested login on meta-wiki and edge-login still works on hewiki, cawiki, etc. Please sync Lucas_WMDE [13:29:41] !log lucaswerkmeister-wmde@deploy2002 d3r1ck01, lucaswerkmeister-wmde: Continuing with sync [13:30:10] Impact on logs will be seen only after the syncing happens. So I have an eye on that [13:30:22] once this deploy is done, I suggest we move James_F ahead of tgr_, just to have a bigger gap in the timestamps between the two session deploys [13:30:32] Sure. [13:30:49] jhancock@cumin1002 upgrade-firmware (PID 341922) is awaiting input [13:33:59] Lucas_WMDE, logs are reducing slowly :) [13:34:03] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193067|session: Lookup authenticated store first before anon store (T402808)]], [[gerrit:1193069|session: Lookup authenticated store first before anon store (T402808)]] (duration: 12m 56s) [13:34:05] nice [13:34:08] T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode - https://phabricator.wikimedia.org/T402808 [13:34:15] Good to go? 
[13:34:18] yup, go ahead [13:34:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193098 (owner: 10Jforrester) [13:35:19] (03Merged) 10jenkins-bot: Revert^2 "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193098 (owner: 10Jforrester) [13:35:41] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1193098|Revert^2 "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator"]] [13:35:49] PROBLEM - Host wikikube-worker2035 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:15] Lucas_WMDE, from 2.5K down to 0: https://logstash.wikimedia.org/goto/969bee9b1be1780a7d8ead05f30825cc [13:36:22] Thanks for deploying. [13:36:25] FIRING: [30x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:36:36] \o/ [13:39:07] (03PS4) 10Bking: dse-k8s-eqiad: bump up minimum pod resources for opensearch-test ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192987 (https://phabricator.wikimedia.org/T397246) [13:39:49] !log jayme@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker2035.codfw.wmnet with reason: Hardware failure [13:41:04] (03CR) 10Tiziano Fogli: "@kherron@wikimedia.org: Quick question before taking a closer look at the patch: did you test it on Pontoon? 
If not, I can do that tomorro" [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [13:41:22] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-worker2035'] [13:41:38] (03PS5) 10Bking: dse-k8s-eqiad: bump up minimum pod resources for opensearch-test ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192987 (https://phabricator.wikimedia.org/T397246) [13:41:41] There's always one debug sync that seems to take forever. [13:41:45] Is it always the same one? [13:41:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [13:42:04] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1193098|Revert^2 "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:42:17] RECOVERY - Host wikikube-worker2035 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [13:42:37] (03CR) 10Bking: [V:03+2 C:03+2] dse-k8s-eqiad: bump up minimum pod resources for opensearch-test ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192987 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [13:42:51] !log jforrester@deploy2002 jforrester: Continuing with sync [13:43:03] no idea [13:43:52] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. 
[13:44:17] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1193053 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [13:44:32] (03PS2) 10Kosta Harlan: MetricsPlatformAuthPreserveQueryParamsExperiments: Define hCaptcha A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193096 (https://phabricator.wikimedia.org/T405239) [13:44:35] !log failover Ganeti master in eqiad to ganeti1048 [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:38] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [13:44:52] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:20] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193098|Revert^2 "Enable Wikifunctions client mode on Wiktionaries, Part III, and Incubator"]] (duration: 11m 39s) [13:47:23] PROBLEM - ganeti-wconfd running on ganeti1046 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:47:38] Over to tgr_ . 
[13:48:31] (03PS1) 10Stevemunene: Add the analytics-research keytab to the stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/1193117 (https://phabricator.wikimedia.org/T403207) [13:48:43] thx [13:48:59] RECOVERY - MD RAID on wikikube-worker2035 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:49:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192857 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [13:49:45] Lucas_WMDE: I suppose that mfossati's isn't being done, as they're not here? [13:50:00] looks like it [13:50:09] Ack, will mark as such on the calendar. [13:50:17] they’re afk in slack too afaict [13:50:37] (03Merged) 10jenkins-bot: Enable JWT session cookies on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192857 (https://phabricator.wikimedia.org/T399631) (owner: 10Gergő Tisza) [13:50:48] Maybe I should file a SpiderPig request for it to mark patches as Done in the calendar. [13:50:58] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1192857|Enable JWT session cookies on group1 (T399631)]] [13:51:02] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [13:51:10] I never bother with that tbh [13:51:16] if the software did it automatically I wouldn’t mind ^^ [13:51:25] (btw you marked tgr_’s config change as done instead of yours :P) [13:51:32] * James_F nods. [13:51:34] Oh, bugger. [13:51:48] Another reason why software is better than humans for this. :-) [13:51:51] :D [13:54:23] Filed as T406229 if you care to subscribe. 
[13:54:24] T406229: Mark patches as done (`{{deploy|…|status=d}}`) on the Deployments calendar once it's successfully deployed - https://phabricator.wikimedia.org/T406229 [13:54:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11236988 (10cmooney) >>! In T404959#11236872, @ssingh wrote: > @BCornwall from Traffic will be working on this, thanks! Thanks @ssingh... [13:56:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2046.codfw.wmnet [13:56:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet [13:58:10] FIRING: BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:58:36] !log tgr@deploy2002 tgr: Backport for [[gerrit:1192857|Enable JWT session cookies on group1 (T399631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:58:40] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [14:01:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet [14:01:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2046.codfw.wmnet [14:02:23] (03CR) 10Elukey: [C:03+2] cpufrequtils: improve cpupower's config [puppet] - 10https://gerrit.wikimedia.org/r/1193053 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [14:03:10] RESOLVED: BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:04:14] !log tgr@deploy2002 tgr: Continuing with sync [14:04:40] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1046.eqiad.wmnet [14:07:19] (03PS1) 10Bking: dse-k8s-eqiad: Add opensearch namespaces as ceph CSI clients [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193123 (https://phabricator.wikimedia.org/T397246) [14:08:38] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192857|Enable JWT session cookies on group1 (T399631)]] (duration: 17m 41s) [14:08:43] T399631: Deploy JWT cookies to production - https://phabricator.wikimedia.org/T399631 [14:08:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [14:08:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1046.eqiad.wmnet 
[14:09:48] FIRING: PuppetFailure: Puppet has failed on maps1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:10:04] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox [14:10:08] !log UTC afternoon deploys done [14:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:34] \o/ [14:10:35] thanks all! [14:13:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11237050 (10Jclark-ctr) a:03Jclark-ctr [14:13:11] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:48] RESOLVED: PuppetFailure: Puppet has failed on ml-serve1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:14:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1046.eqiad.wmnet [14:14:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1046.eqiad.wmnet [14:14:39] (03PS2) 10Bking: dse-k8s-eqiad: Add opensearch namespaces as ceph CSI tenants [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193123 (https://phabricator.wikimedia.org/T397246) [14:14:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [14:14:48] RESOLVED: PuppetFailure: Puppet has failed on maps1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [14:15:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:16:55] FIRING: [4x] SystemdUnitFailed: postgresql@15-main.service on maps1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:17:16] !log drain transport circuit cr1-eqiad <-> cr1-codfw to allow for PIC card reboot on cr1-eqiad T402588 [14:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:20] T402588: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588 [14:18:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet [14:19:11] FIRING: NetworkDeviceAlarmActive: Alarm active on cr1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [14:20:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [14:21:55] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Bind the K8s API service on v6 [puppet] - 
10https://gerrit.wikimedia.org/r/1193124 (https://phabricator.wikimedia.org/T405078) [14:23:59] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1193109 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [14:24:28] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Bind the K8s API service on v6 [puppet] - 10https://gerrit.wikimedia.org/r/1193124 (https://phabricator.wikimedia.org/T405078) (owner: 10Majavah) [14:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:25:58] !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache 'k8s.svc.toolsbeta.eqiad1.wikimedia.cloud$' on eqiad recursors [14:25:59] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'k8s.svc.toolsbeta.eqiad1.wikimedia.cloud$' on eqiad recursors [14:26:29] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [14:26:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [14:26:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet [14:26:59] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [14:27:42] (03CR) 10JHathaway: redfish: allow HTTP 204 responses in poll_task (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1193046 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [14:28:38] !log drain link from cr1-eqiad <-> ssw1-e1-eqiad to allow PIC card reboot on cr1-eqiad T402588 [14:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] T402588: Eqiad: row C/D switch refresh configuration task - 
https://phabricator.wikimedia.org/T402588 [14:29:07] (03CR) 10Elukey: redfish: allow HTTP 204 responses in poll_task (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1193046 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1430) [14:30:44] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Fix TLS on IPv6 listener [puppet] - 10https://gerrit.wikimedia.org/r/1193125 (https://phabricator.wikimedia.org/T405078) [14:32:37] (03PS3) 10Samtar: EventStreamConfig and stream registration for watchlist click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) [14:33:30] !incidents [14:33:31] 6821 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Yet another test page [14:33:31] 6820 (RESOLVED) wmf - metamonitoring - thanos - notified - vip is now DOWN [14:33:31] 6819 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Another test page [14:33:31] 6818 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Test page [14:33:36] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Fix TLS on IPv6 listener [puppet] - 10https://gerrit.wikimedia.org/r/1193125 (https://phabricator.wikimedia.org/T405078) (owner: 10Majavah) [14:36:48] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cr[1-2]-eqiad,ssw1-e1-eqiad with reason: reset PIC 0/1 in cr1-eqiad to set port 5 speed [14:36:50] !log reset PIC 0/1 on cr1-eqiad to set port speed for port 5 T402588 [14:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:54] T402588: Eqiad: row C/D switch refresh configuration task - https://phabricator.wikimedia.org/T402588 [14:36:55] 06SRE, 06Infrastructure-Foundations, 10netops: Eqiad: row C/D switch refresh configuration task - 
https://phabricator.wikimedia.org/T402588#11237167 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=626cec35-f6f7-443b-90fb-3024162d9dc9) set by cmooney@cumin1003 for 0:10:00 on 3 host(s) and... [14:41:48] (03PS4) 10Jcrespo: backup: Setup backup1012 & backup2012 as new repo dedicated storages [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) [14:41:49] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [14:41:52] (03CR) 10DCausse: [C:04-1] cirrus: test completion with default sort on simplewiki [1/3] (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193091 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [14:43:57] (03PS1) 10Ssingh: haptcha: add new role for hCaptcha proxy [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) [14:44:24] (03CR) 10CI reject: [V:04-1] haptcha: add new role for hCaptcha proxy [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [14:45:01] (03PS2) 10Ssingh: haptcha: add new role for hCaptcha proxy [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) [14:45:52] (03CR) 10Alexandros Kosiaris: [C:03+1] WMF-Uniq -> analytics: better stats & privacy [puppet] - 10https://gerrit.wikimedia.org/r/1191708 (https://phabricator.wikimedia.org/T405783) (owner: 10CDanis) [14:49:11] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [14:53:46] (03CR) 10Brennen Bearnes: [C:03+1] phabricator: drop cluster_search config [puppet] - 
10https://gerrit.wikimedia.org/r/1192636 (https://phabricator.wikimedia.org/T403948) (owner: 10Dzahn) [14:54:40] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:46] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:56:11] (03PS1) 10Dzahn: cyberbot: use wmflib::debian_php_version to pick PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1193129 [14:56:37] (03CR) 10CI reject: [V:04-1] cyberbot: use wmflib::debian_php_version to pick PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1193129 (owner: 10Dzahn) [14:57:11] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 07Documentation, 07Puppet (Puppet 7.0): Puppet7: Update documentation - https://phabricator.wikimedia.org/T341095#11237242 (10jhathaway) As @jcrespo pointed out on IRC, there is also a quite a bit of puppet 5 documentation which needs to be re... [14:58:18] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new dns names for cr1-eqiad et-1/0/5.100 interface IPs - cmooney@cumin1003" [14:58:23] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new dns names for cr1-eqiad et-1/0/5.100 interface IPs - cmooney@cumin1003" [14:58:23] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:59] (03PS1) 10Dzahn: ci: use wmflib::debian_php_version to determine PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1193131 [15:00:05] jasmine_: I, the Bot under the Fountain, call upon thee, The Deployer, to do DC Switchover: Day 8 - Eqiad Repool deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1500). 
[15:00:05] hashar and brennen: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1500) [15:00:22] (03PS2) 10Dzahn: cyberbot: use wmflib::debian_php_version to pick PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1193129 [15:01:06] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: Add opensearch namespaces as ceph CSI tenants [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193123 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:01:19] (03PS1) 10Reedy: CommonSettings.php: Replace usage of $wgCaptchaWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193132 (https://phabricator.wikimedia.org/T277936) [15:01:29] (03CR) 10CI reject: [V:04-1] ci: use wmflib::debian_php_version to determine PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1193131 (owner: 10Dzahn) [15:02:05] (03PS1) 10Elukey: prometheus: update the amd-rocm exporter [puppet] - 10https://gerrit.wikimedia.org/r/1193133 (https://phabricator.wikimedia.org/T403697) [15:02:23] (03CR) 10Cathal Mooney: [C:03+2] ssw1-d8-eqiad: add bgp peerings to CR and Juniper spines [homer/public] - 10https://gerrit.wikimedia.org/r/1191429 (https://phabricator.wikimedia.org/T396063) (owner: 10Cathal Mooney) [15:02:45] (03CR) 10CI reject: [V:04-1] prometheus: update the amd-rocm exporter [puppet] - 10https://gerrit.wikimedia.org/r/1193133 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [15:03:52] (03Merged) 10jenkins-bot: ssw1-d8-eqiad: add bgp peerings to CR and Juniper spines [homer/public] - 10https://gerrit.wikimedia.org/r/1191429 (https://phabricator.wikimedia.org/T396063) (owner: 10Cathal Mooney) [15:03:55] FIRING: [4x] SystemdUnitFailed: postgresql@15-main.service on maps1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:18] (03CR) 10Bking: [C:03+2] 
dse-k8s-eqiad: Add opensearch namespaces as ceph CSI tenants [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193123 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [15:04:56] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11237276 (10Jhancock.wm) @JMeybohm i updated the idrac and bios. it's still not showing a bad disk in the idrac or in person. Could you check if the issue persists? [15:05:30] (03CR) 10JHathaway: [C:03+1] redfish: allow HTTP 204 responses in poll_task (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1193046 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [15:06:50] (03PS2) 10Elukey: prometheus: update the amd-rocm exporter [puppet] - 10https://gerrit.wikimedia.org/r/1193133 (https://phabricator.wikimedia.org/T403697) [15:08:42] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:32] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:10:06] (03CR) 10Herron: [C:03+1] metamonitoring: replace Gunicorn with uWSGI [puppet] - 10https://gerrit.wikimedia.org/r/1193109 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [15:10:31] (03CR) 10Herron: [C:03+2] vo-escalate: absent timer [puppet] - 10https://gerrit.wikimedia.org/r/1192610 (owner: 10Herron) [15:12:01] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11237306 (10cmooney) 05Resolved→03Open Hey @Jclark-ctr Thanks for connecting the optics. I see the local link in the rack to the leaf there as down though, can you check?... 
[15:12:21] !log jhancock@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [15:12:35] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install wikikube-ctrl2006 - https://phabricator.wikimedia.org/T400661#11237310 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1002 for host wikikube-ctrl2006.codfw.wmnet with OS bookworm [15:14:22] (03CR) 10Dzahn: "as a side effect this will also unblock supoort for trixie and future Debian versions without having to touch this code again" [puppet] - 10https://gerrit.wikimedia.org/r/1193129 (owner: 10Dzahn) [15:14:33] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11237325 (10Jclark-ctr) @cmooney Will check it shortly. If you want to turn on the e1/f1 I did connect them for the time being since d8 still has a bit more work from Valerie.... [15:14:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192578 (https://phabricator.wikimedia.org/T403259) (owner: 10Marco Fossati) [15:15:21] (03CR) 10Dzahn: [C:04-1] "rspec tests fail - too specific about PHP versions" [puppet] - 10https://gerrit.wikimedia.org/r/1193131 (owner: 10Dzahn) [15:15:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:17:45] Lucas_WMDE, James_F: apologies, I've totally overlooked the previous backport window, have just rescheduled! 
[15:18:18] np, good luck with the reschedule :) [15:18:36] (03CR) 10Vgutierrez: P:cache::haproxy copy private repo data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1192846 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [15:18:45] (03Abandoned) 10Dzahn: ci: use wmflib::debian_php_version to determine PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1193131 (owner: 10Dzahn) [15:21:12] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11237353 (10cmooney) >>! In T401238#11237325, @Jclark-ctr wrote: > @cmooney Will check it shortly. If you want to turn on the e1/f1 I did connect them for the time being since... [15:28:55] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:28] (03PS1) 10Santiago Faci: xLab: Deploying v1.0.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193136 (https://phabricator.wikimedia.org/T396578) [15:30:04] (03CR) 10Clare Ming: [C:03+1] EventStreamConfig and stream registration for watchlist click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [15:30:08] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:30:09] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:31:02] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:31:04] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2049.codfw.wmnet'] 
[15:33:03] hi folks, for visibility: we'll be repooling eqiad shortly after the staff meeting - we should be done prior to the end of the late UTC infra window [15:33:55] FIRING: [2x] SystemdUnitFailed: postgresql@15-main.service on maps1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:52] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:55] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:36:56] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:37:27] (03CR) 10Jcrespo: [C:04-2] "Sadly backup1012 is unavailable. Depending on how fast we can recover it, and how much we trust its reliability, we may have to use backup" [puppet] - 10https://gerrit.wikimedia.org/r/1193083 (https://phabricator.wikimedia.org/T403946) (owner: 10Jcrespo) [15:37:37] jhancock@cumin1002 reimage (PID 586503) is awaiting input [15:38:59] (03CR) 10Clare Ming: [C:03+2] xLab: Deploying v1.0.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193136 (https://phabricator.wikimedia.org/T396578) (owner: 10Santiago Faci) [15:39:03] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11237432 (10Clement_Goubert) 05Open→03Resolved It looks fine now, RAID is showing up correct ` Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] md0 : ac... 
[15:39:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:40:42] (03Merged) 10jenkins-bot: xLab: Deploying v1.0.6 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193136 (https://phabricator.wikimedia.org/T396578) (owner: 10Santiago Faci) [15:42:09] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2035.codfw.wmnet [15:42:11] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2035.codfw.wmnet [15:42:17] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2035 - https://phabricator.wikimedia.org/T406060#11237461 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 pool for host wikikube-worker2035.codfw.wmnet completed: - wikikube-worker2035.codfw.wmn... [15:43:16] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11237467 (10cmooney) As mentioned on irc the link to E1 looks good, but the one to F1 is not seeing light on either end. 
[15:44:57] (03CR) 10Elukey: "Okok makes sense, even if in my opinion the prometheus use case is a bit different, since it is a collection of different configs, meanwhi" [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [15:46:52] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:46:53] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:47:55] (03PS2) 10DCausse: cirrus: test completion with default sort on simplewiki [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193091 (https://phabricator.wikimedia.org/T404858) [15:47:55] (03PS2) 10DCausse: cirrus: test completion with default sort on simplewiki [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193092 (https://phabricator.wikimedia.org/T404858) [15:47:55] (03PS2) 10DCausse: cirrus: test completion with default sort on simplewiki [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193093 (https://phabricator.wikimedia.org/T404858) [15:48:55] RESOLVED: [2x] SystemdUnitFailed: postgresql@15-main.service on maps1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:46] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:50:47] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:51:20] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:51:21] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:51:23] !log elukey@cumin2002 END 
(PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:51:57] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:52:01] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-ntp (exit_code=0) rolling restart_daemons on A:dnsbox [15:52:35] !log elukey@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2049.codfw.wmnet'] [15:52:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp2049.codfw.wmnet'] [16:00:05] jasmine_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) DC Switchover: Day 8 - Eqiad Repool deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1500). [16:00:05] jhathaway and moritzm: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1600). nyaa~ [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:10:01] (03PS2) 10Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187491 [16:10:58] (03PS1) 10JHathaway: backup1012: reimage [puppet] - 10https://gerrit.wikimedia.org/r/1193137 (https://phabricator.wikimedia.org/T371416) [16:10:59] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243 (10Jdrewniak) 03NEW [16:13:45] (03PS2) 10Jcrespo: backup1012: reimage [puppet] - 10https://gerrit.wikimedia.org/r/1193137 (https://phabricator.wikimedia.org/T371416) (owner: 10JHathaway) [16:14:00] (03PS3) 10Jcrespo: backup1012: reimage [puppet] - 10https://gerrit.wikimedia.org/r/1193137 (https://phabricator.wikimedia.org/T371416) (owner: 10JHathaway) [16:14:10] (03CR) 10Jcrespo: [C:03+1] backup1012: reimage [puppet] - 10https://gerrit.wikimedia.org/r/1193137 (https://phabricator.wikimedia.org/T371416) (owner: 10JHathaway) [16:19:22] (03CR) 10JHathaway: [C:03+2] backup1012: reimage [puppet] - 10https://gerrit.wikimedia.org/r/1193137 (https://phabricator.wikimedia.org/T371416) (owner: 10JHathaway) [16:19:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:20:15] (03CR) 10DCausse: cirrus: test completion with default sort on simplewiki [1/3] (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193091 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [16:24:33] (03PS1) 10Btullis: Correct the name of the webhook service for spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193138 (https://phabricator.wikimedia.org/T405490) [16:24:42] (03PS2) 10Btullis: Correct the name of the webhook service for spark-operator 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1193138 (https://phabricator.wikimedia.org/T405490) [16:24:44] (03CR) 10CI reject: [V:04-1] Correct the name of the webhook service for spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [16:25:54] (03PS3) 10Btullis: Correct the name of the webhook service for spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193138 (https://phabricator.wikimedia.org/T405490) [16:34:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:41:46] for visibility: repooling eqiad for CDN traffic now [16:42:32] (03PS6) 10Cappybaraa: Namespaces.php: Change Portal Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) [16:42:34] !log jasmine@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: Repool Eqiad following DC switchover (T399891), T399891] [16:42:38] T399891: 🚀 Southward Datacenter Switchover (Sept. 
2025) - https://phabricator.wikimedia.org/T399891 [16:42:51] !log jasmine@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: Repool Eqiad following DC switchover (T399891), T399891] [16:44:40] (03PS7) 10Cappybaraa: Namespaces.php: Change Portal Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191861 (https://phabricator.wikimedia.org/T328207) [16:47:31] (03PS1) 10Dzahn: zuul: adjust zookeeper hosts/port in new zuul config [puppet] - 10https://gerrit.wikimedia.org/r/1193141 (https://phabricator.wikimedia.org/T395938) [16:50:05] (03CR) 10CI reject: [V:04-1] zuul: adjust zookeeper hosts/port in new zuul config [puppet] - 10https://gerrit.wikimedia.org/r/1193141 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:51:57] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11237799 (10cmadeo) Apologies for the delay! As @TLessa-WMF's manager I am happy to approve. Thank you and very sorry for holding... [16:52:47] 10SRE-swift-storage, 06Commons: Image not visible to regular users, only visible to admin - https://phabricator.wikimedia.org/T406246#11237802 (10Aklapper) I assume this is about https://commons.wikimedia.org/wiki/File:Things_near_the_Nautical_Museum_of_Litochoro_10.jpg ? For future reference, please fill in t... 
[16:53:02] 10SRE-swift-storage, 06Commons: Image not visible to regular users, only visible to admin - https://phabricator.wikimedia.org/T406246#11237808 (10A_smart_kitten) [16:53:09] 10SRE-swift-storage, 06Commons: HTTP 404 error for image to regular users, only visible to admin - https://phabricator.wikimedia.org/T406246#11237809 (10Aklapper) [16:53:49] 10SRE-swift-storage, 06Commons: HTTP 404 error for image to regular users, only visible to admin - https://phabricator.wikimedia.org/T406246#11237811 (10Aklapper) [16:54:08] 10SRE-swift-storage, 06Commons: [[commons:File:Things near the Nautical Museum of Litochoro 10.jpg]] only present in codfw - https://phabricator.wikimedia.org/T406246#11237813 (10taavi) [16:54:21] 10SRE-swift-storage, 06Commons: [[commons:File:Things near the Nautical Museum of Litochoro 10.jpg]] only present in codfw - https://phabricator.wikimedia.org/T406246#11237816 (10taavi) Seems to be a datacenter-specific issue: `lang=shell-session $ curl -o /dev/null -v --connect-to ::upload-lb.eqiad.wikimedia.... [16:54:25] (03PS1) 10Dzahn: zuul::main: add second zookeeper server to nodepool config (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1193142 [16:55:02] (03CR) 10Dzahn: [C:04-1] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/1193141 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:55:35] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11237824 (10Dzahn) 05Stalled→03In progress [16:55:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11237825 (10Dzahn) [16:56:13] (03CR) 10Dzahn: [C:03+1] "has the manager approval now. 
rebasing needed" [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) (owner: 10Dzahn) [16:56:39] FYI, the "DC Switchover: Day 8" work is ongoing, and is expected to continue for at least another 30 - 40 minutes [16:57:27] (03PS3) 10Dzahn: admin: upgrade tais-lessa from ldap_only to privatedata-users, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) [16:57:57] (03PS1) 10Btullis: Bump the image for spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193143 (https://phabricator.wikimedia.org/T405490) [16:58:05] (03PS4) 10Dzahn: admin: upgrade tais-lessa from ldap_only to privatedata-users, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) [16:58:05] (03CR) 10CI reject: [V:04-1] admin: upgrade tais-lessa from ldap_only to privatedata-users, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) (owner: 10Dzahn) [16:58:31] (03CR) 10Dzahn: [C:03+2] admin: upgrade tais-lessa from ldap_only to privatedata-users, no shell [puppet] - 10https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) (owner: 10Dzahn) [16:58:38] (03CR) 10Btullis: [C:03+2] Correct the name of the webhook service for spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [17:00:05] bd808: #bothumor I � Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1700). 
[17:02:25] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11237852 (10Dzahn) Thank you @cmadeo @TLessa-WMF you have been added to the requested group just now. You should be able to see... [17:03:09] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11237854 (10Dzahn) 05In progress→03Resolved a:05cmadeo→03None [17:03:23] !log jasmine@cumin1003 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: Repool services in Eqiad following DC switchover (T399891) - T399891 [17:03:27] T399891: 🚀 Southward Datacenter Switchover (Sept. 2025) - https://phabricator.wikimedia.org/T399891 [17:05:52] (03PS1) 10Cathal Mooney: cr1-eqiad: add BGP to ssw1-d1-eqiad spine [homer/public] - 10https://gerrit.wikimedia.org/r/1193146 (https://phabricator.wikimedia.org/T402588) [17:06:13] (03Merged) 10jenkins-bot: Correct the name of the webhook service for spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [17:14:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:17:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[17:18:24] FYI, we're seeing an initial bump in 5xx errors for mw-web in eqiad, currently investigating [17:22:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:23:18] ^ error rates are recovering [17:24:54] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406248 (10phaultfinder) 03NEW [17:25:44] !log jasmine@cumin1003 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: Repool services in Eqiad following DC switchover (T399891) - T399891 [17:25:48] T399891: 🚀 Southward Datacenter Switchover (Sept. 2025) - https://phabricator.wikimedia.org/T399891 [17:27:15] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:30:12] (03CR) 10Jforrester: [C:03+1] "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193132 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy) [17:33:32] services have been repooled in eqiad, currently monitoring [17:34:16] jouncebot: nowandnext [17:34:16] For the next 0 hour(s) and 25 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1700) [17:34:17] In 0 hour(s) and 25 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1800) [17:34:37] jasmine_: okay if I deploy a patch now-ish? 
[17:35:03] I can wait, no rush on my side [17:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:38:30] Amir1: how risky is the change? the reason I ask is that we're still monitoring in case there are lingering issues from the repool, and it would be good to avoid mixed signals [17:38:47] it's actually somewhat risky [17:39:01] I wait a bit. Just ping me when you think it's safe to move forward [17:39:13] ah, if you don't mind holding off for now, then, that would be greatly appreciated :) [17:39:30] no worries. I take a break :D [17:40:35] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:40:57] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:43:15] !log musikanimal@deploy2002 mwscript-k8s job started: extensions/CommunityRequests/maintenance/migrateFromGadget.php --wiki=metawiki --status-csv=wishes-status-migration.csv --wishes [17:44:20] !log jhancock@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:44:52] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:21] !log jhancock@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:52:45] (03CR) 10Herron: [V:03+1] "That would be great thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T1800) [18:02:33] (03CR) 10Ebernhardson: [C:03+1] cirrus: test completion with default sort on simplewiki [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193091 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [18:02:55] (03CR) 10Ebernhardson: [C:03+1] cirrus: test completion with default sort on simplewiki [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193092 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [18:04:16] o/ [18:04:21] nothing for this window. [18:13:11] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [18:15:20] I am getting: [18:15:24] Timeout when loading plugins: wm-pcc,wm-schedule-deployment,wm-zuul-status,wm-motd [18:15:27] on gerrit [18:15:27] hmm [18:15:59] back [18:19:01] !incidents [18:19:02] 6821 (RESOLVED) Manual (paged) by LSobanski (lsobanski@wikimedia.org): Yet another test page [18:19:02] 6820 (RESOLVED) wmf - metamonitoring - thanos - notified - vip is now DOWN [18:22:20] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Prefer IPv4 for backend nodes [puppet] - 10https://gerrit.wikimedia.org/r/1193164 (https://phabricator.wikimedia.org/T405078) [18:23:26] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7183/co" [puppet] - 10https://gerrit.wikimedia.org/r/1193164 (https://phabricator.wikimedia.org/T405078) (owner: 10Majavah) [18:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:27:14] !log musikanimal@deploy2002 mwscript-k8s job started: extensions/CommunityRequests/maintenance/migrateFromGadget.php --wiki=metawiki --status-csv=wishes-status-migration.csv --wishes [18:32:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:33:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:33:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:34:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193108 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup) [18:34:59] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406248#11238232 (10phaultfinder) [18:35:20] (03Merged) 10jenkins-bot: db-production: Enable shuffle sharding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193108 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup) [18:35:42] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1193108|db-production: Enable shuffle sharding (T405087)]] [18:35:45] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [18:37:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:38:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:39:26] federico3: topranks: Dear oncallers, Heads up that I'm enabling shuffle sharding in our db load balancing now (T405087). 
It should be noop but it might actually bring down everything if we have really skewed IPs or bugs in the code [18:39:41] ack, thanks [18:41:58] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1193108|db-production: Enable shuffle sharding (T405087)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:42:01] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [18:50:04] (03PS1) 10Arlolra: Deploy Parsoid Read Views to 26 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193179 (https://phabricator.wikimedia.org/T406250) [18:51:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193179 (https://phabricator.wikimedia.org/T406250) (owner: 10Arlolra) [18:52:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11238347 (10BCornwall) @cmooney Sounds good! Should I be scheduling with you or @VRiley-WMF? 
[18:53:38] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [18:54:35] * swfrench-wmf is excited to see how this goes [18:54:58] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406248#11238363 (10phaultfinder) [18:58:14] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193108|db-production: Enable shuffle sharding (T405087)]] (duration: 22m 32s) [18:58:17] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [19:02:48] (03PS1) 10JHathaway: wikimedia.support: initial mx support [puppet] - 10https://gerrit.wikimedia.org/r/1193183 (https://phabricator.wikimedia.org/T400952) [19:05:08] (03CR) 10JHathaway: wikimedia.support: Rm ncredir, add zendesk records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [19:08:01] !log ladsgroup@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [19:08:24] (03CR) 10Ebernhardson: [C:03+1] cirrus: stop copying ores weighted_tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193052 (https://phabricator.wikimedia.org/T389053) (owner: 10DCausse) [19:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:11:37] !log ladsgroup@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [19:13:10] FIRING: [3x] BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:13:26] I'd like 
to deploy an eventstream config patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1192861) before the window, which I'll do in about 15 minutes, or however long it takes to drink my coffee, unless I hear otherwise :) [19:14:00] !log musikanimal@deploy2002 mwscript-k8s job started: extensions/CommunityRequests/maintenance/migrateFromGadget.php --wiki=metawiki --status-csv=wishes-status-migration.csv --wishes [19:14:47] (03CR) 10Ssingh: wikimedia.support: Rm ncredir, add zendesk records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [19:15:18] (03CR) 10Ssingh: wikimedia.support: Rm ncredir, add zendesk records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [19:18:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11238425 (10cmooney) >>! In T404959#11238347, @BCornwall wrote: > @cmooney Sounds good! Should I be scheduling with you or @VRiley-WMF?... 
[19:19:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s2 in eqiad (T405087)', diff saved to https://phabricator.wikimedia.org/P83576 and previous config saved to /var/cache/conftool/dbconfig/20251002-191918-ladsgroup.json [19:19:22] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [19:27:26] (03PS1) 10Kosta Harlan: Implement AuthPreserveQueryParams for Metrics Platform mpo param [extensions/MetricsPlatform] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193188 (https://phabricator.wikimedia.org/T404622) [19:27:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s2 in codfw (T405087)', diff saved to https://phabricator.wikimedia.org/P83577 and previous config saved to /var/cache/conftool/dbconfig/20251002-192726-ladsgroup.json [19:27:31] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [19:27:48] TheresNoTime: let me know when you're done please, as I'd also like to get started before the window if possible [19:29:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s5 and s6 in codfw (T405087)', diff saved to https://phabricator.wikimedia.org/P83578 and previous config saved to /var/cache/conftool/dbconfig/20251002-192928-ladsgroup.json [19:32:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s5 and s6 in eqiad (T405087)', diff saved to https://phabricator.wikimedia.org/P83579 and previous config saved to /var/cache/conftool/dbconfig/20251002-193217-ladsgroup.json [19:37:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [19:37:17] kostajh: will do, starting now [19:38:02] (03Merged) 10jenkins-bot: EventStreamConfig and 
stream registration for watchlist click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192861 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [19:38:24] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1192861|EventStreamConfig and stream registration for watchlist click tracking (T401575)]] [19:38:27] T401575: WE1.4.3: Instrument watchlist - https://phabricator.wikimedia.org/T401575 [19:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:40:18] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193117 (https://phabricator.wikimedia.org/T403207) (owner: 10Stevemunene) [19:42:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:43:03] (03PS1) 10Stevemunene: Add test namespace to ceph tenantNamepsaces dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193190 (https://phabricator.wikimedia.org/T396478) [19:43:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:43:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:44:28] TheresNoTime: nevermind, I'll do my patch on Monday -- sorry for the 
confusion [19:44:28] !log samtar@deploy2002 samtar: Backport for [[gerrit:1192861|EventStreamConfig and stream registration for watchlist click tracking (T401575)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:44:31] T401575: WE1.4.3: Instrument watchlist - https://phabricator.wikimedia.org/T401575 [19:44:37] kostajh: no worries! [19:44:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193188 (https://phabricator.wikimedia.org/T404622) (owner: 10Kosta Harlan) [19:44:51] !log samtar@deploy2002 samtar: Continuing with sync [19:44:53] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406248#11238501 (10phaultfinder) [19:45:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193096 (https://phabricator.wikimedia.org/T405239) (owner: 10Kosta Harlan) [19:49:10] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192861|EventStreamConfig and stream registration for watchlist click tracking (T401575)]] (duration: 10m 46s) [19:54:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s7 in codfw (T405087)', diff saved to https://phabricator.wikimedia.org/P83580 and previous config saved to /var/cache/conftool/dbconfig/20251002-195426-ladsgroup.json [19:54:30] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [19:59:11] !log musikanimal@deploy2002 mwscript-k8s job started: 
extensions/CommunityRequests/maintenance/migrateFromGadget.php --wiki=metawiki --status-csv=wishes-status-migration.csv --wishes [20:00:04] thcipriani, RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T2000). [20:00:05] danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:32] danisztls: are you self-service deploying? [20:01:29] o/ [20:01:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s7 in eqiad (T405087)', diff saved to https://phabricator.wikimedia.org/P83581 and previous config saved to /var/cache/conftool/dbconfig/20251002-200143-ladsgroup.json [20:01:47] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [20:02:07] I'm doing some training if anyone wants to _not_ self deploy today :) [20:02:11] * thcipriani asks hopefully [20:02:38] thcipriani: I've a patch you can deploy if you need one - simple config patch [20:02:53] Reedy: that'd be great if you don't mind [20:03:00] o/ [20:03:04] TheresNoTime: yes [20:03:09] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1193132 [20:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:03:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s8 in eqiad (T405087)', diff saved to https://phabricator.wikimedia.org/P83582 and previous config saved to 
/var/cache/conftool/dbconfig/20251002-200354-ladsgroup.json [20:04:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193110 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:05:00] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [20:05:06] (03Merged) 10jenkins-bot: Deploy reader foundational survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193110 (https://phabricator.wikimedia.org/T405410) (owner: 10DDesouza) [20:05:29] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1193110|Deploy reader foundational survey on enwiki (T405410)]] [20:05:31] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:05:59] Reedy: perfect, thanks, I'll get you once TheresNoTime and danisztls are clear [20:06:10] howdy birdcup [20:06:16] howdy! [20:06:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s8 in codfw (T405087)', diff saved to https://phabricator.wikimedia.org/P83583 and previous config saved to /var/cache/conftool/dbconfig/20251002-200621-ladsgroup.json [20:06:23] oh, just danisztls I guess (thought there was another patch here) [20:06:46] (got mine done earlier :)) [20:07:42] i'm adding one, can deploy it myself once ya'll are done [20:07:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187108 (https://phabricator.wikimedia.org/T390858) (owner: 10Ebernhardson) [20:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:09:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Harmonize weights in s8 in eqiad', diff saved to https://phabricator.wikimedia.org/P83584 and previous config saved to /var/cache/conftool/dbconfig/20251002-200948-ladsgroup.json [20:09:55] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [20:09:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193132 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy) [20:11:11] ^ Reedy added yours to the schedule [20:11:58] !log dani@deploy2002 dani: Backport for [[gerrit:1193110|Deploy reader foundational survey on enwiki (T405410)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:12:01] T405410: Phase III Short Survey - https://phabricator.wikimedia.org/T405410 [20:12:28] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [20:12:34] !log dani@deploy2002 dani: Continuing with sync [20:14:36] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1193183 (https://phabricator.wikimedia.org/T400952) (owner: 10JHathaway) [20:15:01] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [20:15:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s4 and s1 in eqiad (T405087)', diff saved to https://phabricator.wikimedia.org/P83585 and previous config saved to /var/cache/conftool/dbconfig/20251002-201532-ladsgroup.json [20:15:37] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [20:16:11] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Harmonize weights in s1 in eqiad', diff saved to https://phabricator.wikimedia.org/P83586 and previous config saved to /var/cache/conftool/dbconfig/20251002-201611-ladsgroup.json [20:16:58] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193110|Deploy reader foundational survey on enwiki (T405410)]] (duration: 11m 29s) [20:17:26] (03CR) 10BCornwall: [V:03+1 C:03+1] wikimedia.support: initial mx support [puppet] - 10https://gerrit.wikimedia.org/r/1193183 (https://phabricator.wikimedia.org/T400952) (owner: 10JHathaway) [20:18:49] (03CR) 10Tacsipacsi: Add a banner for a Gerrit switch over maintenance (031 comment) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1193017 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [20:19:27] looks like last patch deploy 
is complete, moving on to next [20:19:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187108 (https://phabricator.wikimedia.org/T390858) (owner: 10Ebernhardson) [20:20:45] (03Merged) 10jenkins-bot: cirrus: Start AB test of did-you-mean profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1187108 (https://phabricator.wikimedia.org/T390858) (owner: 10Ebernhardson) [20:21:08] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1187108|cirrus: Start AB test of did-you-mean profiles (T390858)]] [20:21:11] T390858: Improve CirrusSearch DYM suggestions using the phrase suggester on more content - https://phabricator.wikimedia.org/T390858 [20:23:19] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [20:23:22] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [20:23:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:25:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Abolish api group from s4 and s1 in codfw (T405087)', diff saved to https://phabricator.wikimedia.org/P83587 and previous config saved to /var/cache/conftool/dbconfig/20251002-202536-ladsgroup.json [20:25:41] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [20:25:49] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1187108|cirrus: Start AB test of did-you-mean profiles (T390858)]] synced 
to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:26:15] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [20:29:29] (03PS1) 10Bking: Add Gabriele Modena (gmodena) to wdqs-roots, wdqs-admins groups [puppet] - 10https://gerrit.wikimedia.org/r/1193211 (https://phabricator.wikimedia.org/T404161) [20:30:12] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [20:30:37] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1187108|cirrus: Start AB test of did-you-mean profiles (T390858)]] (duration: 09m 29s) [20:30:37] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [20:30:40] T390858: Improve CirrusSearch DYM suggestions using the phrase suggester on more content - https://phabricator.wikimedia.org/T390858 [20:31:07] alright, Reedy you're up! 
[20:32:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebomani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193132 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy) [20:33:06] (03CR) 10Ahmon Dancy: Add traindev-staging environment for mw-web and mw-debug (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) (owner: 10Ahmon Dancy) [20:33:33] (03Merged) 10jenkins-bot: CommonSettings.php: Replace usage of $wgCaptchaWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193132 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy) [20:33:53] !log ebomani@deploy2002 Started scap sync-world: Backport for [[gerrit:1193132|CommonSettings.php: Replace usage of $wgCaptchaWhitelist (T277936)]] [20:33:56] T277936: Address Voice and Tone issues in ConfirmEdit - https://phabricator.wikimedia.org/T277936 [20:34:34] (03PS4) 10Ahmon Dancy: Add traindev-staging environment for mw-web and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1187855 (https://phabricator.wikimedia.org/T402350) [20:40:20] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [20:40:32] !log ebomani@deploy2002 reedy, ebomani: Backport for [[gerrit:1193132|CommonSettings.php: Replace usage of $wgCaptchaWhitelist (T277936)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:40:35] T277936: Address Voice and Tone issues in ConfirmEdit - https://phabricator.wikimedia.org/T277936 [20:41:14] Reedy: [20:41:26] changes live on test server :) [20:41:44] You can just deploy it, I don't need to test it ;) [20:42:15] alright, on it! 
[20:42:31] !log ebomani@deploy2002 reedy, ebomani: Continuing with sync [20:45:31] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [20:46:01] (03PS1) 10Arnaudb: gerrit: bump parallel connection threshold [puppet] - 10https://gerrit.wikimedia.org/r/1193212 [20:46:01] (03CR) 10Arnaudb: [C:03+2] "revert to previous stable state" [puppet] - 10https://gerrit.wikimedia.org/r/1193212 (owner: 10Arnaudb) [20:46:34] (03PS1) 10Samtar: ext.wikimediaEvents.WatchlistBaseline: Add watchlist baseline metrics [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193213 (https://phabricator.wikimedia.org/T401575) [20:47:11] !log ebomani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193132|CommonSettings.php: Replace usage of $wgCaptchaWhitelist (T277936)]] (duration: 13m 17s) [20:47:14] T277936: Address Voice and Tone issues in ConfirmEdit - https://phabricator.wikimedia.org/T277936 [20:48:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T404959#11238731 (10BCornwall) Sure, no problem. Have at it. LMK if you need any help. [20:49:06] Reedy: all live, thanks for letting us deploy <3 [20:49:40] np, thanks for doing it! [20:53:41] I'm going to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1193213 to `1.45.0-wmf.21` in a moment unless I hear otherwise :) [20:54:03] TheresNoTime: go for it. 
:) [20:58:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193213 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [20:58:35] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T2100) [21:01:42] ouch the zuul queue is a bit big.. [21:03:00] (03PS1) 10JHathaway: backup1012: add to legacy slugs [cookbooks] - 10https://gerrit.wikimedia.org/r/1193229 [21:03:13] ...yes [21:03:37] ah, security release time [21:03:41] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:03:42] :D [21:04:03] yeahhhhhh [21:04:08] (03Merged) 10jenkins-bot: ext.wikimediaEvents.WatchlistBaseline: Add watchlist baseline metrics [extensions/WikimediaEvents] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193213 (https://phabricator.wikimedia.org/T401575) (owner: 10Samtar) [21:04:31] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1193213|ext.wikimediaEvents.WatchlistBaseline: Add watchlist baseline metrics (T401575)]] [21:04:34] T401575: WE1.4.3: Instrument watchlist - https://phabricator.wikimedia.org/T401575 [21:04:51] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:05:15] (03PS1) 10Dzahn: zuul: adjust zookeeper hosts/port in new zuul config [puppet] - 10https://gerrit.wikimedia.org/r/1193141 (https://phabricator.wikimedia.org/T395938) [21:05:35] (03PS2) 10Dzahn: zuul: adjust zookeeper hosts/port in new zuul config [puppet] - 10https://gerrit.wikimedia.org/r/1193141 (https://phabricator.wikimedia.org/T395938) 
[21:08:23] (03CR) 10CI reject: [V:04-1] zuul: adjust zookeeper hosts/port in new zuul config [puppet] - 10https://gerrit.wikimedia.org/r/1193141 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [21:08:28] !log samtar@deploy2002 samtar: Backport for [[gerrit:1193213|ext.wikimediaEvents.WatchlistBaseline: Add watchlist baseline metrics (T401575)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:08:53] * TheresNoTime testing ^ [21:10:07] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:12:44] !log samtar@deploy2002 samtar: Continuing with sync [21:17:07] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193213|ext.wikimediaEvents.WatchlistBaseline: Add watchlist baseline metrics (T401575)]] (duration: 12m 35s) [21:17:10] T401575: WE1.4.3: Instrument watchlist - https://phabricator.wikimedia.org/T401575 [21:24:55] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406272 (10phaultfinder) 03NEW [21:24:56] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406248#11238794 (10phaultfinder) [21:24:57] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406273 (10phaultfinder) 03NEW [21:25:00] (03PS2) 10JHathaway: backup1012: add to legacy slugs [cookbooks] - 10https://gerrit.wikimedia.org/r/1193229 [21:25:56] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:25:57] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART 
[21:26:40] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:27:03] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [21:27:26] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [21:30:11] (03PS3) 10Dzahn: zuul: adjust zookeeper hosts/port in new zuul config [puppet] - 10https://gerrit.wikimedia.org/r/1193141 (https://phabricator.wikimedia.org/T395938) [21:30:27] jouncebot: nowandnext [21:30:27] For the next 0 hour(s) and 29 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251002T2100) [21:30:28] In 8 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251003T0600) [21:30:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11238805 (10Scott_French) conf1009 is (1) a member of eqiad main-etcd cluster, so clients will attempt to issue writes to it, (2) the upstream source for etcd-mirror replication... 
[21:31:46] (03CR) 10Zabe: [C:03+2] Stop setting CategoryLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192867 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [21:35:26] let's see when zuul has capacity for the config patch [21:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:37:34] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [21:38:32] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.013e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [21:40:28] (03Merged) 10jenkins-bot: Stop setting CategoryLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192867 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [21:41:03] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1192867|Stop setting CategoryLinksSchemaMigrationStage (T299951)]] [21:41:06] T299951: Normalize categorylinks table - https://phabricator.wikimedia.org/T299951 [21:44:36] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406166#11238814 (10MoritzMuehlenhoff) Looks good, please use any of row/group B, C or D. 
[21:44:52] FIRING: [27x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:45:07] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: codfw: 2 VM request for hCaptcha - https://phabricator.wikimedia.org/T406167#11238819 (10MoritzMuehlenhoff) Looks good, please use any of row/group B, C or D. [21:46:54] !log zabe@deploy2002 zabe: Backport for [[gerrit:1192867|Stop setting CategoryLinksSchemaMigrationStage (T299951)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:46:57] T299951: Normalize categorylinks table - https://phabricator.wikimedia.org/T299951 [21:47:35] !log zabe@deploy2002 zabe: Continuing with sync [21:53:40] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1192867|Stop setting CategoryLinksSchemaMigrationStage (T299951)]] (duration: 12m 37s) [21:53:43] T299951: Normalize categorylinks table - https://phabricator.wikimedia.org/T299951 [21:53:47] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [21:54:00] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11238854 (10VRiley-WMF) Finished up ssw1-d8-eqiad and has been connected aside from ssw1-e1-eqiad ,ssw1-f1-eqiad [21:59:54] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406273#11238877 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF rebalanced power [22:03:30] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406272#11238897 (10VRiley-WMF) Rebalanced the power [22:03:38] 10ops-eqiad, 06SRE, 06DC-Ops: 
Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406272#11238898 (10VRiley-WMF) 05Open→03Resolved [22:04:23] (03PS3) 10BCornwall: wikimedia.support: Rm ncredir, add zendesk records [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) [22:04:50] (03CR) 10BCornwall: wikimedia.support: Rm ncredir, add zendesk records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1192236 (https://phabricator.wikimedia.org/T400952) (owner: 10BCornwall) [22:13:11] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:14:52] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [22:15:06] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1012.eqiad.wmnet with OS bookworm [22:22:30] (03CR) 10C. 
Scott Ananian: [C:03+1] Deploy Parsoid Read Views to 26 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193179 (https://phabricator.wikimedia.org/T406250) (owner: 10Arlolra) [22:24:52] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:39:53] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406248#11238995 (10phaultfinder) [22:46:07] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11239002 (10wiki_willy) Hi @VRiley-WMF - the access to create RMA cases should be resolved now per Juniper, so hopefully it unblocks you on this one. Thanks, Willy [22:46:46] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host backup1012.eqiad.wmnet with OS bookworm [22:48:33] (03PS1) 10Samwilson: Fetch wikitext from the translation lang subpage, not the baselang [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193259 [22:51:19] (03PS2) 10MusikAnimal: Fetch wikitext from the translation lang subpage, not the baselang [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193259 (owner: 10Samwilson) [23:00:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samwilson@deploy2002 using scap backport" [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193259 (owner: 10Samwilson) [23:08:15] (03Merged) 10jenkins-bot: Fetch wikitext from the translation lang subpage, not the baselang [extensions/CommunityRequests] (wmf/1.45.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1193259 (owner: 10Samwilson) [23:08:42] !log samwilson@deploy2002 Started scap sync-world: Backport for [[gerrit:1193259|Fetch wikitext 
from the translation lang subpage, not the baselang]] [23:10:55] !log samwilson@deploy2002 samwilson: Backport for [[gerrit:1193259|Fetch wikitext from the translation lang subpage, not the baselang]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:20:15] !log samwilson@deploy2002 samwilson: Continuing with sync [23:22:42] (03PS1) 10Jasmine: wmnet: remove wikikube-ctrl1001 from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1193266 (https://phabricator.wikimedia.org/T383227) [23:24:49] !log samwilson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193259|Fetch wikitext from the translation lang subpage, not the baselang]] (duration: 16m 07s) [23:35:28] jhathaway@cumin1002 reimage (PID 1047725) is awaiting input [23:38:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1193270 [23:38:47] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1193270 (owner: 10TrainBranchBot) [23:39:52] FIRING: [7x] CertAlmostExpired: Certificate for service lsw1-e5-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:55:54] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1193270 (owner: 10TrainBranchBot) [23:58:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable