[00:03:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:05] PROBLEM - Check unit status of clean-stale-certs on acmechief2002 is CRITICAL: CRITICAL: Status of the systemd unit clean-stale-certs https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:08:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1183277 [00:08:22] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1183277 (owner: 10TrainBranchBot) [00:08:59] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11134722 (10phaultfinder) [00:13:51] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11134735 (10phaultfinder) [00:22:52] (03PS1) 10Hamish: Lift permission for event-organizer in Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183278 [00:24:26] (03Abandoned) 10Hamish: Lift permission for event-organizer in Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183278 (owner: 10Hamish) [00:24:44] (03PS1) 10Hamish: Lift permission for event-organizer in Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183279 (https://phabricator.wikimedia.org/T403350) [00:28:06] (03PS16) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [00:31:15] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1183277 (owner: 10TrainBranchBot) [00:32:12] (03CR) 10Tim 
Starling: [C:03+1] varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [00:33:44] (03PS1) 10Pppery: Remove fallback for Asturian language [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1183280 (https://phabricator.wikimedia.org/T292750) [00:53:04] (03PS17) 10Krinkle: varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595) [00:53:04] (03PS3) 10Krinkle: varnish: Remove 60s cap for mobileaction/useformat on m-dot [puppet] - 10https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595) [00:53:04] (03PS17) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [00:55:56] (03PS18) 10Krinkle: varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595) [00:55:57] (03PS4) 10Krinkle: varnish: Remove 60s cap for mobileaction/useformat on m-dot [puppet] - 10https://gerrit.wikimedia.org/r/1183212 (https://phabricator.wikimedia.org/T401595) [00:55:57] (03PS18) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [01:19:36] FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [01:25:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core: asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:27:47] (03PS19) 10Krinkle: varnish: Implement new direct routing for mobile views [puppet] - 10https://gerrit.wikimedia.org/r/1180577 (https://phabricator.wikimedia.org/T401595) [01:29:16] (03PS1) 10Krinkle: Enable wmgUseMdotRouting in Beta Cluster for remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183281 (https://phabricator.wikimedia.org/T401595) [01:29:36] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [01:29:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:32:56] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:32:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183281 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [01:33:47] (03Merged) 10jenkins-bot: Enable wmgUseMdotRouting in Beta Cluster for remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183281 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [01:36:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:50:56] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking), 07User-notice: RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11134789 (10Krinkle) [02:30:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183279 (https://phabricator.wikimedia.org/T403350) (owner: 10Hamish) [02:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [02:53:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:40:48] FIRING: PuppetZeroResources: Puppet has 
failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:45:20] 10ops-codfw, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403356 (10phaultfinder) 03NEW [03:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [04:03:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:13:56] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11134863 (10phaultfinder) [04:19:00] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11134864 (10phaultfinder) [05:01:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:01] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit 
httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:19:36] FIRING: [2x] NetworkDeviceAlarmActive: Alarm active on cr1-esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [05:25:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/0 (Core: asw1-bw27-esams:et-0/0/48 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:29:11] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:29:36] FIRING: [4x] SwitchCoreInterfaceDown: Switch core interface down - asw1-bw27-esams:et-0/0/48 (Core: cr1-esams:et-1/0/0 {#30367}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:29:54] FIRING: [8x] CoreBGPDown: Core BGP session down between asw1-bw27-esams and cr1-esams (185.15.59.156) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:33:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:59:51] (03CR) 10Stang: [C:03+1] Lift permission for event-organizer in Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183279 (https://phabricator.wikimedia.org/T403350) (owner: 10Hamish) [06:29:11] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:32:36] jouncebot: nowandnext [06:32:36] For the next 0 hour(s) and 27 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250831T0700) [06:32:37] In 0 hour(s) and 27 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T0700) [06:32:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11134910 (10MoritzMuehlenhoff) [06:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [06:33:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183109 (https://phabricator.wikimedia.org/T403263) (owner: 10Kosta Harlan) [06:34:22] (03Merged) 10jenkins-bot: hCaptcha: Disable hCaptcha for API contexts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183109 (https://phabricator.wikimedia.org/T403263) (owner: 
10Kosta Harlan) [06:34:37] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1183109|hCaptcha: Disable hCaptcha for API contexts (T403263)]] [06:34:40] T403263: hCaptcha: Do not enable on API account creations - https://phabricator.wikimedia.org/T403263 [06:35:36] (03PS1) 10Muehlenhoff: Remove bast3007 as bastion node [puppet] - 10https://gerrit.wikimedia.org/r/1183453 (https://phabricator.wikimedia.org/T402259) [06:36:28] (03PS1) 10DCausse: SECURITY: declare PoolCounter settings for cirrusbuilddoc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183454 (https://phabricator.wikimedia.org/T401220) [06:39:04] (03CR) 10Muehlenhoff: [C:03+2] Remove bast3007 as bastion node [puppet] - 10https://gerrit.wikimedia.org/r/1183453 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [06:43:03] (03PS1) 10Muehlenhoff: Remove access for mszabo [puppet] - 10https://gerrit.wikimedia.org/r/1183462 [06:43:46] (03CR) 10CI reject: [V:04-1] Remove access for mszabo [puppet] - 10https://gerrit.wikimedia.org/r/1183462 (owner: 10Muehlenhoff) [06:46:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:49:00] (03PS2) 10Muehlenhoff: Remove access for mszabo [puppet] - 10https://gerrit.wikimedia.org/r/1183462 [06:51:02] (03CR) 10Muehlenhoff: [C:03+2] Remove access for mszabo [puppet] - 10https://gerrit.wikimedia.org/r/1183462 (owner: 10Muehlenhoff) [06:51:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:53:32] FIRING: [2x] ProbeDown: Service 
wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:55:11] !log restarting blazegraph on wdqs1011 (stuck) [06:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:55] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Máté Szabó out of all services on: 2410 hosts [06:56:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:58:08] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast3007.wikimedia.org [06:58:17] RESOLVED: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:00:04] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T0700). [07:00:05] hueitan, Msz2001, kostajh, Hamishcz, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] o/ [07:00:16] o/ [07:00:21] hola [07:00:23] hi, i'm nearly done syncing a change [07:00:24] I'll be deploying hueitan's changes. [07:00:28] thank you [07:00:29] k8s seems to be moving very slowly today [07:00:31] o/ [07:00:43] kostajh: it is Monday! 
[07:01:11] k8s doesnt want to work this weeeeek
[07:01:13] lol
[07:01:55] I asked in -serviceops as well, because opening `shell.php` with `mwscript-k8s` is also very slow
[07:01:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:02:39] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1183109|hCaptcha: Disable hCaptcha for API contexts (T403263)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:02:42] T403263: hCaptcha: Do not enable on API account creations - https://phabricator.wikimedia.org/T403263
[07:02:49] so, my guess would be that we could start deployments in ~15 minutes, but hard to say given that sync-testservers-k8s, which is usually really fast, took 7 minutes
[07:02:50] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:03:02] kostajh: Ping me once config deployment is done.
[07:03:07] yep
[07:03:52] kostajh: i'm driving w/ my laptop so pls ping me if my response is required
[07:03:56] thx
[07:04:24] !log kharlan@deploy1003 kharlan: Continuing with sync
[07:04:36] bc i have to pull over then see whats going on :|
[07:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:07:18] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3007.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:07:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3007.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:07:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:07:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast3007.wikimedia.org
[07:08:11] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11134958 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast3007.wikimedia.org` - bast3007.wikimedia.org (**PASS**)...
[07:09:05] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [07:09:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3005.esams.wmnet [07:10:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11134959 (10ops-monitoring-bot) Draining ganeti3005.esams.wmnet of running VMs [07:10:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3005.esams.wmnet [07:11:10] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [07:14:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:14:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:16:10] at 80% now [07:17:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir3003.esams.wmnet to plain [07:17:26] (03CR) 10Ayounsi: [C:03+2] Remove esams RIPE Atlas measurements [puppet] - 10https://gerrit.wikimedia.org/r/1180085 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [07:17:46] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11134964 (10ops-monitoring-bot) VM ncredir3003.esams.wmnet switching disk type to plain [07:17:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir3003.esams.wmnet to plain [07:17:48] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183109|hCaptcha: Disable hCaptcha for API contexts (T403263)]] (duration: 43m 11s) [07:17:51] T403263: hCaptcha: Do not enable on API account creations - https://phabricator.wikimedia.org/T403263 [07:17:58] (03PS2) 10Ayounsi: Remove esams RIPE Atlas measurements [puppet] - 
10https://gerrit.wikimedia.org/r/1180085 (https://phabricator.wikimedia.org/T402259) [07:18:14] kostajh: can I go ahead? :) [07:18:33] kart_: yes [07:18:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182861 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [07:18:59] hueitan: Starting with the first patch.. [07:19:07] (y) [07:19:15] (y) [07:19:33] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:20:00] oh, I forgot. CI. [07:20:07] (03CR) 10Ayounsi: [C:03+2] Remove esams RIPE Atlas measurements [puppet] - 10https://gerrit.wikimedia.org/r/1180085 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [07:20:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum3003.esams.wmnet to plain [07:20:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:20:40] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11134967 (10MoritzMuehlenhoff) [07:21:35] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11134969 (10ops-monitoring-bot) VM durum3003.esams.wmnet switching disk type to plain [07:21:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum3003.esams.wmnet to plain [07:22:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1180083 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [07:23:45] PROBLEM - Bird Internet Routing Daemon on durum3003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:23:56] jouncebot: next [07:23:57] In 2 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1000) [07:24:09] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:25:08] kostajh: everything good now? [07:25:45] RECOVERY - Bird Internet Routing Daemon on durum3003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:25:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh3003.wikimedia.org to plain [07:26:09] Hamishcz: kart_ is deploying [07:26:09] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:26:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11134979 (10ops-monitoring-bot) VM doh3003.wikimedia.org switching disk type to plain [07:26:20] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [07:26:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh3003.wikimedia.org to plain [07:26:40] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [07:26:44] okayyy [07:28:31] PROBLEM - Bird Internet Routing Daemon on doh3003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:29:09] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:29:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:29:49] (03Merged) 10jenkins-bot: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182861 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [07:30:09] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1182861|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] [07:30:09] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:30:12] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [07:30:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:30:29] RECOVERY - Bird Internet Routing Daemon on doh3003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [07:31:03] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11134990 (10MoritzMuehlenhoff) [07:31:36] (03CR) 10Slyngshede: [C:03+2] P:puppetserver::volatile generate datacenter database (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:31:38] !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts atlas3001.wikimedia.org [07:31:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11134995 (10MoritzMuehlenhoff) [07:31:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:32:17] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts atlas3001.wikimedia.org [07:34:01] (03PS1) 10Ayounsi: Remove atlas3001 from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1183536 (https://phabricator.wikimedia.org/T402259) [07:34:34] (03PS1) 10Muehlenhoff: Remove ganeti3005 from esams01 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1183544 (https://phabricator.wikimedia.org/T402259) [07:35:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1183536 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [07:35:07] (03CR) 10Ayounsi: [C:03+2] Remove atlas3001 from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1183536 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [07:35:32] (03CR) 10Vgutierrez: [C:03+1] profile:cache: remove varnishkafka (webrequest) from cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [07:35:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183544 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [07:35:49] !log kartik@deploy1003 kartik, hueitan: Backport for [[gerrit:1182861|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:36:00] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [07:36:13] (03CR) 10Fabfur: [C:03+2] profile:cache: remove varnishkafka (webrequest) from cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1183081 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [07:36:22] !log ayounsi@cumin1003 START - Cookbook sre.hosts.decommission for hosts atlas3001.wikimedia.org [07:37:49] (03PS1) 10Slyngshede: P:puppetserver::volatile fix group name [puppet] - 10https://gerrit.wikimedia.org/r/1183599 (https://phabricator.wikimedia.org/T398161) [07:38:51] (03CR) 10Ayounsi: [C:03+1] "Thx I think it's because we use the "set" output format for Nokia, which makes longer lines." [puppet] - 10https://gerrit.wikimedia.org/r/1183140 (owner: 10Muehlenhoff) [07:39:36] FIRING: ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:12] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 12:00:00 on people1005.eqiad.wmnet with reason: WIP T402953#11120672 [07:40:15] T402953: SystemdUnitFailed - envoyproxy on people1005 - https://phabricator.wikimedia.org/T402953 [07:40:19] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [07:40:22] (03CR) 10Vgutierrez: [C:03+1] P:puppetserver::volatile fix group name [puppet] - 10https://gerrit.wikimedia.org/r/1183599 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:40:33] (03CR) 10Fabfur: [C:03+1] P:puppetserver::volatile fix group name [puppet] - 10https://gerrit.wikimedia.org/r/1183599 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:40:45] (03CR) 10Slyngshede: [C:03+2] P:puppetserver::volatile fix group name [puppet] - 
10https://gerrit.wikimedia.org/r/1183599 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [07:41:14] (03CR) 10Ayounsi: [C:03+2] Add esams routed ganeti VM ranges to network/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1180083 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [07:43:31] (03CR) 10Ayounsi: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1183544 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [07:43:40] FIRING: [2x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:44:29] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas3001.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003" [07:44:33] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas3001.wikimedia.org decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1003" [07:44:33] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:44:34] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts atlas3001.wikimedia.org [07:44:50] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135031 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1003 for hosts: `atlas3001.wikimedia.org` - atlas3001.wikimedia.org (**WA... 
[07:47:40] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti3005 from esams01 cluster [puppet] - 10https://gerrit.wikimedia.org/r/1183544 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [07:48:28] Sorry, testing is taking longer time.. [07:49:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:49:31] PROBLEM - ganeti-confd running on ganeti3005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [07:49:31] PROBLEM - ganeti-noded running on ganeti3005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [07:49:36] FIRING: [3x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:51:40] (03CR) 10Muehlenhoff: [C:03+2] Line-wrap Homer diffs [puppet] - 10https://gerrit.wikimedia.org/r/1183140 (owner: 10Muehlenhoff) [07:51:56] !log kartik@deploy1003 kartik, hueitan: Continuing with sync [07:54:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:54:21] (03PS1) 10Elukey: role::maps: increase max-conns and shared buffers on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) [07:55:04] (03PS1) 10Huei Tan: Setup tracking for CentralNotice banners 
experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) [07:55:45] (03CR) 10KartikMistry: [C:03+2] Update HomepageVisit schema to 1.6.1 [extensions/GrowthExperiments] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182862 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [07:55:53] (03PS1) 10Brouberol: postgresql-airflow-main: increase max CPU and disk space [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183611 [07:57:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135039 (10MoritzMuehlenhoff) [07:57:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [07:59:39] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182861|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] (duration: 29m 29s) [07:59:42] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [08:00:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182862 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [08:01:09] (03CR) 10Muehlenhoff: role::maps: increase max-conns and shared buffers on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:02:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [08:02:17] (03PS2) 10Elukey: role::maps: increase max-conns and shared buffers on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) [08:02:31] kart_: can we move forward soon? or I reschedule to next backport window [08:02:42] (03PS3) 10Ayounsi: esams: add Ganeti "customer" [homer/public] - 10https://gerrit.wikimedia.org/r/1180081 (https://phabricator.wikimedia.org/T402259) [08:03:29] Hamishcz: sorry, first patch is done, on the second patch but we're overtime. [08:03:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3005.esams.wmnet with OS bookworm [08:03:40] RESOLVED: [3x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:03:51] If there are nothing schedule, you can go ahead after my patch is done. [08:04:25] (03PS2) 10Huei Tan: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) [08:05:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:05:27] (03CR) 10David Caro: "> For the rest, how would the alerts/task notification to users work? I'm asking because I worry that unless that's fully automatic (i.e. 
" [alerts] - 10https://gerrit.wikimedia.org/r/1182900 (https://phabricator.wikimedia.org/T402932) (owner: 10David Caro) [08:05:36] (03PS1) 10Slyngshede: P:puppetserver::volatile enable datacenter timer [puppet] - 10https://gerrit.wikimedia.org/r/1183612 (https://phabricator.wikimedia.org/T398161) [08:05:57] (03CR) 10Ayounsi: [C:03+2] esams: add Ganeti "customer" [homer/public] - 10https://gerrit.wikimedia.org/r/1180081 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [08:06:42] okayyy i think I can wait [08:06:57] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6804/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:07:03] but I cant deploy myself right? if my memory is correct [08:07:23] (03Merged) 10jenkins-bot: esams: add Ganeti "customer" [homer/public] - 10https://gerrit.wikimedia.org/r/1180081 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [08:07:47] (03PS3) 10Elukey: role::maps: increase max-conns and shared buffers on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) [08:09:31] (03Merged) 10jenkins-bot: Update HomepageVisit schema to 1.6.1 [extensions/GrowthExperiments] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182862 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [08:09:47] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1182862|Update HomepageVisit schema to 1.6.1 (T402496 T402497)]] [08:09:52] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [08:09:52] T402497: Tracking code for Scenarios 2 for WE2.1.1 - https://phabricator.wikimedia.org/T402497 [08:10:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:10:29] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6805/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:13:12] (03CR) 10Vgutierrez: [C:03+1] acme-chief: Move clean-stale-certs to file [puppet] - 10https://gerrit.wikimedia.org/r/1174881 (https://phabricator.wikimedia.org/T399419) (owner: 10BCornwall) [08:16:07] !log kartik@deploy1003 hueitan, kartik: Backport for [[gerrit:1182862|Update HomepageVisit schema to 1.6.1 (T402496 T402497)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:16:12] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [08:16:12] T402497: Tracking code for Scenarios 2 for WE2.1.1 - https://phabricator.wikimedia.org/T402497 [08:16:16] (03CR) 10Elukey: [V:03+1] role::maps: increase max-conns and shared buffers on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:17:04] (03CR) 10Vgutierrez: "makes sense but I'm wondering if initially we could manage purge value via hiera son we can test on a few hosts before proceeding with a g" [puppet] - 10https://gerrit.wikimedia.org/r/1178597 (https://phabricator.wikimedia.org/T401858) (owner: 10JHathaway) [08:17:09] (03CR) 10Elukey: [V:03+1] role::maps: increase max-conns and shared buffers on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:18:49] (03CR) 10KartikMistry: Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled (032 comments) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1182944 (owner: 10Jdlrobson) [08:18:57] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11135094 (10phaultfinder) [08:19:11] (03CR) 10KartikMistry: [C:03+1] Cleanup: Simplify configuration for wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182944 (owner: 10Jdlrobson) [08:19:45] (03CR) 10KartikMistry: "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) (owner: 10KartikMistry) [08:20:01] !log kartik@deploy1003 hueitan, kartik: Continuing with sync [08:20:20] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for es2049.codfw.wmnet [08:21:14] (03PS4) 10Elukey: role::maps: increase max-conns and shared buffers on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) [08:23:14] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:23:54] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11135108 (10phaultfinder) [08:25:05] (03CR) 10Elukey: [V:03+1] role::maps: increase max-conns and shared buffers on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:25:09] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182862|Update HomepageVisit schema to 1.6.1 (T402496 T402497)]] (duration: 15m 21s) [08:25:13] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [08:25:14] T402497: Tracking code for Scenarios 2 for WE2.1.1 - 
https://phabricator.wikimedia.org/T402497 [08:27:04] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11135114 (10elukey) I am also seeing a lot of the following logs in various replicas: ` GMT FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000... [08:27:21] fceratto@cumin1002 upgrade (PID 3914791) is awaiting input [08:28:02] Hamishcz: I'm done. [08:28:33] yea im still around but i'm in need of your help to deploy [08:28:46] I can deploy [08:29:00] i think i cant deploy myself due to lack of permission? [08:29:30] dcausse: oh thank you [08:29:59] next is Msz2001? [08:30:13] jouncebot: nowandnext [08:30:13] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [08:30:14] In 1 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1000) [08:30:42] Msz2001: are you around? [08:30:42] I've moved the patches to the next window, but if we have time and there's somebody to deploy them for me, let's do that [08:31:32] Msz2001: ack, can I ship both at once? 
[08:31:36] YEs [08:31:41] ok [08:32:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182692 (https://phabricator.wikimedia.org/T403148) (owner: 10Mszwarc) [08:32:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182798 (https://phabricator.wikimedia.org/T280532) (owner: 10Mszwarc) [08:33:00] (please note that the one about logo requires a script to be run afterwards, to purge the server-side cache; it's mentioned in the Phab task) [08:33:07] ok [08:33:40] (03Merged) 10jenkins-bot: Revert "wikimaniawiki: update logo to 2025" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182692 (https://phabricator.wikimedia.org/T403148) (owner: 10Mszwarc) [08:33:42] (03Merged) 10jenkins-bot: Remove setting `wgEnablePartialActionBlocks`. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182798 (https://phabricator.wikimedia.org/T280532) (owner: 10Mszwarc) [08:33:58] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1182692|Revert "wikimaniawiki: update logo to 2025" (T403148)]], [[gerrit:1182798|Remove setting `wgEnablePartialActionBlocks`. (T280532)]] [08:34:03] T403148: Change Wikimania wiki logo from 2025 to generic - https://phabricator.wikimedia.org/T403148 [08:34:03] T280532: Remove partial action blocks feature flag - https://phabricator.wikimedia.org/T280532 [08:36:55] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es2049.codfw.wmnet [08:39:35] !log dcausse@deploy1003 mszwarc, dcausse: Backport for [[gerrit:1182692|Revert "wikimaniawiki: update logo to 2025" (T403148)]], [[gerrit:1182798|Remove setting `wgEnablePartialActionBlocks`. (T280532)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:39:39] T403148: Change Wikimania wiki logo from 2025 to generic - https://phabricator.wikimedia.org/T403148 [08:39:40] T280532: Remove partial action blocks feature flag - https://phabricator.wikimedia.org/T280532 [08:40:06] Both patches work fine [08:40:11] Msz2001: ack [08:40:52] !log dcausse@deploy1003 mszwarc, dcausse: Continuing with sync [08:41:34] !log jmm@cumin2002 START - Cookbook sre.postgresql.postgres-init [08:42:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [08:42:24] kostajh: o/ are you around for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ConfirmEdit/+/1183112 ? [08:42:47] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [08:42:54] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1179 (T403362)', diff saved to https://phabricator.wikimedia.org/P82283 and previous config saved to /var/cache/conftool/dbconfig/20250901-084254-ladsgroup.json [08:42:57] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [08:44:29] jouncebot: nowandnext [08:44:29] No deployments scheduled for the next 1 hour(s) and 15 minute(s) [08:44:29] In 1 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1000) [08:44:45] (03CR) 10Vgutierrez: [C:03+1] P:puppetserver::volatile enable datacenter timer [puppet] - 10https://gerrit.wikimedia.org/r/1183612 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [08:45:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance [08:45:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T401906)', diff saved to https://phabricator.wikimedia.org/P82284 and previous config saved to /var/cache/conftool/dbconfig/20250901-084558-fceratto.json [08:46:02] 
T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [08:46:03] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182692|Revert "wikimaniawiki: update logo to 2025" (T403148)]], [[gerrit:1182798|Remove setting `wgEnablePartialActionBlocks`. (T280532)]] (duration: 12m 05s) [08:46:07] T403148: Change Wikimania wiki logo from 2025 to generic - https://phabricator.wikimedia.org/T403148 [08:46:08] T280532: Remove partial action blocks feature flag - https://phabricator.wikimedia.org/T280532 [08:46:21] dcausse: I'm around [08:46:28] kostajh: ok [08:46:59] Msz2001: running the maint script now [08:47:13] purged, I can see the new logo now [08:47:30] Me too. Thanks for deploying! [08:47:40] yw! :) [08:48:33] kostajh: do you mind if I ship Hamishcz config patch quickly while yours run through CI? [08:48:51] Hamishcz: are you still around? [08:49:25] yes [08:49:56] ok shipping your patch now [08:51:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2049.codfw.wmnet with reason: T402859 [08:51:03] T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859 [08:51:12] !log jmm@cumin2002 START - Cookbook sre.postgresql.postgres-init [08:51:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183279 (https://phabricator.wikimedia.org/T403350) (owner: 10Hamish) [08:52:11] (03Merged) 10jenkins-bot: Lift permission for event-organizer in Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183279 (https://phabricator.wikimedia.org/T403350) (owner: 10Hamish) [08:52:24] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1183279|Lift permission for event-organizer in Chinese Wikipedia (T403350)]] [08:52:28] T403350: Lift permission for event-organizer in Chinese 
Wikipedia - https://phabricator.wikimedia.org/T403350 [08:52:47] (03CR) 10DCausse: [C:03+2] hCaptcha: Provide label/help in authmanagerinfo API calls [extensions/ConfirmEdit] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183112 (https://phabricator.wikimedia.org/T403253) (owner: 10Kosta Harlan) [08:56:17] dcausse: sorry for the late reply, yes, no problem [08:56:27] np :) [08:57:39] (03CR) 10Muehlenhoff: role::maps: increase max-conns and shared buffers on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [08:58:27] !log dcausse@deploy1003 hamishz, dcausse: Backport for [[gerrit:1183279|Lift permission for event-organizer in Chinese Wikipedia (T403350)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:58:29] T403350: Lift permission for event-organizer in Chinese Wikipedia - https://phabricator.wikimedia.org/T403350 [08:58:50] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135398 (10ayounsi) [08:59:01] Hamishcz: it's on test servers, please let me know if everything's OK [08:59:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T401906)', diff saved to https://phabricator.wikimedia.org/P82285 and previous config saved to /var/cache/conftool/dbconfig/20250901-085920-fceratto.json [08:59:22] 06SRE, 10SRE-swift-storage, 10Ceph, 10envoy, 06serviceops: Data-persistence envoy upgrades to 1.26.8-1 - https://phabricator.wikimedia.org/T403374 (10MatthewVernon) 03NEW [08:59:23] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [08:59:42] seems still not live on testserver? [09:00:05] h [09:00:09] hmm.. 
should be [09:00:10] (03CR) 10Tiziano Fogli: [C:03+2] check_prometheus: add migration task param [puppet] - 10https://gerrit.wikimedia.org/r/1183126 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [09:00:22] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T370153 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183127 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [09:00:45] ah live now, checked and LGTM [09:00:53] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T309012 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183128 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [09:01:04] Hamishcz: ack, shipping [09:01:05] (03CR) 10Elukey: [V:03+1] role::maps: increase max-conns and shared buffers on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183609 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:01:07] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T370157 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183129 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [09:01:08] ty [09:01:21] (03CR) 10Tiziano Fogli: [C:03+2] monitoring services: add migration task T315866 to instances [puppet] - 10https://gerrit.wikimedia.org/r/1183130 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [09:01:34] !log dcausse@deploy1003 hamishz, dcausse: Continuing with sync [09:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:28] (03CR) 10Vgutierrez: [C:03+2] Varnish: Fix rate limit comment to match code [puppet] - 10https://gerrit.wikimedia.org/r/1183245 (https://phabricator.wikimedia.org/T400119) 
(owner: 10Pppery) [09:04:57] !upgrade envoyproxy on ms-fe T403374 [09:04:58] T403374: Data-persistence envoy upgrades to 1.26.8-1 - https://phabricator.wikimedia.org/T403374 [09:05:44] jouncebot: nowandnext [09:05:44] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [09:05:44] In 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1000) [09:05:57] (03Merged) 10jenkins-bot: hCaptcha: Provide label/help in authmanagerinfo API calls [extensions/ConfirmEdit] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183112 (https://phabricator.wikimedia.org/T403253) (owner: 10Kosta Harlan) [09:06:44] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183279|Lift permission for event-organizer in Chinese Wikipedia (T403350)]] (duration: 14m 20s) [09:06:47] T403350: Lift permission for event-organizer in Chinese Wikipedia - https://phabricator.wikimedia.org/T403350 [09:07:13] Hamishcz: should be live [09:07:33] kostajh: shipping your patch now [09:07:44] dcausse: thanks! [09:08:32] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1183112|hCaptcha: Provide label/help in authmanagerinfo API calls (T403253)]] [09:08:35] T403253: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given, called in /srv/mediawiki/php-1.45.0-wmf.16/includes/api/ApiAuthManagerHelper.php on l - https://phabricator.wikimedia.org/T403253 [09:09:13] dcausse: okay now thank you :) [09:09:21] yw! 
:) [09:09:40] 10ops-esams, 06DC-Ops: ganeti3005 doesn't come back up during reimage - https://phabricator.wikimedia.org/T403375 (10MoritzMuehlenhoff) 03NEW [09:10:10] 10ops-esams, 06DC-Ops: ganeti3005 doesn't come back up during reimage - https://phabricator.wikimedia.org/T403375#11135458 (10MoritzMuehlenhoff) [09:10:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135459 (10MoritzMuehlenhoff) [09:14:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P82286 and previous config saved to /var/cache/conftool/dbconfig/20250901-091427-fceratto.json [09:14:32] !log dcausse@deploy1003 kharlan, dcausse: Backport for [[gerrit:1183112|hCaptcha: Provide label/help in authmanagerinfo API calls (T403253)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:14:35] T403253: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given, called in /srv/mediawiki/php-1.45.0-wmf.16/includes/api/ApiAuthManagerHelper.php on l - https://phabricator.wikimedia.org/T403253 [09:14:53] kostajh: should be on testservers ready to test [09:15:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T403362)', diff saved to https://phabricator.wikimedia.org/P82287 and previous config saved to /var/cache/conftool/dbconfig/20250901-091531-ladsgroup.json [09:15:35] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [09:16:55] dcausse: looking [09:19:27] dcausse: lgtm [09:19:35] (03CR) 10Filippo Giunchedi: [C:03+2] wmcs: alert on nova agents unavailable [alerts] - 10https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [09:19:36] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [09:19:37] kostajh: ack, shipping [09:19:39] !log dcausse@deploy1003 kharlan, dcausse: Continuing with sync [09:19:53] 10ops-esams, 06DC-Ops: esams: document power cables in Netbox - https://phabricator.wikimedia.org/T403376 (10ayounsi) 03NEW [09:20:12] (03CR) 10Filippo Giunchedi: [C:03+2] openstack: move nova-compute alerts to higher level [puppet] - 10https://gerrit.wikimedia.org/r/1182085 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [09:21:40] (03CR) 10Filippo Giunchedi: [C:03+1] "Ok to go ahead and see how much busywork we'll self-inflict" [alerts] - 10https://gerrit.wikimedia.org/r/1182900 (https://phabricator.wikimedia.org/T402932) (owner: 10David 
Caro) [09:22:25] (03Abandoned) 10Filippo Giunchedi: rake: default to python3 [puppet] - 10https://gerrit.wikimedia.org/r/1122090 (owner: 10Filippo Giunchedi) [09:22:26] (03CR) 10Btullis: [C:03+1] postgresql-airflow-main: increase max CPU and disk space [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183611 (owner: 10Brouberol) [09:22:32] (03CR) 10Brouberol: [C:03+2] postgresql-airflow-main: increase max CPU and disk space [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183611 (owner: 10Brouberol) [09:22:37] (03Abandoned) 10Filippo Giunchedi: grafana: set max_source_resolution=auto for thanos ds [puppet] - 10https://gerrit.wikimedia.org/r/1135948 (https://phabricator.wikimedia.org/T371102) (owner: 10Filippo Giunchedi) [09:23:06] (03CR) 10Hnowlan: [C:03+2] rest-gateway: route wikifeeds configuration endpoint. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182876 (https://phabricator.wikimedia.org/T403193) (owner: 10Dbrant) [09:23:23] one more deploy after this one and we should be done [09:24:23] 06SRE, 10SRE-swift-storage, 10Ceph, 10envoy, 06serviceops: Data-persistence envoy upgrades to 1.26.8-1 - https://phabricator.wikimedia.org/T403374#11135496 (10MatthewVernon) [09:24:48] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183112|hCaptcha: Provide label/help in authmanagerinfo API calls (T403253)]] (duration: 16m 15s) [09:24:48] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti3005.esams.wmnet with OS bookworm [09:24:50] T403253: TypeError: MediaWiki\Api\ApiAuthManagerHelper::formatMessage(): Argument #3 ($message) must be of type MediaWiki\Message\Message, null given, called in /srv/mediawiki/php-1.45.0-wmf.16/includes/api/ApiAuthManagerHelper.php on l - https://phabricator.wikimedia.org/T403253 [09:25:03] kostajh: should be live [09:25:08] (03Merged) 10jenkins-bot: rest-gateway: route wikifeeds configuration endpoint. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1182876 (https://phabricator.wikimedia.org/T403193) (owner: 10Dbrant) [09:25:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [09:25:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [09:25:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183454 (https://phabricator.wikimedia.org/T401220) (owner: 10DCausse) [09:26:50] (03Merged) 10jenkins-bot: SECURITY: declare PoolCounter settings for cirrusbuilddoc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183454 (https://phabricator.wikimedia.org/T401220) (owner: 10DCausse) [09:27:06] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1183454|SECURITY: declare PoolCounter settings for cirrusbuilddoc (T401220)]] [09:28:39] (03PS1) 10JMeybohm: Update SSH key for conniecc1 [puppet] - 10https://gerrit.wikimedia.org/r/1183617 (https://phabricator.wikimedia.org/T403242) [09:28:46] dcausse: thanks for deploying! 
[09:29:17] yw :) [09:29:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P82288 and previous config saved to /var/cache/conftool/dbconfig/20250901-092934-fceratto.json [09:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [09:29:39] (03PS1) 10Slyngshede: P:cache::haproxy Allow user-agents with contact information [puppet] - 10https://gerrit.wikimedia.org/r/1183618 (https://phabricator.wikimedia.org/T400119) [09:30:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P82289 and previous config saved to /var/cache/conftool/dbconfig/20250901-093039-ladsgroup.json [09:32:10] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11135537 (10Vgutierrez) >>! In T400119#11134249, @Don-vip wrote: > I still have [[ https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/jobs/601049 |... [09:33:04] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6807/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183618 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [09:33:16] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1183454|SECURITY: declare PoolCounter settings for cirrusbuilddoc (T401220)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:36:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) (owner: 10D3r1ck01) [09:37:39] jouncebot: nownandnext [09:37:43] jouncebot: nowandnext [09:37:43] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [09:37:43] In 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1000) [09:37:58] !log dcausse@deploy1003 dcausse: Continuing with sync [09:38:12] (03CR) 10Tiziano Fogli: [C:03+2] pdb_resource_exporter: add check_prometheus tasks query [puppet] - 10https://gerrit.wikimedia.org/r/1183131 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [09:38:16] (03CR) 10D3r1ck01: "NOTE: Once this patch is deployed, it'll be a no-op until Ib5702b11b3ef642b6eda6e4c291c2fb670bc07f1 gets merged. This config patch only in" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) (owner: 10D3r1ck01) [09:41:27] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:41:36] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:43:05] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:43:17] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:43:40] sigh I accidentally hit Ctrl-C in scap, it was at "deployment progress: 87% (ok: 1963; fail: 0; left: 276)" [09:44:15] can I just rerun scap or is there anything I should do? 
[09:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:44:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T401906)', diff saved to https://phabricator.wikimedia.org/P82290 and previous config saved to /var/cache/conftool/dbconfig/20250901-094442-fceratto.json [09:44:49] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [09:44:57] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance [09:45:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T401906)', diff saved to https://phabricator.wikimedia.org/P82291 and previous config saved to /var/cache/conftool/dbconfig/20250901-094504-fceratto.json [09:45:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P82292 and previous config saved to /var/cache/conftool/dbconfig/20250901-094547-ladsgroup.json [09:47:09] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:47:17] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:47:28] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1183454|SECURITY: declare PoolCounter settings for cirrusbuilddoc (T401220)]] [09:52:56] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1183454|SECURITY: declare PoolCounter settings for cirrusbuilddoc (T401220)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:53:15] !log dcausse@deploy1003 dcausse: Continuing with sync [09:55:03] (03CR) 10Arendpieter: [C:03+1] SUL3: Use `metawiki` as central wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) (owner: 10D3r1ck01) [09:55:27] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11135615 (10JMeybohm) Please keep in mind that allowing the HTTP proxy IPs will ultimately allow Enterprise API access from all systems allowed to use the HTTP proxi... [09:58:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T401906)', diff saved to https://phabricator.wikimedia.org/P82293 and previous config saved to /var/cache/conftool/dbconfig/20250901-095822-fceratto.json [09:58:26] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [09:58:40] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183454|SECURITY: declare PoolCounter settings for cirrusbuilddoc (T401220)]] (duration: 11m 12s) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1000) [10:00:43] (03PS1) 10Elukey: profile::base: Pin linux-base's version for Bookworm bpo [puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) [10:00:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T403362)', diff saved to https://phabricator.wikimedia.org/P82294 and previous config saved to /var/cache/conftool/dbconfig/20250901-100054-ladsgroup.json [10:00:59] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [10:01:12] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with 
reason: Maintenance [10:01:26] (03PS1) 10Volans: insetup: fix report recipients [puppet] - 10https://gerrit.wikimedia.org/r/1183622 [10:04:38] !log jmm@cumin2002 START - Cookbook sre.postgresql.postgres-init [10:04:43] !upgrade envoyproxy on thanos-fe T403374 [10:04:44] T403374: Data-persistence envoy upgrades to 1.26.8-1 - https://phabricator.wikimedia.org/T403374 [10:05:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [10:05:35] (03CR) 10JMeybohm: [C:03+2] Update SSH key for conniecc1 [puppet] - 10https://gerrit.wikimedia.org/r/1183617 (https://phabricator.wikimedia.org/T403242) (owner: 10JMeybohm) [10:06:26] !log jmm@cumin2002 START - Cookbook sre.postgresql.postgres-init [10:06:54] (03CR) 10MVernon: [C:03+2] swift: use admin to manage swift uid/gid, remove old bodges [puppet] - 10https://gerrit.wikimedia.org/r/1182573 (https://phabricator.wikimedia.org/T123918) (owner: 10MVernon) [10:07:19] !log jmm@cumin2002 START - Cookbook sre.postgresql.postgres-init [10:07:57] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update SSH key for Connie Chen - https://phabricator.wikimedia.org/T403242#11135656 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Hi @cchen, I've updated your SSH key according to this request. 
[10:09:49] (03PS2) 10Elukey: profile::base: Pin linux-base's version for Bookworm bpo [puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) [10:09:57] (03PS1) 10Btullis: Remove old hadoop workers from the exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/1183624 (https://phabricator.wikimedia.org/T397166) [10:10:37] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6809/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [10:11:23] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918#11135667 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon All done! [10:12:12] (03PS1) 10David Caro: test_cookbook.py: fix typo in the help to log dir [puppet] - 10https://gerrit.wikimedia.org/r/1183625 [10:12:43] (03CR) 10Volans: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1183625 (owner: 10David Caro) [10:13:28] (03CR) 10David Caro: [C:03+2] test_cookbook.py: fix typo in the help to log dir [puppet] - 10https://gerrit.wikimedia.org/r/1183625 (owner: 10David Caro) [10:13:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P82295 and previous config saved to /var/cache/conftool/dbconfig/20250901-101330-fceratto.json [10:13:45] 06SRE, 10SRE-swift-storage, 10Ceph, 10envoy, 06serviceops: Data-persistence envoy upgrades to 1.26.8-1 - https://phabricator.wikimedia.org/T403374#11135684 (10MatthewVernon) [10:14:57] (03CR) 10Ladsgroup: "recheck" [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183269 (https://phabricator.wikimedia.org/T299951) (owner: 10Ladsgroup) [10:15:28] (03PS1) 10Muehlenhoff: postgresql.postgres-init: Ensure that it's run from within a screen session [cookbooks] - 10https://gerrit.wikimedia.org/r/1183626 [10:16:15] (03CR) 10Brouberol: [C:03+1] Remove old hadoop workers from the exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/1183624 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis) [10:16:36] (03CR) 10Volans: postgresql.postgres-init: Ensure that it's run from within a screen session (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1183626 (owner: 10Muehlenhoff) [10:18:05] (03PS2) 10Muehlenhoff: postgresql.postgres-init: Ensure that it's run from within a screen session [cookbooks] - 10https://gerrit.wikimedia.org/r/1183626 [10:18:09] (03PS6) 10Cyndywikime: [Growth] enwiki: Deploy "Add a link" to 100% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) [10:18:30] (03CR) 10Cyndywikime: "This patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [10:18:33] (03CR) 
10Muehlenhoff: postgresql.postgres-init: Ensure that it's run from within a screen session (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1183626 (owner: 10Muehlenhoff) [10:18:42] (03PS1) 10MVernon: swift: remove 3 drained eqiad nodes for disk controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1183627 (https://phabricator.wikimedia.org/T400877) [10:18:44] (03PS1) 10MVernon: swift: re-add 3 nodes, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1183628 (https://phabricator.wikimedia.org/T400877) [10:20:09] (03CR) 10Btullis: [C:03+2] Remove old hadoop workers from the exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/1183624 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis) [10:20:11] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1183626 (owner: 10Muehlenhoff) [10:22:01] !upgrade envoyproxy on apus frontends T403374 [10:22:01] T403374: Data-persistence envoy upgrades to 1.26.8-1 - https://phabricator.wikimedia.org/T403374 [10:24:58] (03CR) 10Jcrespo: [C:03+1] swift: re-add 3 nodes, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1183628 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [10:25:09] (03CR) 10Jcrespo: [C:03+1] swift: remove 3 drained eqiad nodes for disk controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1183627 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [10:28:13] (03CR) 10Muehlenhoff: [C:03+2] postgresql.postgres-init: Ensure that it's run from within a screen session [cookbooks] - 10https://gerrit.wikimedia.org/r/1183626 (owner: 10Muehlenhoff) [10:28:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P82296 and previous config saved to /var/cache/conftool/dbconfig/20250901-102837-fceratto.json [10:28:45] (03PS2) 10MVernon: swift: remove 3 drained eqiad nodes for disk controller swap [puppet] - 
10https://gerrit.wikimedia.org/r/1183627 (https://phabricator.wikimedia.org/T400877) [10:28:45] (03PS2) 10MVernon: swift: re-add 3 nodes, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1183628 (https://phabricator.wikimedia.org/T400877) [10:29:00] (03PS3) 10Elukey: profile::base: Pin linux-base's version for Bookworm bpo [puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) [10:29:28] (03PS1) 10Ladsgroup: common.yaml: Remove two more dropped tables from the list of private [puppet] - 10https://gerrit.wikimedia.org/r/1183629 (https://phabricator.wikimedia.org/T398945) [10:29:49] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6811/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [10:30:17] (03PS2) 10Ladsgroup: common.yaml: Remove two more dropped tables from the list of private [puppet] - 10https://gerrit.wikimedia.org/r/1183629 (https://phabricator.wikimedia.org/T398945) [10:30:27] (03CR) 10Ladsgroup: [V:03+2 C:03+2] "Doesn't exist in production" [puppet] - 10https://gerrit.wikimedia.org/r/1183629 (https://phabricator.wikimedia.org/T398945) (owner: 10Ladsgroup) [10:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [10:33:48] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1183622 (owner: 10Volans) [10:34:05] (03PS1) 10Btullis: Move the fifth hadoop journalnode [puppet] - 10https://gerrit.wikimedia.org/r/1183630 (https://phabricator.wikimedia.org/T397166) [10:36:25] (03CR) 10MVernon: [C:03+2] swift: remove 3 drained eqiad 
nodes for disk controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1183627 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [10:36:54] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336#11135721 (10JMeybohm) [10:38:36] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6813/console" [puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [10:39:20] (03CR) 10Zabe: "the failure is fixed by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CategoryTree/+/1182550 and https://gerrit.wikimedia.org/r/c/" [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183269 (https://phabricator.wikimedia.org/T299951) (owner: 10Ladsgroup) [10:39:20] (03PS4) 10Elukey: profile::base: Pin linux-base's version for Bookworm bpo [puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) [10:39:38] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336#11135739 (10JMeybohm) @Dreamy_Jazz please sign off your sponsor role @Milimetric || @Ahoelzl || @Ottomata please sign off for `analytics-privatedata-users` access [10:39:47] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Novem Linguae - https://phabricator.wikimedia.org/T403336#11135740 (10Dreamy_Jazz) I confirm that I am sponsoring this request. 
[10:40:13] (03PS3) 10Abijeet Patro: CentralNotice banner experiment WE2.1.1 - Add missing extension config [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [10:40:36] (03CR) 10Elukey: [C:03+1] ":(" [puppet] - 10https://gerrit.wikimedia.org/r/1183622 (owner: 10Volans) [10:41:53] (03CR) 10Ladsgroup: "Yeah, noticed. Now thinking whether I should band-aid it until wmf.16 rolls out." [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183269 (https://phabricator.wikimedia.org/T299951) (owner: 10Ladsgroup) [10:42:45] (03CR) 10Ladsgroup: "wmf.17" [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183269 (https://phabricator.wikimedia.org/T299951) (owner: 10Ladsgroup) [10:43:29] (03CR) 10Volans: [C:03+2] insetup: fix report recipients [puppet] - 10https://gerrit.wikimedia.org/r/1183622 (owner: 10Volans) [10:43:32] (03PS1) 10Ladsgroup: ParserTestRunner: Update category counts for articles [core] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183632 (https://phabricator.wikimedia.org/T365303) [10:43:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T401906)', diff saved to https://phabricator.wikimedia.org/P82297 and previous config saved to /var/cache/conftool/dbconfig/20250901-104345-fceratto.json [10:43:48] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [10:44:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance [10:44:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T401906)', diff saved to https://phabricator.wikimedia.org/P82298 and previous config saved to /var/cache/conftool/dbconfig/20250901-104407-fceratto.json [10:44:22] jouncebot: 
nowandnext [10:44:22] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1000) [10:44:22] In 2 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1300) [10:44:44] (03PS1) 10Ladsgroup: CategoryCacheTest: Update category count [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183633 [10:45:06] (03PS2) 10Ladsgroup: Drop support for categorylinks read old [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183269 (https://phabricator.wikimedia.org/T299951) [10:45:21] 06SRE, 10SRE-swift-storage, 10Ceph, 10envoy, 06serviceops: Data-persistence envoy upgrades to 1.26.8-1 - https://phabricator.wikimedia.org/T403374#11135763 (10MatthewVernon) [10:45:29] 06SRE, 10SRE-swift-storage, 10Ceph, 10envoy, 06serviceops: Data-persistence envoy upgrades to 1.26.8-1 - https://phabricator.wikimedia.org/T403374#11135764 (10MatthewVernon) 05Open→03Resolved [10:45:34] (03CR) 10Ladsgroup: [C:03+2] ParserTestRunner: Update category counts for articles [core] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183632 (https://phabricator.wikimedia.org/T365303) (owner: 10Ladsgroup) [10:45:53] !log installing luajit security updates [10:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:06] (03CR) 10Ladsgroup: [C:03+2] CategoryCacheTest: Update category count [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183633 (owner: 10Ladsgroup) [10:47:11] (03CR) 10Ladsgroup: [C:03+2] Drop support for categorylinks read old [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183269 (https://phabricator.wikimedia.org/T299951) (owner: 10Ladsgroup) [10:49:04] (03CR) 10Abijeet Patro: [C:03+1] CentralNotice banner experiment WE2.1.1 - Add missing extension config 
[extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [10:49:40] (03PS2) 10Btullis: Move the fifth hadoop journalnode [puppet] - 10https://gerrit.wikimedia.org/r/1183630 (https://phabricator.wikimedia.org/T397166) [10:50:15] (03CR) 10Brouberol: [C:03+1] Move the fifth hadoop journalnode [puppet] - 10https://gerrit.wikimedia.org/r/1183630 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis) [10:50:36] (03CR) 10Btullis: [C:03+2] Move the fifth hadoop journalnode [puppet] - 10https://gerrit.wikimedia.org/r/1183630 (https://phabricator.wikimedia.org/T397166) (owner: 10Btullis) [10:50:48] 06SRE, 06Traffic, 10Wikidata, 10Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959#11135769 (10Lydia_Pintscher) >>! In T402959#11132802, @CDanis wrote: > Hi @Lydia_Pintscher , SRE can make some... [10:53:38] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11135773 (10Don-vip) >>! In T400119#11135537, @Vgutierrez wrote: > It looks like that test for some reason is using the default UA of the HttpClient librar... 
[10:57:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T401906)', diff saved to https://phabricator.wikimedia.org/P82299 and previous config saved to /var/cache/conftool/dbconfig/20250901-105724-fceratto.json [10:57:28] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:01:22] (03Merged) 10jenkins-bot: ParserTestRunner: Update category counts for articles [core] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183632 (https://phabricator.wikimedia.org/T365303) (owner: 10Ladsgroup) [11:01:25] (03Merged) 10jenkins-bot: CategoryCacheTest: Update category count [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183633 (owner: 10Ladsgroup) [11:01:26] (03Merged) 10jenkins-bot: Drop support for categorylinks read old [extensions/CategoryTree] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183269 (https://phabricator.wikimedia.org/T299951) (owner: 10Ladsgroup) [11:03:56] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1183632|ParserTestRunner: Update category counts for articles (T365303)]], [[gerrit:1183633|CategoryCacheTest: Update category count]], [[gerrit:1183269|Drop support for categorylinks read old (T299951 T403147 T403337)]] [11:04:05] T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303 [11:04:06] T299951: Normalize categorylinks table - https://phabricator.wikimedia.org/T299951 [11:04:06] T403147: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_to' in 'ON'Function: MediaWiki\Extension\CategoryTree\CategoryTree::renderChildrenQuery: SELECT page_id,page_namespace,page_title,page_is_redirect,page_len,page_la - https://phabricator.wikimedia.org/T403147 [11:04:06] T403337: Wikimedia\Rdbms\DBQueryError: Inaccessible page in the Project_talk namespace on jawiki - 
https://phabricator.wikimedia.org/T403337 [11:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:09:48] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1183632|ParserTestRunner: Update category counts for articles (T365303)]], [[gerrit:1183633|CategoryCacheTest: Update category count]], [[gerrit:1183269|Drop support for categorylinks read old (T299951 T403147 T403337)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:09:55] T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303 [11:09:56] T299951: Normalize categorylinks table - https://phabricator.wikimedia.org/T299951 [11:09:56] T403147: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_to' in 'ON'Function: MediaWiki\Extension\CategoryTree\CategoryTree::renderChildrenQuery: SELECT page_id,page_namespace,page_title,page_is_redirect,page_len,page_la - https://phabricator.wikimedia.org/T403147 [11:09:56] T403337: Wikimedia\Rdbms\DBQueryError: Inaccessible page in the Project_talk namespace on jawiki - https://phabricator.wikimedia.org/T403337 [11:11:14] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:12:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P82300 and previous config saved to /var/cache/conftool/dbconfig/20250901-111232-fceratto.json [11:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:16:24] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183632|ParserTestRunner: Update category counts for articles (T365303)]], [[gerrit:1183633|CategoryCacheTest: Update category count]], [[gerrit:1183269|Drop support for categorylinks read old (T299951 T403147 T403337)]] (duration: 12m 28s) [11:16:32] T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303 [11:16:32] T299951: Normalize categorylinks table - https://phabricator.wikimedia.org/T299951 [11:16:33] T403147: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_to' in 'ON'Function: MediaWiki\Extension\CategoryTree\CategoryTree::renderChildrenQuery: SELECT page_id,page_namespace,page_title,page_is_redirect,page_len,page_la - https://phabricator.wikimedia.org/T403147 [11:16:33] T403337: Wikimedia\Rdbms\DBQueryError: Inaccessible page in the Project_talk namespace on jawiki - https://phabricator.wikimedia.org/T403337 [11:17:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:18:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135896 (10MoritzMuehlenhoff) Since ganeti3005 has hardware issues which will take some time to resolve, we'll proceed with the second cluster in a similar manner... 
[11:20:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet [11:21:38] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [11:21:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135899 (10ops-monitoring-bot) Draining ganeti3006.esams.wmnet of running VMs [11:22:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3006.esams.wmnet [11:22:55] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [11:24:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install3003.wikimedia.org to plain [11:24:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135914 (10ops-monitoring-bot) VM install3003.wikimedia.org switching disk type to plain [11:25:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install3003.wikimedia.org to plain [11:25:33] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance [11:26:00] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11135918 (10MatthewVernon) [11:26:16] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:27:16] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ms-be[1083-1085].eqiad.wmnet with reason: awaiting controller swap [11:27:28] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - 
https://phabricator.wikimedia.org/T400877#11135919 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5d9cb26e-171b-4940-aeef-3b79dd0f568e) set by mvernon@cu... [11:27:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P82301 and previous config saved to /var/cache/conftool/dbconfig/20250901-112739-fceratto.json [11:28:26] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11135921 (10MatthewVernon) @VRiley-WMF three nodes - ms-be1083 ms-be1084 ms-be1085 are now ready for disk swaps, as soon as you've s... [11:30:05] (03PS1) 10Ayounsi: esams: add includes for routed ganeti ranges [dns] - 10https://gerrit.wikimedia.org/r/1183638 (https://phabricator.wikimedia.org/T402259) [11:30:13] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: PuppetPendingCertificateRequest (instance puppetmaster1001:9100) - https://phabricator.wikimedia.org/T403388 (10LSobanski) 03NEW [11:30:49] (03CR) 10CI reject: [V:04-1] esams: add includes for routed ganeti ranges [dns] - 10https://gerrit.wikimedia.org/r/1183638 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [11:32:52] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [11:36:02] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:36:25] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [11:37:35] (03PS1) 10D3r1ck01: Add caller to maintenance script SQL queries [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) [11:38:08] (03PS2) 10D3r1ck01: Add caller to maintenance script SQL queries [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) [11:39:19] (03PS1) 10Volans: data.yaml: add my 
ecdsa-sk SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/1183641 [11:40:04] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: esams v4 routed ganeti IPs - ayounsi@cumin1003" [11:40:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: esams v4 routed ganeti IPs - ayounsi@cumin1003" [11:40:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:40:12] (03CR) 10Ayounsi: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1183638 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [11:40:18] (03CR) 10Lucas Werkmeister (WMDE): CentralNotice banner experiment WE2.1.1 - Add missing extension config (031 comment) [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [11:41:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir3004.esams.wmnet to plain [11:41:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135956 (10ops-monitoring-bot) VM ncredir3004.esams.wmnet switching disk type to plain [11:41:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir3004.esams.wmnet to plain [11:42:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T401906)', diff saved to https://phabricator.wikimedia.org/P82302 and previous config saved to /var/cache/conftool/dbconfig/20250901-114247-fceratto.json [11:42:51] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:43:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on db2190.codfw.wmnet with reason: Maintenance [11:43:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T401906)', diff saved to https://phabricator.wikimedia.org/P82303 and previous config saved to /var/cache/conftool/dbconfig/20250901-114310-fceratto.json [11:43:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) (owner: 10D3r1ck01) [11:43:34] (03CR) 10Muehlenhoff: [C:03+1] "This matches the networks currently configured in Netbox" [dns] - 10https://gerrit.wikimedia.org/r/1183638 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [11:43:45] (03CR) 10Ayounsi: [C:03+2] esams: add includes for routed ganeti ranges [dns] - 10https://gerrit.wikimedia.org/r/1183638 (https://phabricator.wikimedia.org/T402259) (owner: 10Ayounsi) [11:44:00] !log ayounsi@dns1004 START - running authdns-update [11:44:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1183641 (owner: 10Volans) [11:45:07] (03PS2) 10Ladsgroup: Stop writing to cl_to and cl_collation on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181720 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [11:45:16] !log ayounsi@dns1004 END - running authdns-update [11:45:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum3004.esams.wmnet to plain [11:45:51] jouncebot: nowandnext [11:45:51] No deployments scheduled for the next 1 hour(s) and 14 minute(s) [11:45:51] In 1 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1300) [11:45:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams 
to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135967 (10ops-monitoring-bot) VM durum3004.esams.wmnet switching disk type to plain [11:46:09] (03CR) 10Ladsgroup: [C:03+2] Stop writing to cl_to and cl_collation on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181720 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [11:46:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum3004.esams.wmnet to plain [11:46:23] (03PS1) 10Btullis: Use the standby analytics_meta mariadb server temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1183642 (https://phabricator.wikimedia.org/T394498) [11:46:25] (03CR) 10Volans: [C:03+2] data.yaml: add my ecdsa-sk SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/1183641 (owner: 10Volans) [11:46:29] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135968 (10ayounsi) [11:46:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181720 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [11:46:57] (03Merged) 10jenkins-bot: Stop writing to cl_to and cl_collation on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181720 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [11:47:11] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1181720|Stop writing to cl_to and cl_collation on commonswiki (T399579)]] [11:47:18] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [11:47:19] (03CR) 10Lucas Werkmeister (WMDE): "Whether this actually needs backporting depends on how soon you want to run the maintenance script for T398177 again, I guess… right now i" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) (owner: 10D3r1ck01) [11:47:21] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [11:47:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1224 (T403362)', diff saved to https://phabricator.wikimedia.org/P82304 and previous config saved to /var/cache/conftool/dbconfig/20250901-114725-ladsgroup.json [11:47:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh3004.wikimedia.org to plain [11:47:28] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [11:48:16] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:48:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11135995 (10ops-monitoring-bot) VM doh3004.wikimedia.org switching disk type to plain [11:48:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh3004.wikimedia.org to plain [11:48:38] PROBLEM - Bird Internet Routing Daemon on durum3004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:49:38] RECOVERY - Bird Internet Routing Daemon on durum3004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:50:39] (03CR) 10D3r1ck01: "Except we don't intend to run soonish, then we can just wait until the master changes rollout before we re-run the script again." 
[extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) (owner: 10D3r1ck01) [11:51:23] (03PS1) 10Btullis: Facilitate a role swap between an-mariadb1001 and an-mariadb1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183643 (https://phabricator.wikimedia.org/T394498) [11:52:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow3003.esams.wmnet to plain [11:52:16] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:52:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11136018 (10ops-monitoring-bot) VM netflow3003.esams.wmnet switching disk type to plain [11:52:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow3003.esams.wmnet to plain [11:53:12] !log ladsgroup@deploy1003 ladsgroup, zabe: Backport for [[gerrit:1181720|Stop writing to cl_to and cl_collation on commonswiki (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:53:15] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [11:54:13] (03PS4) 10Huei Tan: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) [11:54:16] !log ladsgroup@deploy1003 ladsgroup, zabe: Continuing with sync [11:54:37] (03PS5) 10Huei Tan: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) [11:54:50] (03CR) 10Huei Tan: Setup tracking for CentralNotice banners experiment for WE2.1.1 (031 comment) [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [11:55:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus3003.esams.wmnet to plain [11:55:41] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating esams to routed Ganeti - https://phabricator.wikimedia.org/T402259#11136029 (10ops-monitoring-bot) VM prometheus3003.esams.wmnet switching disk type to plain [11:55:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus3003.esams.wmnet to plain [11:56:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T401906)', diff saved to https://phabricator.wikimedia.org/P82305 and previous config saved to /var/cache/conftool/dbconfig/20250901-115649-fceratto.json [11:56:52] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:58:38] PROBLEM - ganeti-confd running on ganeti3006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command 
name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:58:54] PROBLEM - ganeti-noded running on ganeti3006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:58:57] (03PS1) 10Muehlenhoff: Remove ganeti3006 from ganeti02 cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1183645 (https://phabricator.wikimedia.org/T402259) [11:59:18] (03PS2) 10Btullis: Facilitate a role swap between an-mariadb1001 and an-mariadb1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183643 (https://phabricator.wikimedia.org/T394498) [11:59:26] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181720|Stop writing to cl_to and cl_collation on commonswiki (T399579)]] (duration: 12m 15s) [11:59:29] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [11:59:36] FIRING: ProbeDown: Service ganeti3006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:59:47] (03PS2) 10Muehlenhoff: Remove ganeti3006 from ganeti02 cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1183645 (https://phabricator.wikimedia.org/T402259) [12:01:06] (03PS3) 10Btullis: Facilitate a role swap between an-mariadb1001 and an-mariadb1002 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183643 (https://phabricator.wikimedia.org/T394498) [12:02:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183645 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [12:02:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.035s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:11:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P82306 and previous config saved to /var/cache/conftool/dbconfig/20250901-121156-fceratto.json [12:12:00] PROBLEM - very high load average likely xfs on ms-be1091 is CRITICAL: CRITICAL - load average: 106.81, 100.94, 71.84 https://wikitech.wikimedia.org/wiki/Swift [12:13:09] (03CR) 10Ayounsi: [C:03+1] Remove ganeti3006 from ganeti02 cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1183645 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [12:14:14] (03CR) 10Muehlenhoff: [C:03+2] Remove ganeti3006 from ganeti02 cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1183645 (https://phabricator.wikimedia.org/T402259) (owner: 10Muehlenhoff) [12:21:01] RECOVERY - very high load average likely xfs on ms-be1091 is OK: OK - load average: 71.53, 79.46, 74.60 https://wikitech.wikimedia.org/wiki/Swift [12:22:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.216s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:22:42] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti3006.esams.wmnet [12:22:46] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.3s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:23:03] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ganeti3006.esams.wmnet [12:23:07] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti3006.esams.wmnet [12:23:52] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11136156 (10phaultfinder) [12:25:12] (03CR) 10Brouberol: [C:03+1] Facilitate a role swap between an-mariadb1001 and an-mariadb1002 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183643 (https://phabricator.wikimedia.org/T394498) (owner: 10Btullis) [12:27:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P82307 and previous config saved to /var/cache/conftool/dbconfig/20250901-122704-fceratto.json [12:27:31] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.3s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:27:55] (03CR) 10Brouberol: [C:03+1] "Beautifully explained, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [12:29:00] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11136178 (10phaultfinder) [12:31:31] (03CR) 10Bartosz Dziewoński: "I would prefer to run it soon, so that I can finish this and focus on something else." [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) (owner: 10D3r1ck01) [12:32:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.133s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:32:49] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Batch more queries to speed up the script [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183653 (https://phabricator.wikimedia.org/T398177) [12:33:06] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Skip rows where the performer is 'Global rename script' [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183654 (https://phabricator.wikimedia.org/T398177) [12:33:40] RESOLVED: ProbeDown: Service ganeti3006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:34:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" 
[extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) (owner: 10D3r1ck01) [12:35:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183653 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [12:35:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183654 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [12:38:09] (03CR) 10D3r1ck01: "Ack!" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) (owner: 10D3r1ck01) [12:41:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) (owner: 10Huji) [12:42:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T401906)', diff saved to https://phabricator.wikimedia.org/P82308 and previous config saved to /var/cache/conftool/dbconfig/20250901-124211-fceratto.json [12:42:15] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:42:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.587s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:42:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2194.codfw.wmnet with reason: Maintenance [12:42:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T401906)', diff saved to https://phabricator.wikimedia.org/P82309 and previous config saved to /var/cache/conftool/dbconfig/20250901-124223-fceratto.json [12:43:12] jmm@cumin2002 upgrade-firmware (PID 4105560) is awaiting input [12:46:02] (03CR) 10Vgutierrez: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1183618 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [12:47:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T403362)', diff saved to https://phabricator.wikimedia.org/P82310 and previous config saved to /var/cache/conftool/dbconfig/20250901-124751-ladsgroup.json [12:47:55] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [12:52:45] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [12:52:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [12:56:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T401906)', diff saved to https://phabricator.wikimedia.org/P82311 and previous config saved to /var/cache/conftool/dbconfig/20250901-125602-fceratto.json [12:56:08] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:56:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.666s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:57:00] (03CR) 10Volans: [C:03+1] "No blockers for me. I'll leave it to you." [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 (owner: 10JHathaway) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1300) [13:00:05] huji, hueitan, xSavitar, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:42] o/ [13:00:46] o/ [13:01:09] o/ [13:01:14] I can deploy [13:01:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.044s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:01:24] (though I have a meeting in exactly one hour so I can’t go over the window today) [13:01:27] let’s start with hueitan [13:01:27] Lucas_WMDE: you can deploy huji's patch, then I can go with hueitan's patch or you can deploy that as well. [13:01:33] hi [13:01:36] oh good. [13:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:01:41] oh, sorry, i meant huji actually [13:01:46] as that’s the config change [13:01:48] assuming they’re around [13:02:04] seems not here? 
[13:02:10] let’s start with xSavitar then [13:02:15] and let the backport gate-and-submit in the background [13:02:29] Ack [13:02:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) (owner: 10D3r1ck01) [13:03:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P82312 and previous config saved to /var/cache/conftool/dbconfig/20250901-130259-ladsgroup.json [13:03:37] (03Merged) 10jenkins-bot: SUL3: Use `metawiki` as central wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183216 (https://phabricator.wikimedia.org/T402527) (owner: 10D3r1ck01) [13:03:40] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [13:03:50] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1183216|SUL3: Use `metawiki` as central wiki (T402527)]] [13:03:53] T402527: Stop using loginwiki during SUL3 central login - https://phabricator.wikimedia.org/T402527 [13:03:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet [13:04:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti3006.esams.wmnet [13:04:28] Lucas_WMDE, nothing to test, so you can sync when it's ready. [13:04:36] yup, ok [13:04:58] Hi Lucas_WMDE [13:05:21] Here for patch 1155805 [13:05:28] hi! 
[13:05:34] (03Merged) 10jenkins-bot: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183610 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [13:05:41] oh wow that was a fast merge [13:05:52] (03CR) 10Muehlenhoff: [C:03+2] Blacklist jffs2 [puppet] - 10https://gerrit.wikimedia.org/r/1183072 (owner: 10Muehlenhoff) [13:06:01] ok then it’s hueitan next (once the current config change is done), then hujihuji [13:06:02] It's been a while since I helped with a deployment so I need you to point me to the browser extension that helps me connec tto the specific node that you are deploying to [13:06:10] https://wikitech.wikimedia.org/wiki/WikimediaDebug :) [13:06:19] (03CR) 10Elukey: [C:03+2] profile::base: Pin linux-base's version for Bookworm bpo [puppet] - 10https://gerrit.wikimedia.org/r/1183621 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:06:36] and nowadays you don’t need to pick a specific server anymore, k8s-mwdebug should be enough [13:06:45] (you might remember being asked to pick mwdebug1002, or mwdebug2002, or etc.) [13:07:29] Yes, I am that old ;) [13:07:42] !log lucaswerkmeister-wmde@deploy1003 d3r1ck01, lucaswerkmeister-wmde: Backport for [[gerrit:1183216|SUL3: Use `metawiki` as central wiki (T402527)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:07:51] Seems like you are working with hueitan first, let me take a quick second to install that extension, brb [13:08:09] !log lucaswerkmeister-wmde@deploy1003 d3r1ck01, lucaswerkmeister-wmde: Continuing with sync [13:08:31] ok, all set [13:08:44] i'm away for a bit, i should be back before it's my turn [13:08:57] ok [13:09:00] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11136351 (10elukey) To keep archives happy: Moritz is re-initializing all the maps-test replicas that show the above sign of failure, and we'll likely also bump up max-conns with http... [13:09:08] I think we can probably deploy the changes for hueitan and hujihuji together [13:09:11] they both seem harmless enough [13:09:19] no problem [13:09:41] Let's deploy the hu* patches then [13:09:46] hehe [13:11:06] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: PuppetPendingCertificateRequest (instance puppetmaster1001:9100) - https://phabricator.wikimedia.org/T403388#11136355 (10elukey) 05Open→03Resolved a:03elukey ` elukey@puppetmaster1001:~$ sudo puppet cert destroy sretest2005.co... [13:11:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P82313 and previous config saved to /var/cache/conftool/dbconfig/20250901-131110-fceratto.json [13:11:13] Lucas_WMDE, thank you very much for deploying my patch. 🙏🏽 [13:11:19] np [13:12:56] 10SRE-swift-storage, 06Commons: HTTP 404 / File not found errors for three images in one category - https://phabricator.wikimedia.org/T403314#11136361 (10TheDJ) Missing originals: Strange, these are 2004 files, Considering there were thumbnails of these before, the originals must have been present at some time... 
[13:13:27] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183216|SUL3: Use `metawiki` as central wiki (T402527)]] (duration: 09m 36s) [13:13:30] T402527: Stop using loginwiki during SUL3 central login - https://phabricator.wikimedia.org/T402527 [13:13:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) (owner: 10Huji) [13:14:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [13:14:45] (03Merged) 10jenkins-bot: Enable electionclerk user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155805 (https://phabricator.wikimedia.org/T396347) (owner: 10Huji) [13:15:03] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1183610|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]], [[gerrit:1155805|Enable electionclerk user group on fawiki (T396347)]] [13:15:08] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [13:15:08] T396347: Enable SecurePoll extension and electionclerk user group on fawiki - https://phabricator.wikimedia.org/T396347 [13:15:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) (owner: 10D3r1ck01) [13:15:36] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183653 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:15:40] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/1183654 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:18:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P82314 and previous config saved to /var/cache/conftool/dbconfig/20250901-131807-ladsgroup.json [13:18:26] Lucas_WMDE: it appears to be working [13:18:34] Thanks for merging my change [13:18:47] that’s very early, it says 0% deplyoment progress even on the test servers :P [13:18:54] ok now it jumped to 75% [13:19:07] so I guess you got lucky and hit a server that already had the config change [13:19:36] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [13:20:33] I wish I was always this lucky ;) [13:20:41] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1170543 (owner: 10Cathal Mooney) [13:20:56] !log lucaswerkmeister-wmde@deploy1003 huji, hueitan, lucaswerkmeister-wmde: Backport for [[gerrit:1183610|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]], [[gerrit:1155805|Enable electionclerk user group on fawiki (T396347)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:21:00] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [13:21:01] T396347: Enable SecurePoll extension and electionclerk user group on fawiki - https://phabricator.wikimedia.org/T396347 [13:21:13] aw, they didn’t even get to see the message [13:21:24] would’ve been useful to know for their next deployment [13:21:28] anyway, hueitan, please test :) [13:22:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti3006.esams.wmnet with OS bookworm [13:23:32] hueitan: ^^ [13:24:27] Lucas_WMDE: Things are fine. You can go ahead. [13:24:31] !log lucaswerkmeister-wmde@deploy1003 huji, hueitan, lucaswerkmeister-wmde: Continuing with sync [13:24:34] ok, thanks [13:25:07] (03Merged) 10jenkins-bot: Add caller to maintenance script SQL queries [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183640 (https://phabricator.wikimedia.org/T313900) (owner: 10D3r1ck01) [13:25:26] (03PS1) 10Elukey: profile::amd_gpu: add a flag to deploy firmwares from Bookworm BPO [puppet] - 10https://gerrit.wikimedia.org/r/1183678 (https://phabricator.wikimedia.org/T393948) [13:25:41] Lucas_WMDE my patch is fine, can go ahead [13:26:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P82315 and previous config saved to /var/cache/conftool/dbconfig/20250901-132617-fceratto.json [13:26:28] (03PS1) 10Slyngshede: P:cache::haproxy disallow Wikidata Query Service as UA [puppet] - 10https://gerrit.wikimedia.org/r/1183679 [13:26:31] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Batch more queries to speed up the script [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183653 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:26:32] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Skip rows where the performer is 'Global rename script' [extensions/CentralAuth] 
(wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183654 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:26:48] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#11136405 (10MoritzMuehlenhoff) This update is piggybacked on https://phabricator.wikimedia.org/T402259 [13:27:02] !log Add 15G to prometheus-k8s-dse lv [13:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:17] (03PS1) 10Elukey: Delete profile::python38 [puppet] - 10https://gerrit.wikimedia.org/r/1183680 [13:28:36] (03CR) 10Muehlenhoff: profile::amd_gpu: add a flag to deploy firmwares from Bookworm BPO (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183678 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:29:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1183680 (owner: 10Elukey) [13:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [13:29:57] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183610|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]], [[gerrit:1155805|Enable electionclerk user group on fawiki (T396347)]] (duration: 14m 53s) [13:30:01] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [13:30:01] T396347: Enable SecurePoll extension and electionclerk user group on fawiki - https://phabricator.wikimedia.org/T396347 [13:30:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in 
esams to Bookworm - https://phabricator.wikimedia.org/T382509#11136426 (10MoritzMuehlenhoff) [13:30:31] MatmaRex: ^ fyi [13:30:39] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1183640|Add caller to maintenance script SQL queries (T313900 T398177 T403387)]], [[gerrit:1183653|FixRenameUserLocalLogs: Batch more queries to speed up the script (T398177)]], [[gerrit:1183654|FixRenameUserLocalLogs: Skip rows where the performer is 'Global rename script' (T398177)]] [13:30:45] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [13:30:45] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:30:46] T403387: SQL query did not specify the caller (guessed caller: {caller}): {sql} - https://phabricator.wikimedia.org/T403387 [13:31:34] Lucas_WMDE: my device crashed. So much for being lucky ... [13:31:43] Thanks again for your help! Anything else needed from me? [13:32:11] nope! 
[13:32:11] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:cache::haproxy Allow user-agents with contact information [puppet] - 10https://gerrit.wikimedia.org/r/1183618 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [13:32:24] huji_wmf: you just barely missed the message that would’ve told you that *now* the change was ready for testing :D [13:32:33] just so you know that’s a thing next time ^^ [13:32:47] (03PS1) 10Elukey: Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) [13:33:10] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone_es of es2026.codfw.wmnet onto es2049.codfw.wmnet [13:33:14] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2026 - Depool es2026.codfw.wmnet to then clone it to es2049.codfw.wmnet - fceratto@cumin1002 [13:33:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T403362)', diff saved to https://phabricator.wikimedia.org/P82317 and previous config saved to /var/cache/conftool/dbconfig/20250901-133314-ladsgroup.json [13:33:20] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [13:33:21] (03CR) 10CI reject: [V:04-1] Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:33:30] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [13:33:33] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2026 - Depool es2026.codfw.wmnet to then clone it to es2049.codfw.wmnet - fceratto@cumin1002 [13:33:54] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6816/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:35:32] i'm back [13:35:36] (03PS1) 10Brouberol: mediawiki-dumps-legacy: only keep 3 dumps directories for each wiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183683 (https://phabricator.wikimedia.org/T403401) [13:35:44] Lucas_WMDE: thanks [13:35:52] 10SRE-swift-storage, 06Commons: HTTP 404 / File not found errors for three images in one category - https://phabricator.wikimedia.org/T403314#11136464 (10Pigsonthewing) >>! In T403314#11136361, @TheDJ wrote: > Missing originals I tried to use "Upload a new version of this file" for one of them, with the orig... [13:36:11] (03PS2) 10Elukey: profile::amd_gpu: add a flag to deploy firmwares from Bookworm BPO [puppet] - 10https://gerrit.wikimedia.org/r/1183678 (https://phabricator.wikimedia.org/T393948) [13:36:11] (03PS2) 10Elukey: Delete profile::python38 [puppet] - 10https://gerrit.wikimedia.org/r/1183680 [13:36:11] (03PS2) 10Elukey: Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) [13:36:24] (03CR) 10Elukey: profile::amd_gpu: add a flag to deploy firmwares from Bookworm BPO (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183678 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:36:33] fceratto@cumin1002 clone_es (PID 238179) is awaiting input [13:36:41] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex, d3r1ck01: Backport for [[gerrit:1183640|Add caller to maintenance script SQL queries (T313900 T398177 T403387)]], [[gerrit:1183653|FixRenameUserLocalLogs: Batch more queries to speed up the script (T398177)]], [[gerrit:1183654|FixRenameUserLocalLogs: Skip rows where the performer is 'Global rename script' (T398177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:36:45] there should be nothing to test for these backports (they’re all only in maintenance/), I’ll just do a very quick sanity check that being logged in and logging in still works [13:36:49] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [13:36:49] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:36:50] T403387: SQL query did not specify the caller (guessed caller: {caller}): {sql} - https://phabricator.wikimedia.org/T403387 [13:37:22] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex, d3r1ck01: Continuing with sync [13:40:33] (03PS3) 10Elukey: Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) [13:41:05] MatmaRex: same four batches as before for FixRenamedUserGlobalEditCount --fix, right? 
[13:41:19] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6817/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:41:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T401906)', diff saved to https://phabricator.wikimedia.org/P82319 and previous config saved to /var/cache/conftool/dbconfig/20250901-134125-fceratto.json [13:41:27] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1183678 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:41:29] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:41:32] Lucas_WMDE: yep. thank you [13:41:40] Lucas_WMDE: I have reconfirmed that things are working. I am going to close the Phab task. Have a great rest of the day [13:41:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance [13:41:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T401906)', diff saved to https://phabricator.wikimedia.org/P82320 and previous config saved to /var/cache/conftool/dbconfig/20250901-134148-fceratto.json [13:41:49] * huji_wmf says bye [13:41:54] bye huji_wmf! 
[13:42:20] (03PS4) 10Elukey: Add a new insetup role for ml-k8s hosts to test their GPU [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) [13:42:27] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1183640|Add caller to maintenance script SQL queries (T313900 T398177 T403387)]], [[gerrit:1183653|FixRenameUserLocalLogs: Batch more queries to speed up the script (T398177)]], [[gerrit:1183654|FixRenameUserLocalLogs: Skip rows where the performer is 'Global rename script' (T398177)]] (duration: 11m 48s) [13:42:32] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [13:42:33] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:42:33] T403387: SQL query did not specify the caller (guessed caller: {caller}): {sql} - https://phabricator.wikimedia.org/T403387 [13:42:47] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: foreachwikiindblist sul CentralAuth:FixRenameUserLocalLogs --logwiki=metawiki # T398177 (dry run) [13:43:05] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6818/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:43:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:43:54] !log jmm@cumin2002 START - Cookbook sre.postgresql.postgres-init [13:44:13] (03PS5) 10Elukey: Add a new insetup role for ml-k8s hosts 
to test their GPU [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) [13:44:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3006.esams.wmnet with reason: host reimage [13:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:44:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [13:44:49] !log jmm@cumin2002 START - Cookbook sre.postgresql.postgres-init [13:44:58] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6819/co" [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:45:06] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: only keep 3 dumps directories for each wiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183683 (https://phabricator.wikimedia.org/T403401) (owner: 10Brouberol) [13:45:07] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: CentralAuth:FixRenamedUserGlobalEditCount metawiki --fix --since=20220310000000 --until=20230101000000 # T313900 [13:45:22] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: only keep 3 dumps directories for each wiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1183683 (https://phabricator.wikimedia.org/T403401) (owner: 10Brouberol) [13:45:27] MatmaRex: FixRenameUserLocalLogs already made it through aawiki so it looks like some speedup is happening [13:45:45] (03CR) 10Elukey: [V:03+1] "This is a proposal for a new role that is halfway between insetup and ml-k8s, lemmek now!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:46:13] nice [13:47:22] !log jmm@cumin2002 START - Cookbook sre.postgresql.postgres-init [13:49:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3006.esams.wmnet with reason: host reimage [13:50:53] (03PS1) 10Slyngshede: Revert "P:cache::haproxy Allow user-agents with contact information" [puppet] - 10https://gerrit.wikimedia.org/r/1183687 [13:50:55] > Corrected edit count for 'AndrewGarfieldIsTheBestSpiderMan': from 887 to 756 (-131; 0.85x) [13:51:05] I’m sorry, we’ll have to revert the whole maintenance script run. this is clearly incorrect information /s [13:51:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [13:51:40] 10SRE-swift-storage, 06Commons: HTTP 404 / File not found errors for three images in one category - https://phabricator.wikimedia.org/T403314#11136523 (10Pigsonthewing) I now see that the Marischal College image is showing again. I will try the same steps with the other two. 
[13:52:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [13:53:09] :D [13:56:01] * Lucas_WMDE in a meeting now [13:56:06] so I might not start batch 2 immediately [13:56:12] (but right now batch 1 isn’t done yet) [13:56:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T401906)', diff saved to https://phabricator.wikimedia.org/P82321 and previous config saved to /var/cache/conftool/dbconfig/20250901-135626-fceratto.json [13:56:30] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:57:29] (03CR) 10Slyngshede: [C:03+2] Revert "P:cache::haproxy Allow user-agents with contact information" [puppet] - 10https://gerrit.wikimedia.org/r/1183687 (owner: 10Slyngshede) [13:59:05] (03PS1) 10Fabfur: team-traffic: raise haproxykafka alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668) [13:59:20] (03PS27) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1183117 (https://phabricator.wikimedia.org/T402611) [13:59:20] (03CR) 10Arnaudb: [C:03+2] "+2 to test on gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1183117 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [13:59:42] (03PS1) 10Arnaudb: Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1183690 [14:00:24] (03PS1) 10Stevemunene: dse-k8s: disable cluster_dns to allow core-dns deploy. 
[puppet] - 10https://gerrit.wikimedia.org/r/1183691 (https://phabricator.wikimedia.org/T397298) [14:00:40] (03CR) 10CI reject: [V:04-1] team-traffic: raise haproxykafka alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [14:02:06] “Done, corrected 8757 edit counts” [14:02:14] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: CentralAuth:FixRenamedUserGlobalEditCount metawiki --fix --since=20230101000000 --until=20240101000000 # T313900 [14:02:17] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [14:03:45] (03PS1) 10Huei Tan: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) [14:03:49] (03PS1) 10Vgutierrez: Revert^2 "P:cache::haproxy Allow user-agents with contact information" [puppet] - 10https://gerrit.wikimedia.org/r/1183693 [14:04:57] (03PS2) 10Vgutierrez: Revert^2 "P:cache::haproxy Allow user-agents with contact information" [puppet] - 10https://gerrit.wikimedia.org/r/1183693 (https://phabricator.wikimedia.org/T400119) [14:05:06] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183693 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:07:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [14:08:00] (03CR) 10Slyngshede: Revert^2 "P:cache::haproxy Allow user-agents with contact information" (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1183693 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:08:35] (03CR) 10Fabfur: Revert^2 "P:cache::haproxy Allow user-agents with contact information" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1183693 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:08:48] (03PS3) 10Vgutierrez: Revert^2 "P:cache::haproxy Allow user-agents with contact information" [puppet] - 10https://gerrit.wikimedia.org/r/1183693 (https://phabricator.wikimedia.org/T400119) [14:09:02] (03CR) 10Vgutierrez: Revert^2 "P:cache::haproxy Allow user-agents with contact information" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1183693 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:10:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3006.esams.wmnet with OS bookworm [14:11:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P82322 and previous config saved to /var/cache/conftool/dbconfig/20250901-141133-fceratto.json [14:12:04] jouncebot: nowandnext [14:12:04] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [14:12:04] In 0 hour(s) and 17 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1430) [14:12:30] (03CR) 10Arnaudb: [C:03+2] "tests done" [puppet] - 10https://gerrit.wikimedia.org/r/1183690 (owner: 10Arnaudb) [14:12:56] (03CR) 10CI reject: [V:04-1] Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [14:13:35] (03PS1) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1183698 [14:13:38] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1183693 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:13:55] (03CR) 10Fabfur: [C:03+1] Revert^2 "P:cache::haproxy Allow user-agents with contact information" [puppet] - 10https://gerrit.wikimedia.org/r/1183693 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:15:09] (03PS2) 10Fabfur: team-traffic: raise haproxykafka alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668) [14:17:01] (03CR) 10CI reject: [V:04-1] team-traffic: raise haproxykafka alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [14:17:16] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:17:31] (03CR) 10Vgutierrez: [C:03+2] Revert^2 "P:cache::haproxy Allow user-agents with contact information" [puppet] - 10https://gerrit.wikimedia.org/r/1183693 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:17:56] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:17:58] (03CR) 10Huei Tan: "recheck" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [14:18:10] FIRING: BFDdown: BFD session down between cr2-esams and 185.15.59.144 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:20:22] “Done, corrected 10639 edit counts” [14:20:49] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: CentralAuth:FixRenamedUserGlobalEditCount metawiki --fix --since=20240101000000 --until=20250101000000 # T313900 [14:20:52] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [14:22:59] (03PS3) 10Fabfur: team-traffic: raise haproxykafka alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668) [14:23:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:24:29] (03CR) 10CI reject: [V:04-1] team-traffic: raise haproxykafka alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [14:25:13] (03CR) 10Huei Tan: "recheck" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [14:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:26:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P82323 and previous config saved to /var/cache/conftool/dbconfig/20250901-142641-fceratto.json [14:28:40] (03CR) 10Brouberol: "Could you hadd a `Hosts` header, so we could see the PCC diff?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1183691 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [14:29:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-esams (185.15.59.145) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1430) [14:32:50] !log dreamyjazz Deployed security patch for T403289 [14:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [14:36:56] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:37:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:38:06] “Done, corrected 13028 edit counts” [14:38:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:38:27] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: CentralAuth:FixRenamedUserGlobalEditCount metawiki --fix --since=20250101000000 # T313900 [14:38:30] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [14:39:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-esams (185.15.59.145) - group Confed_esams - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:40:38] (03PS2) 10Stevemunene: dse-k8s: disable cluster_dns to allow core-dns deploy. [puppet] - 10https://gerrit.wikimedia.org/r/1183691 (https://phabricator.wikimedia.org/T397298) [14:40:45] (03PS4) 10Fabfur: team-traffic: raise haproxykafka alert thresholds [alerts] - 10https://gerrit.wikimedia.org/r/1183689 (https://phabricator.wikimedia.org/T370668) [14:40:45] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11136677 (10ayounsi) p:05Triage→03Low [14:41:02] (03PS1) 10Krinkle: Disable wmgUseMdotRouting on testwiki in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183700 (https://phabricator.wikimedia.org/T401595) [14:41:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T401906)', diff saved to https://phabricator.wikimedia.org/P82324 and previous config saved to /var/cache/conftool/dbconfig/20250901-144148-fceratto.json [14:41:52] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:42:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2227.codfw.wmnet with reason: Maintenance [14:42:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T401906)', diff saved to https://phabricator.wikimedia.org/P82325 and previous config saved to /var/cache/conftool/dbconfig/20250901-144211-fceratto.json [14:42:54] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183691 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [14:43:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 185.15.59.145 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:43:41] RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:50:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Automate PDU Deployment Process - https://phabricator.wikimedia.org/T403173#11136703 (10LSobanski) p:05Triage→03Medium @Jclark-ctr please let I/F know when a new PDU arrives so that the traffic can be analyzed. [14:50:59] “Done, corrected 10497 edit counts” [14:55:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T401906)', diff saved to https://phabricator.wikimedia.org/P82326 and previous config saved to /var/cache/conftool/dbconfig/20250901-145548-fceratto.json [14:55:52] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:56:46] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#11136718 (10ayounsi) p:05High→03Medium Lowering the priority as this has been working fine. [15:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:05:05] (03PS3) 10Stevemunene: dse-k8s: disable cluster_dns to allow core-dns deploy. 
[puppet] - 10https://gerrit.wikimedia.org/r/1183691 (https://phabricator.wikimedia.org/T397298) [15:07:30] (03PS1) 10Nik Gkountas: ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) [15:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:09] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183691 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [15:10:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P82327 and previous config saved to /var/cache/conftool/dbconfig/20250901-151056-fceratto.json [15:17:48] (03PS1) 10Muehlenhoff: Assign ganeti_routed role to ganeti3006 and configure cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1183704 [15:17:50] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1237.eqiad.wmnet with reason: Maintenance [15:17:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1237 (T403362)', diff saved to https://phabricator.wikimedia.org/P82332 and previous config saved to /var/cache/conftool/dbconfig/20250901-151757-ladsgroup.json [15:18:01] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [15:19:23] !log installing luajit security updates [15:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:40] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11136788 (10DavidBrooks) Re the comment: "Allow 
user-agents with contact information" - implies blocking UAs with no contact information. Is this referring... [15:25:27] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11136791 (10Vgutierrez) >>! In T400119#11136788, @DavidBrooks wrote: > Re the comment: "Allow user-agents with contact information" - implies blocking UAs... [15:26:01] (03PS2) 10Ayounsi: Assign ganeti_routed role to ganeti3006 and configure cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1183704 (owner: 10Muehlenhoff) [15:26:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P82333 and previous config saved to /var/cache/conftool/dbconfig/20250901-152603-fceratto.json [15:28:15] (03PS3) 10Ayounsi: Assign ganeti_routed role to ganeti3006 and configure cluster in esams [puppet] - 10https://gerrit.wikimedia.org/r/1183704 (owner: 10Muehlenhoff) [15:28:23] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1183704 (owner: 10Muehlenhoff) [15:28:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:36] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:30:05] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1530). Please do the needful. 
[15:37:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1183681 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [15:40:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1237 (T403362)', diff saved to https://phabricator.wikimedia.org/P82334 and previous config saved to /var/cache/conftool/dbconfig/20250901-154028-ladsgroup.json [15:40:32] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [15:41:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T401906)', diff saved to https://phabricator.wikimedia.org/P82335 and previous config saved to /var/cache/conftool/dbconfig/20250901-154111-fceratto.json [15:41:14] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:41:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance [15:48:48] 06SRE, 06Wikimedia Enterprise: Provide auth-less access to Enterprise APIs from WMF Analytics cluster - https://phabricator.wikimedia.org/T403298#11136889 (10Urbanecm) >>! In T403298#11135615, @JMeybohm wrote: > Please keep in mind that allowing the HTTP proxy IPs will ultimately allow Enterprise API access fr... 
[15:55:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1237', diff saved to https://phabricator.wikimedia.org/P82336 and previous config saved to /var/cache/conftool/dbconfig/20250901-155535-ladsgroup.json [16:10:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1237', diff saved to https://phabricator.wikimedia.org/P82337 and previous config saved to /var/cache/conftool/dbconfig/20250901-161043-ladsgroup.json [16:18:38] (03CR) 10Abijeet Patro: Setup tracking for CentralNotice banners experiment for WE2.1.1 (031 comment) [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [16:20:01] (03PS1) 10Filippo Giunchedi: java: add support for Trixie / Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1183707 (https://phabricator.wikimedia.org/T403154) [16:23:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2205.codfw.wmnet with reason: Maintenance [16:25:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1237 (T403362)', diff saved to https://phabricator.wikimedia.org/P82338 and previous config saved to /var/cache/conftool/dbconfig/20250901-162552-ladsgroup.json [16:25:56] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [16:26:08] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [16:28:54] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11137032 (10phaultfinder) [16:33:57] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11137048 (10phaultfinder) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1700) [17:00:05] ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T1700). Please do the needful. [17:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:17:55] (03CR) 10KartikMistry: [C:03+1] ContentTranslation: Add cxserver host for server-side requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183703 (https://phabricator.wikimedia.org/T386131) (owner: 10Nik Gkountas) [17:18:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.206s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:43:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.3s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:59:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [17:59:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [18:09:51] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [18:09:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2186 (T403362)', diff saved to https://phabricator.wikimedia.org/P82339 and previous config saved to /var/cache/conftool/dbconfig/20250901-180958-ladsgroup.json [18:10:01] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [18:11:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:21:49] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:24:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) (owner: 10NMW03) [18:30:44] (03PS3) 10NMW03: Add rights to 
bypass spam blacklists for azwiki sysops and interface-admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175222 (https://phabricator.wikimedia.org/T400428) [18:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [18:34:27] 10SRE-swift-storage, 06Commons: HTTP 404 / File not found errors for three images in one category - https://phabricator.wikimedia.org/T403314#11137316 (10MatthewVernon) 05Open→03Resolved a:03Pigsonthewing Thanks! [18:43:51] (03PS2) 10Huei Tan: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) [18:43:57] (03CR) 10Huei Tan: Setup tracking for CentralNotice banners experiment for WE2.1.1 (031 comment) [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1183692 (https://phabricator.wikimedia.org/T402496) (owner: 10Huei Tan) [19:04:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137337 (10VRiley-WMF) Starting on ms-be1083 [19:04:34] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137338 (10VRiley-WMF) 05Open→03In progress [19:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:10:00] !log ladsgroup@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db2186 (T403362)', diff saved to https://phabricator.wikimedia.org/P82340 and previous config saved to /var/cache/conftool/dbconfig/20250901-190959-ladsgroup.json [19:10:03] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [19:14:03] (03CR) 10Urbanecm: [C:03+1] "functionally, LGTM. let's wait for the sync on Tuesday to double check we want to go ahead." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [19:23:44] !log cr1-esams> request chassis fpc slot 1 offline - T403360 [19:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:51] T403360: FPC1 Failure on cr1-esams - take 2 - https://phabricator.wikimedia.org/T403360 [19:25:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2186', diff saved to https://phabricator.wikimedia.org/P82341 and previous config saved to /var/cache/conftool/dbconfig/20250901-192507-ladsgroup.json [19:40:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2186', diff saved to https://phabricator.wikimedia.org/P82342 and previous config saved to /var/cache/conftool/dbconfig/20250901-194014-ladsgroup.json [19:55:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2186 (T403362)', diff saved to https://phabricator.wikimedia.org/P82345 and previous config saved to /var/cache/conftool/dbconfig/20250901-195522-ladsgroup.json [19:55:26] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [19:55:38] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2191.codfw.wmnet with reason: Maintenance [19:55:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2191 (T403362)', diff saved to https://phabricator.wikimedia.org/P82346 and previous config saved to 
/var/cache/conftool/dbconfig/20250901-195545-ladsgroup.json [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T2000). [20:00:05] Nemoralis: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] o/ [20:12:29] anyone? [20:19:44] how many deployers does it take to change a config... (more than five apparently) [20:24:33] (03PS1) 10Hokwelum: Set $wgPHPSessionHandling to 'disable' on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183741 (https://phabricator.wikimedia.org/T362324) [20:31:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [20:32:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [20:33:59] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11137429 (10phaultfinder) [20:34:29] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137431 (10VRiley-WMF) [20:37:16] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137433 (10VRiley-WMF) ms-be1083 has been completed. 
moving onto ms-be1084 [20:38:59] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11137435 (10phaultfinder) [20:39:01] (03CR) 10D3r1ck01: [C:03+1] Set $wgPHPSessionHandling to 'disable' on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183741 (https://phabricator.wikimedia.org/T362324) (owner: 10Hokwelum) [20:46:35] 06SRE, 10DNS, 06Traffic, 10WikiLearn: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#11137439 (10Ijon) 05Open→03Declined Thanks for the ping. We are indeed resolving it by using an address in learn.wiki. This ticket can be closed. [20:55:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2191 (T403362)', diff saved to https://phabricator.wikimedia.org/P82347 and previous config saved to /var/cache/conftool/dbconfig/20250901-205511-ladsgroup.json [20:55:15] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T2100). 
[21:01:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:19] (03CR) 10Bartosz Dziewoński: [C:03+1] Set $wgPHPSessionHandling to 'disable' on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183741 (https://phabricator.wikimedia.org/T362324) (owner: 10Hokwelum) [21:07:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1183741 (https://phabricator.wikimedia.org/T362324) (owner: 10Hokwelum) [21:10:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2191', diff saved to https://phabricator.wikimedia.org/P82348 and previous config saved to /var/cache/conftool/dbconfig/20250901-211019-ladsgroup.json [21:14:59] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403431 (10phaultfinder) 03NEW [21:25:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2191', diff saved to https://phabricator.wikimedia.org/P82350 and previous config saved to /var/cache/conftool/dbconfig/20250901-212526-ladsgroup.json [21:29:36] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [21:40:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2191 (T403362)', diff saved to https://phabricator.wikimedia.org/P82351 and previous config saved to /var/cache/conftool/dbconfig/20250901-214034-ladsgroup.json [21:40:38] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [21:40:50] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2196.codfw.wmnet with reason: Maintenance [21:40:58] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2196 (T403362)', diff saved to https://phabricator.wikimedia.org/P82352 and previous config saved to /var/cache/conftool/dbconfig/20250901-214057-ladsgroup.json [21:44:36] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:56:52] PROBLEM - mysqld processes on es2026 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [21:57:14] PROBLEM - MariaDB read only es2 on es2026 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [21:58:14] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137490 (10VRiley-WMF) [21:58:36] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137491 (10VRiley-WMF) ms-be1084 completed. 
Moving onto ms-be1085 [22:32:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [22:38:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2196 (T403362)', diff saved to https://phabricator.wikimedia.org/P82353 and previous config saved to /var/cache/conftool/dbconfig/20250901-223807-ladsgroup.json [22:38:11] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [22:41:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137536 (10VRiley-WMF) 05In progress→03Open [22:41:14] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11137539 (10VRiley-WMF) ms-be1085 is completed [22:53:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2196', diff saved to https://phabricator.wikimedia.org/P82354 and previous config saved to /var/cache/conftool/dbconfig/20250901-225314-ladsgroup.json [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250901T2300) [23:04:36] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:08:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2196', diff saved to https://phabricator.wikimedia.org/P82355 and previous config saved to 
/var/cache/conftool/dbconfig/20250901-230822-ladsgroup.json [23:23:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2196 (T403362)', diff saved to https://phabricator.wikimedia.org/P82356 and previous config saved to /var/cache/conftool/dbconfig/20250901-232330-ladsgroup.json [23:23:34] T403362: Change row format of cx_corpora - https://phabricator.wikimedia.org/T403362 [23:23:46] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [23:38:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1183749 [23:38:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1183749 (owner: 10TrainBranchBot) [23:52:50] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1183749 (owner: 10TrainBranchBot)