[00:00:17] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227
[00:00:18] inflatador: separately but relatedly I found an incorrect entry in the eqiad cirrus config, patch here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171307
[00:00:22] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[00:00:49] ryankemper it was cirrussearch1122.eqiad.wmnet. I just stopped it, let's see if that helps
[00:01:36] inflatador: do you remember was it codfw or eqiad where we saw this issue last time? trying to figure out if we just hadn't done a full rolling restart of eqiad yet post-upgrade, or if we have and this issue will keep resurfacing
[00:01:47] (PS6) Zabe: Set categorylinks to read new on most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912)
[00:02:13] it was in eqiad, and stopping cirrussearch1122.eqiad.wmnet didn't seem to fix the problem
[00:02:41] inflatador: I'll try another master
[00:02:44] I just started it again, let's try stopping one master at a time...if that doesn't fix it, we'll start it again
[00:02:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[00:03:15] ryankemper sounds good, let me know which one you're stopping
[00:06:33] inflatador: haven't found a good candidate yet
[00:06:37] ryankemper I started https://etherpad.wikimedia.org/p/503-eqiad-cirrussearch
[00:07:52] looks like we have some split brain, let's see which host or hosts has the latest cluster
state. I think it'll be in those type of log messages
[00:07:54] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1171308
[00:07:54] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1171308 (owner: TrainBranchBot)
[00:09:30] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 60,
[00:09:30] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 11969, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:09:30] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1082 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 60,
[00:09:30] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 11971, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:09:30] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1090 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0,
delayed_unassigned_shards: 0, number_of_pending_tasks: 60,
[00:09:31] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 11987, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:09:31] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 60,
[00:09:31] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 11980, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:09:32] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1083 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 60,
[00:09:45] !log stop opensearch on `cirrussearch1081`
[00:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:10] ryankemper nice, you found the right one!
[00:12:30] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch inactive shards 2958 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1406, active_shards: 1406, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2958, delayed_unassigned_s
[00:12:30] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21814848762603 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:12:30] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch inactive shards 2958 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1406, active_shards: 1406, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2958, delayed_unassigned_s
[00:12:30] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21814848762603 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:12:30] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch inactive shards 2958 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1406, active_shards: 1406, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2958, delayed_unassigned_s
[00:12:30] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21814848762603
https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:12:31] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch inactive shards 2958 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1406, active_shards: 1406, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2958, delayed_unassigned_s
[00:12:31] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21814848762603 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:12:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch inactive shards 2958 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1406, active_shards: 1406, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2958, delayed_unassigned_s
[00:12:32] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21814848762603 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:12:41] inflatador: I was operating on the theory that we erroneously had 4 masters instead of 5 but it's only the omega cluster that has the wrong config not the main one
[00:12:55] Let's make sure the Search Update Pipeline recovers as well
[00:12:58] but yeah the one i stopped hadn't yet been restarted
[00:14:54] Interesting...we may need to update the docs then.
We may need to take a closer look at eqiad in general
[00:16:36] (CR) Ryan Kemper: [C:+1] cirrus: fix typo [puppet] - https://gerrit.wikimedia.org/r/1171307 (owner: Ryan Kemper)
[00:16:38] (CR) Ryan Kemper: [C:+2] cirrus: fix typo [puppet] - https://gerrit.wikimedia.org/r/1171307 (owner: Ryan Kemper)
[00:17:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[00:19:09] RESOLVED: SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1113:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:21:46] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1081 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:22:16] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:23:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1081-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[00:23:39] !log amastilovic@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[00:24:09] FIRING: [2x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on
cirrussearch1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:26:42] !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch1081.eqiad.wmnet with reason: bad cluster node/MSS issue
[00:27:50] (PS1) BCornwall: cirrus: Fix another typo [puppet] - https://gerrit.wikimedia.org/r/1171309
[00:28:17] (CR) CI reject: [V:-1] cirrus: Fix another typo [puppet] - https://gerrit.wikimedia.org/r/1171309 (owner: BCornwall)
[00:28:18] ryankemper: I found another cirrus typo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171309
[00:29:06] (PS2) BCornwall: cirrus: Fix another typo [puppet] - https://gerrit.wikimedia.org/r/1171309
[00:29:09] RESOLVED: SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1113:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:29:53] (CR) Bking: [C:+2] cirrus: Fix another typo [puppet] - https://gerrit.wikimedia.org/r/1171309 (owner: BCornwall)
[00:30:05] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1171308 (owner: TrainBranchBot)
[00:30:29] brett ryankemper just merged ^^
[00:30:35] gracias
[00:30:57] no, thank you!
[00:31:24] brett: great catch!
[00:31:30] while I have you ;) ... do you happen to know anything about the MSS alert that's in #traffic?
It fired on the same host that just blew up
[00:31:44] https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[00:31:56] it seemed like it was ncredir related, but lemme look
[00:32:09] oh, no, why was I thinking that
[00:33:57] yeah, I'm not sure, I haven't been exposed much to those issues :(
[00:33:58] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad
[00:36:55] not sure what the root cause of the earlier cirrussearch outage was - doesn't seem related to any sort of MSS clamping?
[00:40:02] brett I doubt it's related, but just trying to cover the bases. If it was a network issue we'd have more general alerts and I can't find any
[00:42:14] anyway, no need to spend too much time digging..we'll look at it more closely tomorrow
[00:46:40] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/4a13d565dc9d235e14ee42cb66387acd98e894a439915ec87108168e23d25fc1/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:48:38] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s
[00:48:38] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:51:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed -
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:52:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow2003 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[01:06:40] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:07:52] (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.11 [core] (wmf/1.45.0-wmf.11) - https://gerrit.wikimedia.org/r/1171310 (https://phabricator.wikimedia.org/T396372)
[01:07:54] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.11 [core] (wmf/1.45.0-wmf.11) - https://gerrit.wikimedia.org/r/1171310 (https://phabricator.wikimedia.org/T396372) (owner: TrainBranchBot)
[01:21:09] (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.11 [core] (wmf/1.45.0-wmf.11) - https://gerrit.wikimedia.org/r/1171310 (https://phabricator.wikimedia.org/T396372) (owner: TrainBranchBot)
[01:41:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T0200)
[02:11:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet -
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:21:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:31:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:49:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:58:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T0300)
[03:01:57] (PS1) TrainBranchBot: testwikis to 1.45.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1171314 (https://phabricator.wikimedia.org/T396372)
[03:01:59] (CR) TrainBranchBot: [C:+2] testwikis to 1.45.0-wmf.11 [mediawiki-config] -
https://gerrit.wikimedia.org/r/1171314 (https://phabricator.wikimedia.org/T396372) (owner: TrainBranchBot)
[03:02:50] (Merged) jenkins-bot: testwikis to 1.45.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1171314 (https://phabricator.wikimedia.org/T396372) (owner: TrainBranchBot)
[03:03:14] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.11 refs T396372
[03:03:19] T396372: 1.45.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T396372
[03:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:38:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:49:05] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.11 refs T396372 (duration: 45m 51s)
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T0400)
[04:02:00] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.8 (duration: 01m 50s)
[04:06:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[04:11:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 -
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[04:12:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[04:18:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[04:41:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[04:52:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow2003 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:26:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:27:42] (PS1) Clare Ming: xLab: Deploy v0.7.9 release to staging [deployment-charts] -
https://gerrit.wikimedia.org/r/1171319 (https://phabricator.wikimedia.org/T397363)
[05:46:17] (CR) Arnaudb: [C:+1] microsites: update recipient email for home dir size warning mails [puppet] - https://gerrit.wikimedia.org/r/1171260 (https://phabricator.wikimedia.org/T343364) (owner: Dzahn)
[05:58:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T0600)
[06:00:05] marostegui, Amir1, and federico3: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T0600).
[06:10:27] (CR) Clare Ming: [C:+2] xLab: Deploy v0.7.9 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1171319 (https://phabricator.wikimedia.org/T397363) (owner: Clare Ming)
[06:12:24] (Merged) jenkins-bot: xLab: Deploy v0.7.9 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1171319 (https://phabricator.wikimedia.org/T397363) (owner: Clare Ming)
[06:13:31] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[06:14:09] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[06:21:10] FIRING: BFDdown: BFD session down between cr2-eqord and 208.80.154.208 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:26:10] RESOLVED: BFDdown: BFD session down between cr2-eqord and 208.80.154.208 -
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:28:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:32:48] RESOLVED: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate
[06:34:40] (CR) Ayounsi: [C:+2] Routed Ganeti: also permit anycast to be advertised from VMs [homer/public] - https://gerrit.wikimedia.org/r/1171236 (https://phabricator.wikimedia.org/T362392) (owner: Ayounsi)
[06:35:16] (Merged) jenkins-bot: Routed Ganeti: also permit anycast to be advertised from VMs [homer/public] - https://gerrit.wikimedia.org/r/1171236 (https://phabricator.wikimedia.org/T362392) (owner: Ayounsi)
[06:35:51] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[06:35:57] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[06:43:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:48:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:49:19] !log
marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[06:49:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:49:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:53:20] (CR) Marostegui: [C:+1] installserver: Prepare dbprov1007, dbprov2007 for reimage [puppet] - https://gerrit.wikimedia.org/r/1171245 (https://phabricator.wikimedia.org/T399040) (owner: Jcrespo)
[06:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:54:24] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.makevm for new host netflow2004.codfw.wmnet
[06:54:26] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[06:54:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:54:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2035 T399927', diff saved to https://phabricator.wikimedia.org/P79550 and previous config saved to /var/cache/conftool/dbconfig/20250722-065454-root.json
[06:54:59] T399927: Transition codfw data persistence external
storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927
[06:55:10] (PS1) Giuseppe Lavagetto: Bugfix on rename operations [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1171463
[06:55:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2035.codfw.wmnet with reason: Maintenance
[06:55:31] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Bugfix on rename operations [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1171463 (owner: Giuseppe Lavagetto)
[06:57:46] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow2004.codfw.wmnet - ayounsi@cumin1003"
[06:58:12] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow2004.codfw.wmnet - ayounsi@cumin1003"
[06:58:13] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:58:13] !log ayounsi@cumin1003 START - Cookbook sre.dns.wipe-cache netflow2004.codfw.wmnet on all recursors
[06:58:16] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow2004.codfw.wmnet on all recursors
[06:58:29] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11022661 (Marostegui) @Jhancock.wm es2035 is ready
[06:58:44] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow2004.codfw.wmnet - ayounsi@cumin1003"
[06:58:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow2004.codfw.wmnet - ayounsi@cumin1003"
[06:59:10] !log
ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host netflow2004.codfw.wmnet with OS bookworm
[07:00:04] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:02:03] (PS1) Marostegui: s1 codfw: Move to SBR [puppet] - https://gerrit.wikimedia.org/r/1171465
[07:05:33] (Abandoned) Ayounsi: reimage: merge UUID and MAC [cookbooks] - https://gerrit.wikimedia.org/r/1164446 (owner: Ayounsi)
[07:09:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:11:44] SRE, Hiddenparma, Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119 (Joe) NEW
[07:11:55] SRE, Hiddenparma, Traffic: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11022677 (Joe) p:Triage→High a:Joe→None
[07:24:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:24:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [07:26:21] 06SRE, 10Hiddenparma, 06Traffic: Better mapping of requests coming from datacenters/clouds - https://phabricator.wikimedia.org/T400120 (10Joe) 03NEW [07:26:39] (03CR) 10Jelto: [C:03+1] microsites: update recipient email for home dir size warning mails [puppet] - 10https://gerrit.wikimedia.org/r/1171260 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [07:29:05] (03PS2) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [07:31:39] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [07:40:16] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host netflow2004.codfw.wmnet with OS bookworm [07:40:16] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netflow2004.codfw.wmnet [07:46:46] (03PS1) 10Arnaudb: Gitlab: switchover between gitlab-replica-a and gitlab-replica-b [dns] - 10https://gerrit.wikimedia.org/r/1171537 (https://phabricator.wikimedia.org/T400121) [07:49:16] (03PS5) 10Tiziano Fogli: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [07:50:40] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host netflow2004.codfw.wmnet with OS bookworm [07:51:06] (03PS3) 
10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [07:52:35] (03PS1) 10Arnaudb: Gitlab: switchover between gitlab-replica-a and gitlab-replica-b [puppet] - 10https://gerrit.wikimedia.org/r/1171539 (https://phabricator.wikimedia.org/T400121) [07:53:20] !log dcaro@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bullseye [07:53:35] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11022744 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1003 for host cloudcephosd1006.... [07:53:47] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [07:54:09] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:59:09] RESOLVED: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:03:12] (03PS6) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [08:06:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:07:50] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow2004.codfw.wmnet with reason: host reimage [08:10:25] !log Ran fixStuckGlobalRename.php for T400117 [08:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:31] T400117: Unblock stuck global rename of Ugo Cavitte - https://phabricator.wikimedia.org/T400117 [08:13:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow2004.codfw.wmnet with reason: host reimage [08:22:42] FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:23:08] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [08:23:33] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:24:11] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11022792 (10cmooney) Some signs of progress, link has now been stable for over 8 hours: ` Jul 22 01:00:17 re0.cr1-codfw mib2d[38982]: SNMP_TRAP_LINK_UP: ifInde... 
[08:26:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:27:08] (03CR) 10Jcrespo: [C:03+2] installserver: Prepare dbprov1007, dbprov2007 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1171245 (https://phabricator.wikimedia.org/T399040) (owner: 10Jcrespo) [08:27:14] (03PS3) 10Jcrespo: installserver: Prepare dbprov1007, dbprov2007 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1171245 (https://phabricator.wikimedia.org/T399040) [08:27:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:27:42] RESOLVED: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:28:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2152.codfw.wmnet with reason: Maintenance [08:28:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T399728)', diff saved to https://phabricator.wikimedia.org/P79551 and previous config saved to /var/cache/conftool/dbconfig/20250722-082819-fceratto.json [08:28:24] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [08:31:10] (03CR) 10Vgutierrez: traffic: new alerts for haproxykafka (035 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [08:32:15] RESOLVED: 
WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:32:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T399728)', diff saved to https://phabricator.wikimedia.org/P79552 and previous config saved to /var/cache/conftool/dbconfig/20250722-083215-fceratto.json [08:32:55] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow2004.codfw.wmnet with OS bookworm [08:33:23] (03CR) 10Jcrespo: [C:03+2] installserver: Prepare dbprov1007, dbprov2007 for reimage [puppet] - 10https://gerrit.wikimedia.org/r/1171245 (https://phabricator.wikimedia.org/T399040) (owner: 10Jcrespo) [08:33:40] !log dcaro@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1006.eqiad.wmnet with OS bullseye [08:33:50] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11022828 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1003 for host cloudcephosd1006.eqia... [08:35:48] !log dcaro@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bullseye [08:36:00] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11022829 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1003 for host cloudceph... [08:37:23] RESOLVED: GoRoutinesTooHigh: gNMIc running on netflow2003 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [08:38:23] FIRING: GnmiTargetDown: fasw2-c8b-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:39:09] FIRING: [20x] GnmiTargetDown: fasw2-c8b-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:39:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:47:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P79553 and previous config saved to /var/cache/conftool/dbconfig/20250722-084722-fceratto.json [08:47:36] (03CR) 10Vgutierrez: "kinda new, at least on ::1 I can run varnish tests twice without manually deleting the container" [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [08:48:23] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11022860 (10fnegri) cloudcephosd1006 was reimaged again on 2025-07-21, but this time //without// keeping the data. Th... 
[08:49:22] (03CR) 10Majavah: [C:03+1] Remove ldap-admins from cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/1169633 (owner: 10Muehlenhoff) [08:49:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:50:06] (03CR) 10Stevemunene: [C:03+2] druid: Add new an-druid100[67] to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1171207 (https://phabricator.wikimedia.org/T397440) (owner: 10Stevemunene) [08:51:38] (03CR) 10Majavah: [C:03+1] cloudweb: Restrict access to Envoy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [08:51:54] !log dcaro@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [08:55:46] !log dcaro@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [08:58:23] RESOLVED: GnmiTargetDown: fasw2-c8b-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:59:09] RESOLVED: [20x] GnmiTargetDown: fasw2-c8b-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:59:21] (03PS4) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [09:01:40] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 
10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [09:02:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P79554 and previous config saved to /var/cache/conftool/dbconfig/20250722-090230-fceratto.json [09:04:31] (03PS5) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [09:05:03] (03PS7) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [09:05:32] (03CR) 10Fabfur: traffic: new alerts for haproxykafka (035 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [09:05:44] (03PS6) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [09:06:13] (03CR) 10CI reject: [V:04-1] traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [09:07:59] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [09:11:41] (03PS7) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [09:13:57] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [09:15:25] (03PS1) 10Tiziano Fogli: prom/metamon: add a dedicated sysuser for the daemons [puppet] - 10https://gerrit.wikimedia.org/r/1171546 
(https://phabricator.wikimedia.org/T397003) [09:15:32] PROBLEM - Druid historical on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:15:52] (03CR) 10CI reject: [V:04-1] prom/metamon: add a dedicated sysuser for the daemons [puppet] - 10https://gerrit.wikimedia.org/r/1171546 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [09:16:55] (03PS2) 10Tiziano Fogli: prom/metamon: add a dedicated sysuser for the daemons [puppet] - 10https://gerrit.wikimedia.org/r/1171546 (https://phabricator.wikimedia.org/T397003) [09:17:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T399728)', diff saved to https://phabricator.wikimedia.org/P79555 and previous config saved to /var/cache/conftool/dbconfig/20250722-091737-fceratto.json [09:17:42] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:17:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2154.codfw.wmnet with reason: Maintenance [09:18:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T399728)', diff saved to https://phabricator.wikimedia.org/P79556 and previous config saved to /var/cache/conftool/dbconfig/20250722-091800-fceratto.json [09:18:36] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: upgrade mariadb [09:19:22] 06SRE, 06Traffic, 07affects-Kiwix-and-openZIM: Rate limiting/status code 429 for mwclient? - https://phabricator.wikimedia.org/T400018#11022917 (10Joe) 05Open→03Invalid The task is invalid as the bot was indeed using a user-agent that doesn't respect our UA policy., which has been in place since 2010... 
[09:20:02] (03CR) 10Tiziano Fogli: "The patch is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/1171546 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [09:20:29] (03PS8) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [09:20:40] (03PS8) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [09:20:43] (03PS1) 10Jcrespo: Upgrade db1240 to MariaDB package 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171547 (https://phabricator.wikimedia.org/T394487) [09:20:59] (03PS2) 10Jcrespo: Upgrade db1240 to MariaDB package 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171547 (https://phabricator.wikimedia.org/T394487) [09:22:15] (03CR) 10Jcrespo: [C:03+2] Upgrade db1240 to MariaDB package 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171547 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [09:22:55] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [09:23:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T399728)', diff saved to https://phabricator.wikimedia.org/P79558 and previous config saved to /var/cache/conftool/dbconfig/20250722-092306-fceratto.json [09:23:11] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:23:42] (03PS9) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [09:25:55] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 
(https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [09:27:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and NTT (129.250.204.5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:31:26] (03PS3) 10Cathal Mooney: sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) [09:31:32] RECOVERY - Druid historical on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [09:31:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910 (owner: 10Daimona Eaytoy) [09:32:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and NTT (129.250.204.5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:32:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CampaignEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171268 (https://phabricator.wikimedia.org/T397270) (owner: 10Daimona Eaytoy) [09:36:12] (03CR) 10Marostegui: [C:03+2] s1 codfw: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171465 (owner: 10Marostegui) [09:38:03] (03CR) 10CI reject: [V:04-1] sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [09:38:14] !log fceratto@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P79560 and previous config saved to /var/cache/conftool/dbconfig/20250722-093814-fceratto.json [09:38:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance [09:39:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T399249)', diff saved to https://phabricator.wikimedia.org/P79561 and previous config saved to /var/cache/conftool/dbconfig/20250722-093901-marostegui.json [09:39:06] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:39:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance [09:40:58] (03CR) 10Jelto: [C:03+1] "lgtm, thanks for uploading the change!" [dns] - 10https://gerrit.wikimedia.org/r/1171537 (https://phabricator.wikimedia.org/T400121) (owner: 10Arnaudb) [09:41:23] (03PS1) 10Jcrespo: mariadb: Upgrade db2239 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171549 (https://phabricator.wikimedia.org/T394487) [09:41:37] (03PS10) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [09:41:54] (03CR) 10Jelto: [C:03+1] "lgtm, thanks for uploading the change!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1171539 (https://phabricator.wikimedia.org/T400121) (owner: 10Arnaudb) [09:41:58] (03PS4) 10Cathal Mooney: sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) [09:44:06] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [09:45:25] (03PS1) 10David Caro: cloudcephosd1006: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1171550 [09:46:22] (03PS3) 10Cathal Mooney: cephosd: un-set bird bgp neighbors rather than override for each host [puppet] - 10https://gerrit.wikimedia.org/r/1170543 [09:46:48] (03CR) 10Phuedx: "I845f5d8f727f5b2ddfcf4dd7fae256bb1c12ec6d was backported and the backport deployed yesterday." [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [09:46:54] (03PS1) 10Marostegui: s1 eqiad: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171551 [09:47:08] (03CR) 10David Caro: [C:03+2] cloudcephosd1006: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1171550 (owner: 10David Caro) [09:47:24] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: upgrade mariadb [09:47:38] (03PS4) 10Cathal Mooney: cephosd: un-set bird bgp neighbors rather than override for each host [puppet] - 10https://gerrit.wikimedia.org/r/1170543 [09:47:50] (03CR) 10Marostegui: [C:03+2] s1 eqiad: Move to SBR [puppet] - 10https://gerrit.wikimedia.org/r/1171551 (owner: 10Marostegui) [09:48:25] (03CR) 10CI reject: [V:04-1] sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [09:49:14] 
(03PS2) 10Phuedx: mw::maintenance: ExperimentationLab periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) [09:51:58] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade db2239 MariaDB package to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171549 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [09:52:48] (03PS1) 10Marostegui: db2208: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171553 (https://phabricator.wikimedia.org/T399955) [09:53:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P79564 and previous config saved to /var/cache/conftool/dbconfig/20250722-095321-fceratto.json [09:53:40] (03CR) 10Marostegui: [C:03+2] db2208: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171553 (https://phabricator.wikimedia.org/T399955) (owner: 10Marostegui) [09:53:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2208.codfw.wmnet with reason: Maintenance [09:54:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2208 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79565 and previous config saved to /var/cache/conftool/dbconfig/20250722-095402-marostegui.json [09:55:44] !log dcaro@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bullseye [09:55:53] (03PS5) 10Cathal Mooney: sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) [09:55:55] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11023053 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
dcaro@cumin1003 for host cloudcephosd1... [09:56:11] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1240.eqiad.wmnet [09:56:12] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1240.eqiad.wmnet [09:57:24] (03PS11) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [09:57:38] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:57:40] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:59:10] (03PS1) 10Jelto: gitlab: exclude packages from failover backup [puppet] - 10https://gerrit.wikimedia.org/r/1171554 (https://phabricator.wikimedia.org/T399306) [09:59:22] (03PS1) 10Marostegui: db1253: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171555 (https://phabricator.wikimedia.org/T399955) [09:59:51] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [09:59:57] (03CR) 10Marostegui: [C:03+2] db1253: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171555 (https://phabricator.wikimedia.org/T399955) (owner: 10Marostegui) [10:00:00] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170543 (owner: 10Cathal Mooney) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1000) [10:00:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1253.eqiad.wmnet with reason: Maintenance [10:00:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1253 for 
migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79566 and previous config saved to /var/cache/conftool/dbconfig/20250722-100040-marostegui.json [10:02:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79567 and previous config saved to /var/cache/conftool/dbconfig/20250722-100201-root.json [10:04:19] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [10:05:38] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:05:39] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:06:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance [10:06:54] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:07:16] !log homer "cr*eqiad*" commit 'wikikube-worker1243 to active' [10:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1253 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79568 and previous config saved to /var/cache/conftool/dbconfig/20250722-100828-root.json [10:08:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T399728)', diff saved to https://phabricator.wikimedia.org/P79569 and previous config saved to /var/cache/conftool/dbconfig/20250722-100829-fceratto.json [10:08:36] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:08:38] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:08:44] 
!log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2161.codfw.wmnet with reason: Maintenance [10:08:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:xe-0/1/2 (Transport: cr2-eqsin:xe-0/1/4 (NTT, 369639) {#1076}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:08:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T399728)', diff saved to https://phabricator.wikimedia.org/P79570 and previous config saved to /var/cache/conftool/dbconfig/20250722-100851-fceratto.json [10:09:39] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [10:09:44] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:11:37] jouncebot: nowandnext [10:11:37] For the next 0 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1000) [10:11:37] In 1 hour(s) and 48 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1200) [10:12:16] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1243.eqiad.wmnet [10:12:17] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1243.eqiad.wmnet [10:13:22] (03PS9) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [10:13:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T399728)', diff saved to https://phabricator.wikimedia.org/P79571 and previous config saved 
to /var/cache/conftool/dbconfig/20250722-101352-fceratto.json [10:13:57] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:15:39] (CR) CI reject: [V:-1] haproxy: script to perform configuration validation [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: Fabfur) [10:16:38] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:17:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79572 and previous config saved to /var/cache/conftool/dbconfig/20250722-101707-root.json [10:18:06] (PS10) Fabfur: haproxy: script to perform configuration validation [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [10:18:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:xe-0/1/2 (Transport: cr2-eqsin:xe-0/1/4 (NTT, 369639) {#1076}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:19:24] (PS11) Fabfur: haproxy: script to perform configuration validation [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [10:19:39] (CR) Fabfur: haproxy: script to perform configuration validation (7 comments) [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: Fabfur) [10:22:51] (CR) Cparle: [C:+1] Add new MediaSearch config/coefficients [mediawiki-config] - https://gerrit.wikimedia.org/r/1171239 (https://phabricator.wikimedia.org/T385286) (owner: 
Matthias Mullie) [10:23:10] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:23:20] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:23:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1253 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79573 and previous config saved to /var/cache/conftool/dbconfig/20250722-102334-root.json [10:23:38] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:23:43] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:24:03] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:24:07] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [10:27:13] (PS2) Samtar: IS: Set wgTemplateDataEnableFeaturedTemplates default true [mediawiki-config] - https://gerrit.wikimedia.org/r/1171175 (https://phabricator.wikimedia.org/T391064) [10:29:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P79574 and previous config saved to /var/cache/conftool/dbconfig/20250722-102859-fceratto.json [10:29:23] (CR) Vgutierrez: haproxy: script to perform configuration validation (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: Fabfur) [10:31:38] jouncebot now [10:31:38] For the next 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1000) [10:31:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:32:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79575 and previous config saved to /var/cache/conftool/dbconfig/20250722-103213-root.json [10:33:17] (PS1) Phuedx: Revert "InstrumentConfigsFetcher: Make updating configs asynchronous" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1171560 [10:33:41] ^ marostegui [10:33:59] Just waiting for CI etc and I'll deploy the revert [10:35:29] phuedx: can you give me a ping when done, I'm wanting to deploy a quick config change but can wait for yours of course [10:36:45] TheresNoTime: Yup [10:36:45] SRE, vm-requests: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11023255 (FCeratto-WMF) [10:38:40] (CR) TrainBranchBot: [C:+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1171560 (owner: Phuedx) [10:38:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1253 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79576 and previous config saved to /var/cache/conftool/dbconfig/20250722-103840-root.json [10:39:31] (Merged) jenkins-bot: Revert "InstrumentConfigsFetcher: Make updating configs asynchronous" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1171560 (owner: Phuedx) [10:40:09] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1171560|Revert "InstrumentConfigsFetcher: Make updating configs asynchronous"]] [10:40:47] (CR) Multichill: [C:+1] "Thanks for adding this. 
Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/1171250 (https://phabricator.wikimedia.org/T388809) (owner: BCornwall) [10:41:25] (CR) Filippo Giunchedi: [C:+1] prom/metamon: add a dedicated sysuser for the daemons [puppet] - https://gerrit.wikimedia.org/r/1171546 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli) [10:41:46] (CR) Tiziano Fogli: [C:+2] prom/metamon: add a dedicated sysuser for the daemons [puppet] - https://gerrit.wikimedia.org/r/1171546 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli) [10:41:57] ops-eqiad, SRE, DC-Ops, serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851#11023275 (Clement_Goubert) Server set Active and repooled. [10:43:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T399249)', diff saved to https://phabricator.wikimedia.org/P79577 and previous config saved to /var/cache/conftool/dbconfig/20250722-104345-marostegui.json [10:43:50] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:44:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P79578 and previous config saved to /var/cache/conftool/dbconfig/20250722-104407-fceratto.json [10:44:32] (CR) Phuedx: [C:-1] mw::maintenance: ExperimentationLab periodic job [puppet] - https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: Phuedx) [10:45:34] (CR) Phuedx: [C:-1] "Until we have reduced the mainstash read volume (by re-introducing the caching layer), there's no point in merging this change." 
[puppet] - https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: Phuedx) [10:46:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:47:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79579 and previous config saved to /var/cache/conftool/dbconfig/20250722-104719-root.json [10:47:23] (CR) Vgutierrez: haproxy: script to perform configuration validation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: Fabfur) [10:53:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1253 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79580 and previous config saved to /var/cache/conftool/dbconfig/20250722-105345-root.json [10:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:58:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P79581 and previous config saved to /var/cache/conftool/dbconfig/20250722-105852-marostegui.json [10:59:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T399728)', diff saved to https://phabricator.wikimedia.org/P79582 and previous config saved to /var/cache/conftool/dbconfig/20250722-105914-fceratto.json [10:59:19] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:59:30] !log 
fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2163.codfw.wmnet with reason: Maintenance [10:59:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T399728)', diff saved to https://phabricator.wikimedia.org/P79583 and previous config saved to /var/cache/conftool/dbconfig/20250722-105936-fceratto.json [11:03:22] (PS12) Fabfur: haproxy: script to perform configuration validation [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [11:03:28] (CR) Fabfur: haproxy: script to perform configuration validation (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: Fabfur) [11:04:15] TheresNoTime: Sorry for the delay. CDB rebuild due to an i18n change in the patch [11:04:22] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1171560|Revert "InstrumentConfigsFetcher: Make updating configs asynchronous"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:04:26] no rush! 
:) [11:04:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T399728)', diff saved to https://phabricator.wikimedia.org/P79584 and previous config saved to /var/cache/conftool/dbconfig/20250722-110440-fceratto.json [11:04:45] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:04:59] (PS12) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [11:05:55] !log phuedx@deploy1003 phuedx: Continuing with sync [11:07:22] (CR) CI reject: [V:-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [11:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:09:40] (PS13) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) [11:11:55] (CR) CI reject: [V:-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [11:14:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P79585 and previous config saved to /var/cache/conftool/dbconfig/20250722-111400-marostegui.json [11:18:42] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171560|Revert "InstrumentConfigsFetcher: Make updating configs asynchronous"]] (duration: 38m 33s) [11:19:48] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P79586 and previous config saved to /var/cache/conftool/dbconfig/20250722-111947-fceratto.json [11:20:13] TheresNoTime: All yours :) [11:20:27] phuedx: thanks! [11:20:56] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1171175 (https://phabricator.wikimedia.org/T391064) (owner: Samtar) [11:21:44] (Merged) jenkins-bot: IS: Set wgTemplateDataEnableFeaturedTemplates default true [mediawiki-config] - https://gerrit.wikimedia.org/r/1171175 (https://phabricator.wikimedia.org/T391064) (owner: Samtar) [11:22:10] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1171175|IS: Set wgTemplateDataEnableFeaturedTemplates default true (T391064)]] [11:22:14] T391064: Enable template favoriting on all remaining WMF wikis - https://phabricator.wikimedia.org/T391064 [11:24:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... [11:24:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [11:26:41] !log samtar@deploy1003 samtar: Backport for [[gerrit:1171175|IS: Set wgTemplateDataEnableFeaturedTemplates default true (T391064)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:26:49] (Abandoned) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171248 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [11:26:58] * TheresNoTime is looking ^ [11:28:19] !log samtar@deploy1003 samtar: Continuing with sync [11:28:33] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission db1246.eqiad.wmnet - https://phabricator.wikimedia.org/T399449#11023416 (Jclark-ctr) a:VRiley-WMF→Jclark-ctr [11:28:37] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission db1246.eqiad.wmnet - https://phabricator.wikimedia.org/T399449#11023418 (Jclark-ctr) Open→Resolved [11:29:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T399249)', diff saved to https://phabricator.wikimedia.org/P79587 and previous config saved to /var/cache/conftool/dbconfig/20250722-112907-marostegui.json [11:29:14] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:29:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance [11:29:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T399249)', diff saved to https://phabricator.wikimedia.org/P79588 and previous config saved to /var/cache/conftool/dbconfig/20250722-112929-marostegui.json [11:34:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P79589 and previous config saved to /var/cache/conftool/dbconfig/20250722-113454-fceratto.json [11:35:26] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171175|IS: Set wgTemplateDataEnableFeaturedTemplates default true (T391064)]] (duration: 13m 16s) [11:35:31] T391064: Enable template favoriting on all remaining WMF wikis - https://phabricator.wikimedia.org/T391064 
[11:37:53] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2239.codfw.wmnet [11:37:53] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2239.codfw.wmnet [11:39:02] jouncebot: nowandnext [11:39:02] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [11:39:02] In 0 hour(s) and 20 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1200) [11:39:30] Want to deploy a security patch [11:40:48] * TheresNoTime is done deploying [11:42:40] Thanks. Deploying patch now [11:44:05] ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#11023481 (Jclark-ctr) a:Jclark-ctr [11:45:12] (CR) Michael Große: "I think that this is now ready to move forward" [mediawiki-config] - https://gerrit.wikimedia.org/r/1164287 (https://phabricator.wikimedia.org/T386250) (owner: Michael Große) [11:45:34] ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#11023482 (ayounsi) Open→Resolved All done, thanks a lot! 
[11:45:42] PROBLEM - Host cloudsw2-d5-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [11:45:48] PROBLEM - Host cloudsw2-d5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:50:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T399728)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250722-115002-fceratto.json [11:50:14] !log dreamyjazz Deployed security patch for T399627 [11:50:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2164.codfw.wmnet with reason: Maintenance [11:50:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T399728)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250722-115029-fceratto.json [11:50:46] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:51:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:15] (PS1) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [11:51:42] (PS1) Btullis: Fix an issue with disabling the hadoop-yarn-nodemanager service [puppet] - https://gerrit.wikimedia.org/r/1171563 (https://phabricator.wikimedia.org/T397160) [11:52:23] ops-eqiad, SRE, DC-Ops, decommission-hardware, Data-Platform-SRE (2025.07.05 - 2025.07.25): decommission an-conf100[1-3] - https://phabricator.wikimedia.org/T398013#11023517 (Jclark-ctr) Open→Resolved [11:53:03] (CR) Btullis: [V:+1] "PCC SUCCESS 
(CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1171563 (https://phabricator.wikimedia.org/T397160) (owner: Btullis) [11:53:50] (CR) CI reject: [V:-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [11:54:46] (CR) Dbrant: [C:+2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170584 (owner: PipelineBot) [11:55:00] (Abandoned) Dbrant: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170457 (owner: PipelineBot) [11:55:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T399728)', diff saved to https://phabricator.wikimedia.org/P79591 and previous config saved to /var/cache/conftool/dbconfig/20250722-115536-fceratto.json [11:55:41] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:55:48] (PS2) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [11:56:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:56:58] (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170584 (owner: PipelineBot) [11:58:00] !log dreamyjazz Deployed security patch for T399627 [11:58:05] (CR) CI reject: [V:-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] 
- https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1200) [12:00:56] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:01:21] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:03:28] (CR) Dbrant: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170455 (owner: PipelineBot) [12:03:31] (PS3) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [12:03:45] (PS2) Btullis: Fix an issue with disabling the hadoop-yarn-nodemanager service [puppet] - https://gerrit.wikimedia.org/r/1171563 (https://phabricator.wikimedia.org/T397160) [12:05:25] (Merged) jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170455 (owner: PipelineBot) [12:05:43] (CR) CI reject: [V:-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [12:06:06] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "force sync after netmask changes netbox - cmooney@cumin1003" [12:06:13] (Abandoned) Dbrant: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170345 (owner: PipelineBot) [12:06:24] (Abandoned) Dbrant: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170326 (owner: PipelineBot) [12:06:38] (Abandoned) Dbrant: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1166812 (owner: PipelineBot) [12:06:46] !log cmooney@cumin1003 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "force sync after netmask changes netbox - cmooney@cumin1003" [12:07:29] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:08:00] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:08:12] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:08:41] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:09:54] (PS4) Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [12:10:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P79592 and previous config saved to /var/cache/conftool/dbconfig/20250722-121043-fceratto.json [12:11:12] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:11:36] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:12:06] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "force sync after netmask changes netbox - cmooney@cumin1003" [12:12:06] (CR) CI reject: [V:-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [12:12:44] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "force sync after netmask changes netbox - cmooney@cumin1003" [12:14:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[12:16:14] PROBLEM - Druid coordinator on an-druid1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:16:37] (PS1) Ayounsi: Remove cloudsw1-d5 to cloudsw2-d5 Xlink allocation [dns] - https://gerrit.wikimedia.org/r/1171566 [12:17:16] (CR) CI reject: [V:-1] Remove cloudsw1-d5 to cloudsw2-d5 Xlink allocation [dns] - https://gerrit.wikimedia.org/r/1171566 (owner: Ayounsi) [12:18:22] PROBLEM - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.21 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:18:44] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "force sync after netmask changes netbox - cmooney@cumin1003" [12:18:54] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:18:54] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:18:54] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:18:58] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:16] PROBLEM - Host msw1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:54] PROBLEM - Host ps1-b5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:54] PROBLEM - Host ps1-a2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:54] PROBLEM - Host ps1-a8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:54] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:54] PROBLEM - Host ps1-d2-eqiad is DOWN: PING CRITICAL - Packet 
loss = 100% [12:19:54] PROBLEM - Host ps1-a7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:55] PROBLEM - Host ps1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:55] PROBLEM - Host ps1-c4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:56] PROBLEM - Host ps1-d7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:19:57] PROBLEM - Host ps1-c2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:20:10] PROBLEM - Host ps1-b2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:21:29] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [12:21:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "force sync after netmask changes netbox - cmooney@cumin1003" [12:23:42] (CR) Elukey: [C:+1] Fix an issue with disabling the hadoop-yarn-nodemanager service [puppet] - https://gerrit.wikimedia.org/r/1171563 (https://phabricator.wikimedia.org/T397160) (owner: Btullis) [12:23:54] (CR) Cathal Mooney: [C:+1] "LGTM!" 
[dns] - https://gerrit.wikimedia.org/r/1171566 (owner: Ayounsi) [12:25:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P79593 and previous config saved to /var/cache/conftool/dbconfig/20250722-122551-fceratto.json [12:26:04] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw2 decom - ayounsi@cumin1003" [12:26:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudsw2 decom - ayounsi@cumin1003" [12:26:09] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:26:12] (CR) Ayounsi: "recheck" [dns] - https://gerrit.wikimedia.org/r/1171566 (owner: Ayounsi) [12:26:22] (CR) Vgutierrez: haproxy: script to perform configuration validation (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: Fabfur) [12:26:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:27:49] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:28:30] (CR) Ayounsi: [C:+2] Remove cloudsw1-d5 to cloudsw2-d5 Xlink allocation [dns] - https://gerrit.wikimedia.org/r/1171566 (owner: Ayounsi) [12:28:48] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:28:59] !log ayounsi@dns1004 START - running authdns-update [12:29:03] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:29:14] FIRING: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1081:9100 
- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:29:47] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:29:55] !log ayounsi@dns1004 END - running authdns-update [12:32:14] RECOVERY - Druid coordinator on an-druid1007 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:33:43] (CR) Lucas Werkmeister (WMDE): [C:+1] Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - https://gerrit.wikimedia.org/r/1167910 (owner: Daimona Eaytoy) [12:35:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T399249)', diff saved to https://phabricator.wikimedia.org/P79594 and previous config saved to /var/cache/conftool/dbconfig/20250722-123545-marostegui.json [12:35:50] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:36:10] PROBLEM - Host ps1-d4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:36:52] (PS1) Jelto: add M247 to fetch_external_clouds:vendors_nets.py [puppet] - https://gerrit.wikimedia.org/r/1171569 (https://phabricator.wikimedia.org/T400138) [12:37:20] RECOVERY - Host ps1-d4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [12:37:53] FIRING: GnmiTargetDown: cloudsw1-d5-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [12:40:35] (CR) Vgutierrez: traffic: new alerts for haproxykafka (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: Fabfur) [12:40:59] !log fceratto@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db2164 (T399728)', diff saved to https://phabricator.wikimedia.org/P79595 and previous config saved to /var/cache/conftool/dbconfig/20250722-124058-fceratto.json [12:41:05] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:41:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2166.codfw.wmnet with reason: Maintenance [12:41:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T399728)', diff saved to https://phabricator.wikimedia.org/P79596 and previous config saved to /var/cache/conftool/dbconfig/20250722-124121-fceratto.json [12:41:58] (03CR) 10Stevemunene: [C:03+1] "Looks Good, Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1171563 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis) [12:46:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T399728)', diff saved to https://phabricator.wikimedia.org/P79597 and previous config saved to /var/cache/conftool/dbconfig/20250722-124626-fceratto.json [12:46:31] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:47:52] RESOLVED: GnmiTargetDown: cloudsw1-d5-eqiad is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [12:49:19] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [12:49:24] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [12:50:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2159', diff saved to https://phabricator.wikimedia.org/P79598 and previous config saved to /var/cache/conftool/dbconfig/20250722-125052-marostegui.json [12:51:59] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023953 (10dcaro) p:05Triage→03High [12:52:12] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023956 (10dcaro) [12:53:22] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [12:53:30] RECOVERY - Host ps1-d2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms [12:54:17] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023963 (10taavi) > Ideally, we would have probes tracking a number of tools, and we could page when the per... 
[12:55:11] (03CR) 10Arnaudb: [C:03+1] add M247 to fetch_external_clouds:vendors_nets.py [puppet] - 10https://gerrit.wikimedia.org/r/1171569 (https://phabricator.wikimedia.org/T400138) (owner: 10Jelto) [12:55:14] (03CR) 10Btullis: [C:03+2] Fix an issue with disabling the hadoop-yarn-nodemanager service [puppet] - 10https://gerrit.wikimedia.org/r/1171563 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis) [12:55:38] RECOVERY - Host ps1-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms [12:55:48] RECOVERY - Host msw1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [12:56:59] (03CR) 10Arnaudb: [C:03+1] gitlab: exclude packages from failover backup [puppet] - 10https://gerrit.wikimedia.org/r/1171554 (https://phabricator.wikimedia.org/T399306) (owner: 10Jelto) [12:57:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [12:59:47] (03PS1) 10Elukey: install_server: fix raid1-1dev-nvme recipe [puppet] - 10https://gerrit.wikimedia.org/r/1171571 (https://phabricator.wikimedia.org/T393044) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1300) [13:00:05] Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:20] I can probably deploy in a few minutes [13:00:26] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11023985 (10fnegri) > Instead of probes, what about measuring the percentage or rate of 5xx errors returned f... [13:01:20] o/ [13:01:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P79599 and previous config saved to /var/cache/conftool/dbconfig/20250722-130133-fceratto.json [13:01:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:02:31] ok, I can deploy! [13:02:35] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [13:02:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910 (owner: 10Daimona Eaytoy) [13:03:45] (03Merged) 10jenkins-bot: Use new `sul` dblist for $wmgCampaignEventsUseCentralDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167910 (owner: 10Daimona Eaytoy) [13:04:08] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1167910|Use new `sul` dblist for $wmgCampaignEventsUseCentralDB]] [13:06:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P79600 and previous config saved to 
/var/cache/conftool/dbconfig/20250722-130600-marostegui.json [13:06:20] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1167910|Use new `sul` dblist for $wmgCampaignEventsUseCentralDB]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:06:24] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfix on rename - oblivian@cumin1003" [13:06:26] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfix on rename - oblivian@cumin1003 [13:06:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11024001 (10elukey) @jhathaway you rock thanks a lot! I verified with a reimage that d-i now correctly handles the new partman recipe. I didn't really think about using... [13:06:55] (03CR) 10Elukey: [C:03+2] install_server: fix raid1-1dev-nvme recipe [puppet] - 10https://gerrit.wikimedia.org/r/1171571 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [13:07:01] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfix on rename - oblivian@cumin1003 [13:07:02] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfix on rename - oblivian@cumin1003" [13:07:56] RECOVERY - Host ps1-d1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [13:08:00] (03CR) 10Arnaudb: [C:03+2] gitlab: exclude packages from failover backup [puppet] - 10https://gerrit.wikimedia.org/r/1171554 (https://phabricator.wikimedia.org/T399306) (owner: 10Jelto) [13:08:35] Daimona: anything to test here? [13:08:55] It should be a noop but I'll take a quick look to confirm.
[13:09:09] I checked that https://www.wikidata.org/wiki/Special:AllEvents still looks the same fwiw [13:09:24] (03CR) 10Ssingh: varnish: new policy to allow websockets and caching, apply to phab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [13:10:03] Yup, I did similar tests and it looks fine! [13:10:17] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, daimona: Continuing with sync [13:10:18] alright :) [13:12:53] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [13:12:57] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [13:13:07] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [13:13:14] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply [13:13:38] RECOVERY - Host ps1-c4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [13:13:56] RECOVERY - Host ps1-d7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [13:13:56] RECOVERY - Host asw2-d-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [13:14:14] PROBLEM - Host cloudsw2-d5-eqiad.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:16:02] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167910|Use new `sul` dblist for $wmgCampaignEventsUseCentralDB]] (duration: 11m 54s) [13:16:08] RECOVERY - Host ps1-b2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [13:16:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171268 (https://phabricator.wikimedia.org/T397270) (owner: 10Daimona Eaytoy) [13:16:28] RECOVERY - Host asw2-b-eqiad is UP: 
PING OK - Packet loss = 0%, RTA = 1.05 ms [13:16:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P79601 and previous config saved to /var/cache/conftool/dbconfig/20250722-131640-fceratto.json [13:17:51] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit2002, replica=gerrit2003) [13:17:56] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.topology-check (exit_code=99) Validate Gerrit topology (source=gerrit2002, replica=gerrit2003) [13:18:36] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit2002, replica=gerrit2003) [13:18:42] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.topology-check (exit_code=99) Validate Gerrit topology (source=gerrit2002, replica=gerrit2003) [13:18:46] RECOVERY - Host ps1-a2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.51 ms [13:18:49] 10ops-eqiad, 06DC-Ops: decom cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T400157 (10ayounsi) 03NEW [13:18:56] RECOVERY - Host ps1-b5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms [13:18:56] 10ops-eqiad, 06DC-Ops: decom cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T400157#11024056 (10ayounsi) [13:19:26] 10SRE-SLO, 10Abstract Wikipedia team (25Q4 (Apr–Jun)), 07OKR-Work, 07Workstreams: Establish an SLO for the Wikifunctions integration into Wikimedia projects' wikitext pages, to assure reader experience quality is maintained during roll-out - https://phabricator.wikimedia.org/T390548#11024058 (10Jdforres... 
[13:19:32] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit2002, replica=gerrit2003) [13:19:37] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.topology-check (exit_code=99) Validate Gerrit topology (source=gerrit2002, replica=gerrit2003) [13:19:57] (03PS1) 10Ayounsi: Remove cloudsw2-d5 [homer/public] - 10https://gerrit.wikimedia.org/r/1171574 [13:20:00] RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [13:20:00] 10ops-eqiad, 06DC-Ops: decom cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T400157#11024067 (10Jclark-ctr) a:03Jclark-ctr [13:20:22] (03PS1) 10Elukey: site.pp: fix insetup role for ml-serve10[12,13] [puppet] - 10https://gerrit.wikimedia.org/r/1171575 (https://phabricator.wikimedia.org/T393948) [13:21:05] (03PS1) 10Ayounsi: Remove cloudsw2-d5 from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1171576 (https://phabricator.wikimedia.org/T400157) [13:21:06] RECOVERY - Host ps1-a7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [13:21:08] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [13:21:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T399249)', diff saved to https://phabricator.wikimedia.org/P79602 and previous config saved to /var/cache/conftool/dbconfig/20250722-132107-marostegui.json [13:21:15] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:21:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance [13:21:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T399249)', diff saved to https://phabricator.wikimedia.org/P79603 and previous config saved to /var/cache/conftool/dbconfig/20250722-132133-marostegui.json [13:23:07] (03CR) 10Elukey: [C:03+2] site.pp: fix insetup role for ml-serve10[12,13] [puppet] 
- 10https://gerrit.wikimedia.org/r/1171575 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [13:23:20] (03CR) 10Cathal Mooney: [C:03+2] Remove cloudsw2-d5 [homer/public] - 10https://gerrit.wikimedia.org/r/1171574 (owner: 10Ayounsi) [13:23:37] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit2002, replica=gerrit2003) [13:23:42] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.topology-check (exit_code=99) Validate Gerrit topology (source=gerrit2002, replica=gerrit2003) [13:24:07] (03Merged) 10jenkins-bot: Remove cloudsw2-d5 [homer/public] - 10https://gerrit.wikimedia.org/r/1171574 (owner: 10Ayounsi) [13:24:11] (03CR) 10Cathal Mooney: "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1171576 (https://phabricator.wikimedia.org/T400157) (owner: 10Ayounsi) [13:24:30] RECOVERY - Host ps1-c2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [13:24:33] (03CR) 10Ayounsi: [C:03+2] Remove cloudsw2-d5 from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1171576 (https://phabricator.wikimedia.org/T400157) (owner: 10Ayounsi) [13:25:18] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [13:25:28] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [13:27:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Discrepencies with cableid & ports on some msw in c/d <-> msw1-eqiad - https://phabricator.wikimedia.org/T400159 (10Jclark-ctr) 03NEW [13:27:55] (03Merged) 10jenkins-bot: Modifications to UpdateCountriesScript [extensions/CampaignEvents] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1171268 (https://phabricator.wikimedia.org/T397270) (owner: 10Daimona Eaytoy) [13:28:19] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1171268|Modifications to UpdateCountriesScript (T397270)]] [13:28:23] T397270: Create a script to update the 
country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:28:47] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11024131 (10MatthewVernon) @GPSLeo I was at a conference last week, so only just getting to this. It looks like the file page has been deleted; is there any action needed here now? [13:29:18] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [13:29:50] (03CR) 10Ayounsi: [C:03+1] "much leaner!" [puppet] - 10https://gerrit.wikimedia.org/r/1170543 (owner: 10Cathal Mooney) [13:30:28] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1171268|Modifications to UpdateCountriesScript (T397270)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:30:36] Daimona: not much to test here, I expect? [13:30:37] (03CR) 10Ayounsi: [C:03+1] sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [13:30:58] other than running the maintenance script once the backport is fully synced [13:31:26] Yup, exactly. 
I will run it later [13:31:36] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, daimona: Continuing with sync [13:31:36] ok [13:31:46] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [13:31:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T399728)', diff saved to https://phabricator.wikimedia.org/P79604 and previous config saved to /var/cache/conftool/dbconfig/20250722-133148-fceratto.json [13:31:53] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:32:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2167.codfw.wmnet with reason: Maintenance [13:32:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T399728)', diff saved to https://phabricator.wikimedia.org/P79605 and previous config saved to /var/cache/conftool/dbconfig/20250722-133211-fceratto.json [13:32:17] BTW, thank you! Apologies if I'm a bit absent but I'm in a call and I haven't unlocked the multithreading brain upgrade. 
[13:32:33] :D [13:34:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:57] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171268|Modifications to UpdateCountriesScript (T397270)]] (duration: 08m 37s) [13:37:01] T397270: Create a script to update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [13:37:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T399728)', diff saved to https://phabricator.wikimedia.org/P79606 and previous config saved to /var/cache/conftool/dbconfig/20250722-133716-fceratto.json [13:37:21] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [13:39:32] !log UTC afternoon backport+config window done [13:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:53] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): [k8s,infra,o11y] Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11024157 (10taavi) Not at the moment, I think. I see two options for collecting that: using [[ https://github... 
[13:42:37] (03CR) 10David Caro: [C:03+1] "Let's give it a try" [puppet] - 10https://gerrit.wikimedia.org/r/1171279 (owner: 10Andrew Bogott) [13:42:59] (03CR) 10Jelto: [C:03+2] add M247 to fetch_external_clouds:vendors_nets.py [puppet] - 10https://gerrit.wikimedia.org/r/1171569 (https://phabricator.wikimedia.org/T400138) (owner: 10Jelto) [13:43:49] (03PS1) 10Ayounsi: Add BGP to Inter.link [homer/public] - 10https://gerrit.wikimedia.org/r/1171582 (https://phabricator.wikimedia.org/T394043) [13:43:51] !log elukey@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:44:09] !log elukey@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin1003" [13:44:09] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1012.eqiad.wmnet with OS bookworm [13:44:36] (03Abandoned) 10Andrew Bogott: aptrepo: support ceph/quincy on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1171279 (owner: 10Andrew Bogott) [13:44:44] (03PS1) 10Jgreen: Switch payments.wikimedia.org to the BIRD/HAProxy balancers. [dns] - 10https://gerrit.wikimedia.org/r/1171583 (https://phabricator.wikimedia.org/T398321) [13:45:27] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [13:45:44] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2006.codfw.wmnet with OS bookworm [13:46:58] (03PS1) 10Stevemunene: dse-k8s: deploy etcd service [puppet] - 10https://gerrit.wikimedia.org/r/1171584 (https://phabricator.wikimedia.org/T397293) [13:47:03] (03CR) 10Jgreen: [C:03+2] Switch payments.wikimedia.org to the BIRD/HAProxy balancers. 
[dns] - 10https://gerrit.wikimedia.org/r/1171583 (https://phabricator.wikimedia.org/T398321) (owner: 10Jgreen) [13:47:20] !log jgreen@dns1004 START - running authdns-update [13:47:59] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [13:48:21] !log jgreen@dns1004 END - running authdns-update [13:50:19] (03PS3) 10David Caro: cloudceph: comment out ceph versions in 'common' [puppet] - 10https://gerrit.wikimedia.org/r/1171285 (owner: 10Andrew Bogott) [13:50:24] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1171285 (owner: 10Andrew Bogott) [13:51:27] (03CR) 10Cathal Mooney: "LGTM, one comment on the as-path filter" [homer/public] - 10https://gerrit.wikimedia.org/r/1171582 (https://phabricator.wikimedia.org/T394043) (owner: 10Ayounsi) [13:52:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P79607 and previous config saved to /var/cache/conftool/dbconfig/20250722-135223-fceratto.json [13:54:06] (03CR) 10Xcollazo: Disable all dumps timers on snapshot hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [13:55:37] (03CR) 10David Caro: [C:03+1] "LGTM, we might want to delete instead of commenting, in any case +1" [puppet] - 10https://gerrit.wikimedia.org/r/1171285 (owner: 10Andrew Bogott) [13:57:15] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm [13:59:00] (03PS1) 10Elukey: Revert "install_server: fix raid1-1dev-nvme recipe" [puppet] - 10https://gerrit.wikimedia.org/r/1171586 [13:59:05] (03PS2) 10Elukey: Revert "install_server: fix raid1-1dev-nvme recipe" [puppet] - 10https://gerrit.wikimedia.org/r/1171586 [13:59:09] (03CR) 10Elukey: [V:03+2 C:03+2] Revert "install_server: fix raid1-1dev-nvme recipe" [puppet] - 
10https://gerrit.wikimedia.org/r/1171586 (owner: 10Elukey) [14:03:43] (03PS2) 10Ayounsi: Add BGP to Inter.link [homer/public] - 10https://gerrit.wikimedia.org/r/1171582 (https://phabricator.wikimedia.org/T394043) [14:04:13] (03CR) 10Ayounsi: Add BGP to Inter.link (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1171582 (https://phabricator.wikimedia.org/T394043) (owner: 10Ayounsi) [14:05:53] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/1171582 (https://phabricator.wikimedia.org/T394043) (owner: 10Ayounsi) [14:05:57] (03CR) 10Ayounsi: [C:03+2] Add BGP to Inter.link [homer/public] - 10https://gerrit.wikimedia.org/r/1171582 (https://phabricator.wikimedia.org/T394043) (owner: 10Ayounsi) [14:06:28] (03Merged) 10jenkins-bot: Add BGP to Inter.link [homer/public] - 10https://gerrit.wikimedia.org/r/1171582 (https://phabricator.wikimedia.org/T394043) (owner: 10Ayounsi) [14:07:27] !log setup BGP to inter.link in esams [14:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P79608 and previous config saved to /var/cache/conftool/dbconfig/20250722-140731-fceratto.json [14:08:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11024208 (10elukey) >>! In T393948#11024001, @elukey wrote: > @jhathaway you rock thanks a lot! I verified with a reimage that d-i now correctly handles the new partman... [14:10:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11024210 (10jhathaway) >>! In T393948#11024001, @elukey wrote: > @jhathaway you rock thanks a lot! I verified with a reimage that d-i now correctly handles the new part... 
[14:11:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:12:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself - https://phabricator.wikimedia.org/T400161 (10cmooney) 03NEW p:05Triage→03Low [14:17:43] (03CR) 10Andrew Bogott: [C:03+2] cloudceph: comment out ceph versions in 'common' [puppet] - 10https://gerrit.wikimedia.org/r/1171285 (owner: 10Andrew Bogott) [14:18:22] (03CR) 10Herron: [C:03+1] Revert "logstash: remove event.duration when value is hyphen" [puppet] - 10https://gerrit.wikimedia.org/r/1168234 (owner: 10Cwhite) [14:19:12] (03PS1) 10TChin: [eventstreams] Bump version 0.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171588 (https://phabricator.wikimedia.org/T383977) [14:19:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1081-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:22:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T399728)', diff saved to https://phabricator.wikimedia.org/P79609 and previous config saved to /var/cache/conftool/dbconfig/20250722-142239-fceratto.json [14:22:44] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:22:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2181.codfw.wmnet with reason: Maintenance [14:23:02] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T399728)', diff saved to https://phabricator.wikimedia.org/P79610 and previous config saved to /var/cache/conftool/dbconfig/20250722-142302-fceratto.json [14:23:17] (03CR) 10Btullis: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171588 (https://phabricator.wikimedia.org/T383977) (owner: 10TChin) [14:24:08] (03CR) 10TChin: [C:03+2] [eventstreams] Bump version 0.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171588 (https://phabricator.wikimedia.org/T383977) (owner: 10TChin) [14:25:52] (03Merged) 10jenkins-bot: [eventstreams] Bump version 0.17.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171588 (https://phabricator.wikimedia.org/T383977) (owner: 10TChin) [14:26:25] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:26:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T399249)', diff saved to https://phabricator.wikimedia.org/P79611 and previous config saved to /var/cache/conftool/dbconfig/20250722-142637-marostegui.json [14:26:43] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:26:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:27:04] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [14:27:18] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [14:27:40] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11024307 (10BTullis) a:05BTullis→03None [14:28:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 
(T399728)', diff saved to https://phabricator.wikimedia.org/P79612 and previous config saved to /var/cache/conftool/dbconfig/20250722-142808-fceratto.json [14:28:14] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [14:28:29] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [14:29:03] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [14:29:28] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1430) [14:30:36] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [14:31:46] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1081 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:32:03] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:32:19] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2035 [14:32:31] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2035 [14:36:18] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 89289320 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:36:23] !log mwscript-k8s --comment="T397270" -f --file /srv/mediawiki/php-1.45.0-wmf.9/extensions/CampaignEvents/maintenance/countryExceptionMappings.csv -- CampaignEvents:UpdateCountriesColumn --wiki metawiki --exceptions countryExceptionMappings.csv [14:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:27] T397270: Create a script to 
update the country schema to the new format - https://phabricator.wikimedia.org/T397270 [14:36:32] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1171591 [14:37:18] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 72528 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:37:43] (03PS1) 10Stevemunene: dns: Add a VIP for dse-k8s-ctrl.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1171592 (https://phabricator.wikimedia.org/T397293) [14:39:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1081-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:41:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P79618 and previous config saved to /var/cache/conftool/dbconfig/20250722-144145-marostegui.json [14:42:53] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [14:42:58] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11024407 (10Andrew) Our issue resembles this upstream report: https://serverfault.com/questions/1172161/osds-stability... 
[14:43:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P79619 and previous config saved to /var/cache/conftool/dbconfig/20250722-144316-fceratto.json [14:43:20] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [14:44:05] (03CR) 10Btullis: [V:03+1] Disable all dumps timers on snapshot hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [14:46:36] (03PS5) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [14:48:08] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [14:48:14] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11024439 (10Jhancock.wm) es2035 has been updated. Good news, I didn't have to physically move that one. There were no other 1G servers in the region on the s... 
[14:48:44] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: upgrade mariadb [14:48:54] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [14:49:09] RESOLVED: SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:43] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [14:53:23] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [14:54:08] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [14:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:40] (03PS1) 10Jcrespo: mariadb: Upgrade backup source db1150 to MariaDB package 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171594 (https://phabricator.wikimedia.org/T394487) [14:54:56] (03PS1) 10Zabe: Set virtual domain for GlobalUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171595 (https://phabricator.wikimedia.org/T400169) [14:56:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P79621 and previous config saved to /var/cache/conftool/dbconfig/20250722-145652-marostegui.json [14:58:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P79622 and previous config saved to 
/var/cache/conftool/dbconfig/20250722-145823-fceratto.json [15:00:05] jelto, arnoldokoth, and mutante: #bothumor I � Unicode. All rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1500). [15:00:18] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 180878144 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:01:18] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 22072 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:02:08] (03PS1) 10Andrew Bogott: profile::cloudceph::osd: explicitly install ceph-volume [puppet] - 10https://gerrit.wikimedia.org/r/1171596 [15:03:52] (03PS5) 10Herron: role::titan: install promtool [puppet] - 10https://gerrit.wikimedia.org/r/1171591 (https://phabricator.wikimedia.org/T349521) [15:04:03] (03CR) 10Elukey: [C:03+2] "Reverted it, clearly wrong :)" [puppet] - 10https://gerrit.wikimedia.org/r/1171571 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:49] (03CR) 10David Caro: [C:03+1] profile::cloudceph::osd: explicitly install ceph-volume [puppet] - 10https://gerrit.wikimedia.org/r/1171596 (owner: 10Andrew Bogott) [15:07:39] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync [15:07:42] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:09:18] (03PS2) 10Andrew Bogott: profile::cloudceph::osd: explicitly install ceph-volume [puppet] - 10https://gerrit.wikimedia.org/r/1171596 [15:09:26] FIRING: 
PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:27] (03PS3) 10Andrew Bogott: profile::cloudceph::osd: explicitly install ceph-volume [puppet] - 10https://gerrit.wikimedia.org/r/1171596 [15:10:12] (03CR) 10Zabe: [C:03+2] Set virtual domain for GlobalUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171595 (https://phabricator.wikimedia.org/T400169) (owner: 10Zabe) [15:11:01] (03Merged) 10jenkins-bot: Set virtual domain for GlobalUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171595 (https://phabricator.wikimedia.org/T400169) (owner: 10Zabe) [15:11:49] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1171595|Set virtual domain for GlobalUsage (T400169)]] [15:11:53] T400169: Convert GlobalUsage to virtual domains - https://phabricator.wikimedia.org/T400169 [15:12:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T399249)', diff saved to https://phabricator.wikimedia.org/P79623 and previous config saved to /var/cache/conftool/dbconfig/20250722-151201-marostegui.json [15:12:06] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:12:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance [15:12:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T399249)', diff saved to https://phabricator.wikimedia.org/P79624 and previous config saved to /var/cache/conftool/dbconfig/20250722-151224-marostegui.json [15:13:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T399728)', diff saved to https://phabricator.wikimedia.org/P79625 and previous config saved to 
/var/cache/conftool/dbconfig/20250722-151331-fceratto.json [15:13:38] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:13:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2195.codfw.wmnet with reason: Maintenance [15:13:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T399728)', diff saved to https://phabricator.wikimedia.org/P79626 and previous config saved to /var/cache/conftool/dbconfig/20250722-151355-fceratto.json [15:13:58] !log zabe@deploy1003 zabe: Backport for [[gerrit:1171595|Set virtual domain for GlobalUsage (T400169)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:14:31] 10SRE-SLO, 10Observability-Metrics: Clear & Backfill Tonecheck Pyrra Metrics - https://phabricator.wikimedia.org/T400071#11024572 (10herron) ` herron@prometheus1005:~/tmp/backfill/tonecheck$ promtool tsdb create-blocks-from rules --start=2025-03-01T00:00:00Z --end=2025-07-22T00:00:00Z --output-dir='output/' --... [15:14:47] !log zabe@deploy1003 zabe: Continuing with sync [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:47] (03CR) 10Btullis: "This is only the reverse DNS address. I think that you need to add the forward record as well." 
[dns] - 10https://gerrit.wikimedia.org/r/1171592 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [15:18:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T399728)', diff saved to https://phabricator.wikimedia.org/P79627 and previous config saved to /var/cache/conftool/dbconfig/20250722-151857-fceratto.json [15:19:02] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [15:20:15] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171595|Set virtual domain for GlobalUsage (T400169)]] (duration: 08m 26s) [15:20:20] T400169: Convert GlobalUsage to virtual domains - https://phabricator.wikimedia.org/T400169 [15:21:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11024611 (10Jhancock.wm) I'll trash the optic. Good to close if there are no other points to cover. [15:22:16] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11024612 (10ayounsi) Ran homer and manually removed the config forcing it at 1G, host is up. [15:23:37] (03CR) 10Jcrespo: [C:03+2] mariadb: Upgrade backup source db1150 to MariaDB package 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171594 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [15:23:47] (03CR) 10Cathal Mooney: [C:03+2] WMF Plugin: do not process disabled ports for block speed setting [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1167564 (https://phabricator.wikimedia.org/T394333) (owner: 10Cathal Mooney) [15:24:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[15:24:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [15:25:03] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [15:31:51] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.1 - cmooney@cumin1003 [15:32:32] (03CR) 10Btullis: [V:03+1] Disable all dumps timers on snapshot hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [15:33:04] (03CR) 10Dzahn: [C:03+2] microsites: update recipient email for home dir size warning mails [puppet] - 10https://gerrit.wikimedia.org/r/1171260 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [15:34:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P79628 and previous config saved to /var/cache/conftool/dbconfig/20250722-153404-fceratto.json [15:34:17] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1002-1003].eqiad.wmnet with reason: Release v0.10.1 - cmooney@cumin1003 [15:38:04] (03PS6) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [15:38:12] (03CR) 10Dzahn: [V:03+1] varnish: new policy to allow websockets and caching, apply to phab (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [15:38:32] 06SRE, 10LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176 (10Novem_Linguae) 03NEW [15:40:03] (03CR) 10Dzahn: [C:03+1] Gitlab: switchover between gitlab-replica-a and gitlab-replica-b [dns] - 10https://gerrit.wikimedia.org/r/1171537 (https://phabricator.wikimedia.org/T400121) (owner: 10Arnaudb) [15:40:20] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [15:40:23] (03CR) 10Dzahn: [C:03+1] Gitlab: switchover between gitlab-replica-a and gitlab-replica-b [puppet] - 10https://gerrit.wikimedia.org/r/1171539 (https://phabricator.wikimedia.org/T400121) (owner: 10Arnaudb) [15:41:13] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11024741 (10elukey) The bootstrap-tile-storage script is memory hungry, and after a while it eats all the host's memory causing a nice freeze. I created a local copy with some modific... [15:41:16] (03PS7) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [15:41:57] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Undeleted file is an incorrect version - https://phabricator.wikimedia.org/T399892#11024751 (10jcrespo) Backup metadata: ` Title (spaces will be converted to underscores, first letter normally in uppercase): Der_Schatz_(1923).jpg This is the list... 
[15:43:41] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [15:49:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P79629 and previous config saved to /var/cache/conftool/dbconfig/20250722-154912-fceratto.json [16:00:05] jhathaway and moritzm: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:37] (03PS8) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [16:01:40] (03PS1) 10Hnowlan: rest-gateway: route did-you-know endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171600 (https://phabricator.wikimedia.org/T400168) [16:02:05] (03CR) 10Ssingh: varnish: new policy to allow websockets and caching, apply to phab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [16:02:52] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [16:04:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T399728)', diff saved to https://phabricator.wikimedia.org/P79630 and previous config saved to /var/cache/conftool/dbconfig/20250722-160419-fceratto.json [16:04:26] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [16:04:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 4:00:00 on db2198.codfw.wmnet with reason: Maintenance [16:06:15] !log sudo cumin "A:dnsbox" "disable-puppet 'merging CR 1170570'": T362392 [16:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:20] T362392: Routed Ganeti: Add support for VM BGP - https://phabricator.wikimedia.org/T362392 [16:07:16] XioNoX: ^ ready to go [16:07:21] (03PS9) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [16:07:24] cool [16:07:35] (03CR) 10Ayounsi: [C:03+2] Bird: VM side - add support for Routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [16:07:46] (03PS10) 10Effie Mouzeli: hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) [16:08:44] sukhe: puppet merging [16:08:47] thanks [16:09:55] (03CR) 10CI reject: [V:04-1] hcaptcha::proxy: use mtail for nginx- metrics [puppet] - 10https://gerrit.wikimedia.org/r/1171562 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [16:10:00] running puppet on doh7003 (to see if it does the right thing) and on doh4002 (to see if it does nothing as expected) [16:10:14] cool [16:10:19] I am running on dns7002 [16:10:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cirrussearch2091.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:10:32] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7002.magru.wmnet [reason: testing] [16:10:37] ha magru [16:10:43] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7002.wikimedia.org [reason: testing] [16:11:18] sukhe: all good for the 2 i mentioned [16:13:16] all good on dns7002 as well, NOOP [16:13:18] (03PS9) 10Fabfur: traffic: new
alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [16:13:26] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: testing] [16:13:39] !log sukhe@dns7002 START - running authdns-update [16:13:47] (03CR) 10Fabfur: traffic: new alerts for haproxykafka (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [16:13:56] nice! [16:14:07] trying one more, just because ... :) [16:14:35] !log sukhe@dns7002 END - running authdns-update [16:15:20] (03CR) 10CI reject: [V:04-1] traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) (owner: 10Fabfur) [16:15:39] XioNoX: all good :) [16:15:44] I am re-enabling and running there [16:15:46] sukhe: awesome! [16:15:54] all good on your end? [16:16:12] sukhe: yeah, we're all set for the Bird patch deploy [16:16:28] !log sudo cumin -b1 -s10 "A:dnsbox" "run-puppet-agent --enable 'merging CR 1170570'" [16:16:30] we can put the other magru durum/doh into service [16:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:12] cool, I will take care of those [16:18:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T399249)', diff saved to https://phabricator.wikimedia.org/P79631 and previous config saved to /var/cache/conftool/dbconfig/20250722-161823-marostegui.json [16:18:28] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:19:05] (03PS1) 10Btullis: Mark the dumpwikitech.sh script as executable [dumps] - 10https://gerrit.wikimedia.org/r/1171601 (https://phabricator.wikimedia.org/T398968) [16:20:44] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cirrussearch2091.mgmt.codfw.wmnet with chassis set policy 
GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:24:32] (03PS1) 10Ssingh: site.pp: move durum700[34] and doh7004 to specific roles [puppet] - 10https://gerrit.wikimedia.org/r/1171603 (https://phabricator.wikimedia.org/T362392) [16:24:41] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11024981 (10GPSLeo) This should not have been deleted, I have restored the page. We still need to find out what happened here and if the file can be restored. [16:30:53] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2091'] [16:31:20] (03CR) 10Ayounsi: [C:03+1] site.pp: move durum700[34] and doh7004 to specific roles [puppet] - 10https://gerrit.wikimedia.org/r/1171603 (https://phabricator.wikimedia.org/T362392) (owner: 10Ssingh) [16:31:32] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11025010 (10MatthewVernon) I've looked, and it's not in either swift cluster, and nor do the backups know anything about it (nor, indeed 'Laguna de Orurillo.jpg')) [16:33:22] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11025014 (10MatthewVernon) [which leads me to suspect the original is long gone, and we've only just discovered this due to a recent removal of old thumbnails] [16:33:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P79632 and previous config saved to /var/cache/conftool/dbconfig/20250722-163330-marostegui.json [16:45:18] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11025044 (10MatthewVernon) Courtesy of 
[[https://web.archive.org/web/20240306000000*/https://commons.wikimedia.org/wiki/File:LAGUNA_DE_ORURILLO.jpg | the web archive]], here's the file.... [16:47:54] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cirrussearch2091'] [16:48:21] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11025050 (10MatthewVernon) Which is smaller than what the current file page thinks the original was, so something strange has happened at some point in the past. [16:48:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P79633 and previous config saved to /var/cache/conftool/dbconfig/20250722-164838-marostegui.json [16:53:19] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Undeleted file is an incorrect version - https://phabricator.wikimedia.org/T399892#11025088 (10jcrespo) I found references to the old file: ` cumin2024@db1204.eqiad.wmnet[mediabackups]> select * FROM file_history where upload_name='Der_Schatz_(1923... 
[16:53:36] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2091'] [16:54:33] (03CR) 10Scott French: [C:03+1] rest-gateway: route did-you-know endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171600 (https://phabricator.wikimedia.org/T400168) (owner: 10Hnowlan) [16:55:32] jouncebot: nowandnext [16:55:32] For the next 0 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1600) [16:55:32] In 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1700) [16:56:53] (03CR) 10Ssingh: [C:03+2] site.pp: move durum700[34] and doh7004 to specific roles [puppet] - 10https://gerrit.wikimedia.org/r/1171603 (https://phabricator.wikimedia.org/T362392) (owner: 10Ssingh) [16:58:53] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm [16:58:55] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host doh7004.wikimedia.org with OS bookworm [17:00:01] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#11025130 (10jhathaway) >>! In T378028#10995390, @Arnoldokoth wrote: > Thanks @MoritzMuehlenhoff We'll consider that... But I'm doubtful... [17:00:03] 10SRE-swift-storage, 06Commons: File on Commons lost: File:LAGUNA DE ORURIILO.jpg - https://phabricator.wikimedia.org/T399389#11025131 (10GPSLeo) This is interesting. There is a very [[ https://commons.wikimedia.org/wiki/File:PEQUE%C3%91O_TITICACA.jpg | similar file ]] with only minor different crop. As the cl... [17:00:05] swfrench-wmf: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1700). 
[17:00:13] o/ [17:01:33] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2091'] [17:03:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T399249)', diff saved to https://phabricator.wikimedia.org/P79634 and previous config saved to /var/cache/conftool/dbconfig/20250722-170347-marostegui.json [17:03:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance [17:03:53] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:04:15] (03CR) 10Hnowlan: [C:03+2] rest-gateway: route did-you-know endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171600 (https://phabricator.wikimedia.org/T400168) (owner: 10Hnowlan) [17:05:39] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2023.codfw.wmnet with OS bookworm [17:05:45] (03CR) 10Scott French: "Thanks, Eric!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1113581 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [17:05:48] (03CR) 10Scott French: [C:03+2] Add data-gateway listener to mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1113581 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [17:06:24] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2091'] [17:07:59] (03Merged) 10jenkins-bot: rest-gateway: route did-you-know endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171600 (https://phabricator.wikimedia.org/T400168) (owner: 10Hnowlan) [17:09:54] jhancock@cumin1003 upgrade-firmware (PID 2957034) is awaiting input [17:12:24] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:12:31] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:12:34] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db1150.eqiad.wmnet [17:12:34] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1150.eqiad.wmnet [17:13:28] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:13:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2091'] [17:13:36] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:14:26] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:14:38] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:16:09] PROBLEM - Bird Internet Routing Daemon on durum7004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:19:49] ^ reimaging [17:20:33] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Undeleted file is an incorrect 
version - https://phabricator.wikimedia.org/T399892#11025179 (10jcrespo) No files were found, I am afraid the best I can offer is a crude recreation of the original, using the same method: {F65589204} [17:20:51] thanks, sukhe! [17:23:21] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2023.codfw.wmnet with reason: host reimage [17:24:56] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11025215 (10Jhancock.wm) @bking looks like cirrussearch2089 won't boot at all. it is under warranty, so I'm star... [17:26:11] !log swfrench@deploy1003 Started scap sync-world: Make data-gateway mesh listener available - T368096 [17:26:15] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [17:27:06] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum7004.magru.wmnet with OS bookworm [17:27:45] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2023.codfw.wmnet with reason: host reimage [17:29:05] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on doh7004.wikimedia.org with reason: host reimage [17:29:49] 06SRE: Increase the capacity of /var/cache/archiva on the appropriate archiva.wikimedia.org server(s) - https://phabricator.wikimedia.org/T400188 (10amastilovic) 03NEW [17:30:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11025252 (10cmooney) 05Open→03Resolved [17:31:16] (03PS1) 10Jcrespo: mariadb/dbbackups: Upgrade db2141, dbprov1003, dbprov2003 to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171610 (https://phabricator.wikimedia.org/T394487) [17:31:48] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db2141.codfw.wmnet with reason: upgrade mariadb [17:32:17] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7004.wikimedia.org with reason: host reimage [17:32:22] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbprov2003.codfw.wmnet,dbprov1003.eqiad.wmnet with reason: upgrade mariadb [17:32:57] !log swfrench@deploy1003 Finished scap sync-world: Make data-gateway mesh listener available - T368096 (duration: 06m 46s) [17:33:02] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [17:33:26] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cirrussearch2089.mgmt:22 - https://phabricator.wikimedia.org/T399943#11025275 (10Jhancock.wm) [17:33:27] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): cirrussearch2089 (A4) and cirrussearch2091 (A7) possible hardware issues - https://phabricator.wikimedia.org/T400099#11025276 (10Jhancock.wm) [17:34:13] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for mc-misc2001.mgmt:22 - https://phabricator.wikimedia.org/T399494#11025278 (10Jhancock.wm) [17:34:14] 10ops-codfw, 06SRE, 06DC-Ops: mc-misc2001 won't power up - https://phabricator.wikimedia.org/T395526#11025279 (10Jhancock.wm) [17:34:30] (03PS10) 10Fabfur: traffic: new alerts for haproxykafka [alerts] - 10https://gerrit.wikimedia.org/r/1171176 (https://phabricator.wikimedia.org/T400039) [17:35:09] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#11025280 (10Jhancock.wm) fyi i did a test on a server where the only thing i changed in the bios was the pxe and it rebooted to commit. so these will all need to be rebooted to commit these ch... 
[17:35:13] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3611, relocating_shards: 0, initializing_shards: 80, unassigned_shards: 818, delayed_unassigned_shards: 0, numbe [17:35:13] ding_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.08427589265914 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:37:35] (03CR) 10Jcrespo: [C:03+2] mariadb/dbbackups: Upgrade db2141, dbprov1003, dbprov2003 to 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1171610 (https://phabricator.wikimedia.org/T394487) (owner: 10Jcrespo) [17:39:08] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage [17:42:49] I'm done with work planned for the infra window [17:44:09] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage [17:44:20] !log un-drain Arelion 100G transport circuit IC-374549 cr1-eqiad <-> cr1-codfw after service restoration T399097 [17:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:25] T399097: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097 [17:47:00] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2141.codfw.wmnet [17:47:01] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2141.codfw.wmnet [17:47:23] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for dbprov2003.codfw.wmnet,dbprov1003.eqiad.wmnet [17:47:24] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime 
(exit_code=0) for dbprov2003.codfw.wmnet,dbprov1003.eqiad.wmnet [17:48:49] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11025312 (10cmooney) 05Open→03Resolved Circuit is still stable so I have un-drained and told Arelion they can close the ticket. I've asked for a full R... [17:54:15] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum7004.magru.wmnet with reason: host reimage [17:56:43] (03PS1) 10Cathal Mooney: Add ASN mapping and import policy for dse-k8s codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1171621 (https://phabricator.wikimedia.org/T400037) [17:57:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance [17:58:10] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1125 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3834, relocating_shards: 0, initializing_shards: 64, unassigned_shards: 611, delayed_unassigned_shards: 0, number_of_pend [17:58:10] s: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.02994011976048 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:10] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1099 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3834, relocating_shards: 0, initializing_shards: 64, unassigned_shards: 611, delayed_unassigned_shards: 0, number_of_pend [17:58:10] s: 1, number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.02994011976048 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:10] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1119 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3834, relocating_shards: 0, initializing_shards: 64, unassigned_shards: 611, delayed_unassigned_shards: 0, number_of_pend [17:58:11] s: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.02994011976048 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:11] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1109 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3834, relocating_shards: 0, initializing_shards: 64, unassigned_shards: 611, delayed_unassigned_shards: 0, number_of_pend [17:58:12] s: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.02994011976048 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:12] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1114 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3835, relocating_shards: 0, initializing_shards: 64, unassigned_shards: 610, delayed_unassigned_shards: 0, number_of_pend [17:58:12] s: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 85.05211798624973 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:13] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1092 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3835, relocating_shards: 0, initializing_shards: 65, unassigned_shards: 609, delayed_unassigned_shards: 0, number_of_pend [17:58:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1118 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3853, relocating_shards: 0, initializing_shards: 65, unassigned_shards: 591, delayed_unassigned_shards: 0, number_of_pend [17:58:53] s: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.4513195830561 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:54] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1074 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3853, relocating_shards: 0, initializing_shards: 65, unassigned_shards: 591, delayed_unassigned_shards: 0, number_of_pend [17:58:54] s: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.4513195830561 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:58:54] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status 
production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3853, relocating_shards: 0, initializing_shards: 65, unassigned_shards: 591, delayed_unassigned_shards: 0, number_of_pend [17:58:54] s: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.4513195830561 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:59:02] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1068 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1453, active_shards: 3856, relocating_shards: 0, initializing_shards: 65, unassigned_shards: 588, delayed_unassigned_shards: 0, number_of_pend [17:59:02] s: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.51785318252384 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:59:19] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7004.magru.wmnet with reason: host reimage [17:59:36] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2145.codfw.wmnet with reason: Maintenance [17:59:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T399728)', diff saved to https://phabricator.wikimedia.org/P79635 and previous config saved to /var/cache/conftool/dbconfig/20250722-175943-fceratto.json [17:59:48] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [17:59:53] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2053 - 
https://phabricator.wikimedia.org/T400195 (10RobH) 03NEW [18:00:05] dduvall and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T1800). [18:00:15] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2053 - https://phabricator.wikimedia.org/T400195#11025411 (10RobH) [18:00:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:01:02] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2053 - https://phabricator.wikimedia.org/T400195#11025415 (10RobH) a:03Marostegui @Marostegui, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) an... [18:02:12] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2055 - https://phabricator.wikimedia.org/T400195#11025431 (10RobH) [18:02:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance [18:03:16] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198 (10RobH) 03NEW [18:03:34] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11025460 (10RobH) a:03Marostegui Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new... 
[18:03:36] (03PS1) 10Dzahn: use /sbin/tini as entrypoint [container/codesearch] - 10https://gerrit.wikimedia.org/r/1171630 (https://phabricator.wikimedia.org/T268199) [18:03:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T399728)', diff saved to https://phabricator.wikimedia.org/P79636 and previous config saved to /var/cache/conftool/dbconfig/20250722-180347-fceratto.json [18:03:52] (03CR) 10CI reject: [V:04-1] use /sbin/tini as entrypoint [container/codesearch] - 10https://gerrit.wikimedia.org/r/1171630 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:04:03] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11025472 (10RobH) [18:04:59] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11025473 (10RobH) [18:05:21] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11025475 (10RobH) @Marostegui please note the racking details didn't list 9 hostnames, but the order is for 9 hosts. I've appended 2 additional hostnames to the list. 
[18:06:09] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11025476 (10RobH) [18:06:31] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06Infrastructure-Foundations: decommission puppetserver2003 - https://phabricator.wikimedia.org/T398607#11025477 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:08:33] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171644 (https://phabricator.wikimedia.org/T396372) [18:08:35] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171644 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot) [18:09:26] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171644 (https://phabricator.wikimedia.org/T396372) (owner: 10TrainBranchBot) [18:17:30] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.11 refs T396372 [18:17:35] T396372: 1.45.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T396372 [18:18:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P79637 and previous config saved to /var/cache/conftool/dbconfig/20250722-181854-fceratto.json [18:23:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-be200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T398849#11025533 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:25:59] 06SRE, 06Commons, 10MediaWiki-Uploading, 06Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#11025546 (10BCornwall) 05Open→03Stalled I see. Thank you for the response. I'll set this as "stalled". Please do report back if this is continuing! 
[18:30:52] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission backup2001 and its 2 disk arrays - https://phabricator.wikimedia.org/T398188#11025569 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:30:58] (03CR) 10Andrew Bogott: [C:03+2] profile::cloudceph::osd: explicitly install ceph-volume [puppet] - 10https://gerrit.wikimedia.org/r/1171596 (owner: 10Andrew Bogott) [18:34:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P79638 and previous config saved to /var/cache/conftool/dbconfig/20250722-183402-fceratto.json [18:44:33] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7004.magru.wmnet with OS bookworm [18:45:32] (03CR) 10Ottomata: [C:03+1] eventbus: register with team-data-engineering. [alerts] - 10https://gerrit.wikimedia.org/r/1168119 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [18:45:37] (03CR) 10Ottomata: [C:03+1] eventgate: alert on traffic deviation. 
[alerts] - 10https://gerrit.wikimedia.org/r/1167620 (https://phabricator.wikimedia.org/T398437) (owner: 10Gmodena) [18:46:38] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm [18:49:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T399728)', diff saved to https://phabricator.wikimedia.org/P79639 and previous config saved to /var/cache/conftool/dbconfig/20250722-184909-fceratto.json [18:49:15] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [18:49:22] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7004.wikimedia.org with OS bookworm [18:49:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2146.codfw.wmnet with reason: Maintenance [18:49:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T399728)', diff saved to https://phabricator.wikimedia.org/P79640 and previous config saved to /var/cache/conftool/dbconfig/20250722-184932-fceratto.json [18:53:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T399728)', diff saved to https://phabricator.wikimedia.org/P79641 and previous config saved to /var/cache/conftool/dbconfig/20250722-185323-fceratto.json [18:53:53] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats showing for some gnmic sourced metrics in codfw - https://phabricator.wikimedia.org/T400205 (10cmooney) 03NEW p:05Triage→03Medium [18:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:57:55] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats showing for some gnmic sourced 
metrics in codfw - https://phabricator.wikimedia.org/T400205#11025679 (10cmooney) Hmm so looking a bit closer the issue seems to be counters on cr2-codfw itself ` cmooney@re0.cr2-codfw> show interfaces xe-0/1/1:1 |... [18:58:11] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats showing for cr2-codfw stats in codfw - https://phabricator.wikimedia.org/T400205#11025680 (10cmooney) [18:58:42] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats showing for cr2-codfw stats in codfw - https://phabricator.wikimedia.org/T400205#11025681 (10cmooney) [19:00:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:01:16] (03PS13) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [19:01:20] (03CR) 10BCornwall: [C:03+2] ncredir: Funnel pywikipedia.org to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/1171250 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [19:01:36] (03CR) 10Fabfur: haproxy: script to perform configuration validation (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [19:01:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2208.codfw.wmnet with reason: Maintenance [19:01:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T399249)', diff saved to https://phabricator.wikimedia.org/P79642 and previous config saved to /var/cache/conftool/dbconfig/20250722-190144-marostegui.json [19:01:45] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats showing for cr2-codfw stats in codfw - https://phabricator.wikimedia.org/T400205#11025685 (10cmooney) 
>>! In T400205#11025679, @cmooney wrote: > Perhaps some odd bug to do with the new MPC10E card? This possibly? https://supportportal.juniper.... [19:01:49] (03CR) 10Fabfur: haproxy: script to perform configuration validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [19:01:49] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [19:05:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:08:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P79643 and previous config saved to /var/cache/conftool/dbconfig/20250722-190830-fceratto.json [19:08:39] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats repoted by cr2-codfw - https://phabricator.wikimedia.org/T400205#11025713 (10cmooney) [19:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:09:34] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11025714 (10cmooney) [19:10:51] andrewbogott: just from the timing and hostnames, that WidespreadPuppetFailure looks like https://gerrit.wikimedia.org/r/1171596, is that known? [19:10:53] (e.g. 
https://puppetboard.wikimedia.org/report/cloudcephosd1020.eqiad.wmnet/6ab923f57398a544cb6c6e2fdc7c20af4053a961) [19:11:43] 06SRE, 10Pywikibot, 13Patch-For-Review: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#11025715 (10BCornwall) Thanks for your patience. Hopefully we're done-done now. :) [19:11:51] (03PS1) 10Ottomata: eventgate - bump to version 1.15.0 for -external facing deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171697 (https://phabricator.wikimedia.org/T376026) [19:13:09] (03PS1) 10Jgreen: Switch payments.wm.o back to pybal/lvs. [dns] - 10https://gerrit.wikimedia.org/r/1171698 (https://phabricator.wikimedia.org/T398321) [19:13:32] rzl: looking. It definitely didn't break puppet on the actual host I intended the fix for... [19:13:57] oh, I see the problem [19:13:58] (03CR) 10Jgreen: [C:03+2] Switch payments.wm.o back to pybal/lvs. [dns] - 10https://gerrit.wikimedia.org/r/1171698 (https://phabricator.wikimedia.org/T398321) (owner: 10Jgreen) [19:14:11] I guess I need a version check, for some reason they changed the package names between versions. [19:14:11] !log jgreen@dns1004 START - running authdns-update [19:14:17] Thanks for the ping, I will fix! [19:14:36] thanks! 
[19:15:16] !log jgreen@dns1004 END - running authdns-update [19:16:01] (03PS1) 10Andrew Bogott: Revert "profile::cloudceph::osd: explicitly install ceph-volume" [puppet] - 10https://gerrit.wikimedia.org/r/1171699 [19:18:08] (03CR) 10Andrew Bogott: [C:03+2] Revert "profile::cloudceph::osd: explicitly install ceph-volume" [puppet] - 10https://gerrit.wikimedia.org/r/1171699 (owner: 10Andrew Bogott) [19:21:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#11025741 (10jhathaway) 05Open→03Resolved Closing for now, as the testing of the intel card is complete for the moment [19:21:56] 06SRE, 06Infrastructure-Foundations, 10netops: Inaccurate stats reported by cr2-codfw - https://phabricator.wikimedia.org/T400205#11025744 (10cmooney) The linked PR on the Juniper site says it was fixed in 23.4R1, we are on 23.4R2, so in theory shouldn't be it. I guess we could try the same fix, probably th... [19:23:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P79644 and previous config saved to /var/cache/conftool/dbconfig/20250722-192338-fceratto.json [19:24:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[19:24:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [19:25:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:28:28] (03PS1) 10Andrew Bogott: Explicitly install ceph-volume for new ceph versions but not for Pacific, which doesn't provide that package. [puppet] - 10https://gerrit.wikimedia.org/r/1171700 [19:28:51] (03PS2) 10Andrew Bogott: Install ceph-volume for new ceph versions but not for Pacific, which doesn't provide that package. [puppet] - 10https://gerrit.wikimedia.org/r/1171700 [19:28:52] (03CR) 10CI reject: [V:04-1] Install ceph-volume for new ceph versions but not for Pacific, which doesn't provide that package. [puppet] - 10https://gerrit.wikimedia.org/r/1171700 (owner: 10Andrew Bogott) [19:29:15] (03CR) 10CI reject: [V:04-1] Install ceph-volume for new ceph versions but not for Pacific, which doesn't provide that package. 
[puppet] - 10https://gerrit.wikimedia.org/r/1171700 (owner: 10Andrew Bogott) [19:30:01] (03PS3) 10Andrew Bogott: Install ceph-volume for new ceph versions but not for Pacific [puppet] - 10https://gerrit.wikimedia.org/r/1171700 [19:30:26] (03CR) 10CI reject: [V:04-1] Install ceph-volume for new ceph versions but not for Pacific [puppet] - 10https://gerrit.wikimedia.org/r/1171700 (owner: 10Andrew Bogott) [19:32:51] (03CR) 10Ottomata: [C:03+2] eventgate - bump to version 1.15.0 for -external facing deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171697 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata) [19:34:16] (03PS4) 10Andrew Bogott: Install ceph-volume for new ceph versions but not for Pacific [puppet] - 10https://gerrit.wikimedia.org/r/1171700 [19:34:50] dduvall: dancy is train done? I'd like to deploy an unrelated eventgate thing, but I don't want to step on toes [19:35:05] ottomata: all done [19:35:43] ty [19:36:00] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [19:36:29] (03PS1) 10Eevans: image-suggestion: reconfigure for data-gateway listener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) [19:36:31] (03PS1) 10Eevans: image-suggestion: cleanup unused refs to service listener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171703 (https://phabricator.wikimedia.org/T368096) [19:37:27] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [19:37:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1171700 (owner: 10Andrew Bogott) [19:38:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T399728)', diff saved to https://phabricator.wikimedia.org/P79645 and previous config saved to /var/cache/conftool/dbconfig/20250722-193846-fceratto.json [19:38:50] !log otto@deploy1003 helmfile [codfw] START 
helmfile.d/services/eventgate-logging-external: apply [19:38:51] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [19:39:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2153.codfw.wmnet with reason: Maintenance [19:39:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T399728)', diff saved to https://phabricator.wikimedia.org/P79646 and previous config saved to /var/cache/conftool/dbconfig/20250722-193908-fceratto.json [19:39:34] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [19:39:38] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Install serial port breakout card on sretest2001 - https://phabricator.wikimedia.org/T400211#11025805 (10jhathaway) a:05jhathaway→03Papaul [19:39:49] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [19:40:36] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [19:41:11] !log deploying eventgate-logging-external and eventgate-analytics-external to pick up meta.dt change - T376026 [19:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:15] T376026: Update event-producing tools to overwrite `meta.dt` - https://phabricator.wikimedia.org/T376026 [19:41:29] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [19:41:51] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [19:42:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T399728)', diff saved to https://phabricator.wikimedia.org/P79647 and previous config saved to /var/cache/conftool/dbconfig/20250722-194250-fceratto.json [19:43:23] !log otto@deploy1003 helmfile [codfw] START 
helmfile.d/services/eventgate-analytics-external: apply [19:44:36] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [19:46:20] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171705 [19:50:34] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11025823 (10VRiley-WMF) We have received the PDU and they have sent a new one in its place. It has been unboxed and is in the cage. Will need to schedule Equinix to plug it in. [19:51:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11025824 (10VRiley-WMF) a:03VRiley-WMF [19:51:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11025826 (10VRiley-WMF) [19:57:38] (03CR) 10Subramanya Sastry: "Still needed after https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ParserMigration/+/1170511/7/extension.json merges?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [19:57:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P79648 and previous config saved to /var/cache/conftool/dbconfig/20250722-195757-fceratto.json [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T2000). [20:00:04] No Gerrit patches in the queue for this window AFAICS. 
[20:00:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11025850 (10VRiley-WMF) clouddb1022 Rack A4 U33 CableID: 20220030 Port: 43 clouddb1023 Rack B2 U32 CableID: 5253 Port: 39 clouddb1024 Rack E8 U35 clouddb1025... [20:00:21] 10ops-eqiad, 06SRE, 06DC-Ops: RMA Damaged Pdu E14 - https://phabricator.wikimedia.org/T395971#11025853 (10Jclark-ctr) @VRiley-WMF There is no power to this PDU, and no power receptacle is installed above the rack E 14. At this time, nothing is required from Equinix to complete the RMA. [20:04:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T399249)', diff saved to https://phabricator.wikimedia.org/P79649 and previous config saved to /var/cache/conftool/dbconfig/20250722-200454-marostegui.json [20:05:00] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [20:13:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P79650 and previous config saved to /var/cache/conftool/dbconfig/20250722-201305-fceratto.json [20:13:56] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 2688 MB (3% inode=89%): /tmp 2688 MB (3% inode=89%): /var/tmp 2688 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [20:15:48] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye [20:17:39] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213 (10RobH) 03NEW [20:18:42] 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11025913 (10RobH) [20:19:21] 
10ops-codfw, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11025914 (10RobH) a:03Marostegui Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new se... [20:20:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P79651 and previous config saved to /var/cache/conftool/dbconfig/20250722-202002-marostegui.json [20:20:47] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214 (10RobH) 03NEW [20:21:24] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11025935 (10RobH) [20:22:01] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11025937 (10RobH) a:03Marostegui Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new se... 
[20:24:59] jclark@cumin1002 reimage (PID 1284441) is awaiting input [20:27:56] (03CR) 10Andrew Bogott: [C:03+2] Install ceph-volume for new ceph versions but not for Pacific [puppet] - 10https://gerrit.wikimedia.org/r/1171700 (owner: 10Andrew Bogott) [20:28:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T399728)', diff saved to https://phabricator.wikimedia.org/P79653 and previous config saved to /var/cache/conftool/dbconfig/20250722-202813-fceratto.json [20:28:18] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [20:28:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:28:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2170.codfw.wmnet with reason: Maintenance [20:28:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T399728)', diff saved to https://phabricator.wikimedia.org/P79654 and previous config saved to /var/cache/conftool/dbconfig/20250722-202835-fceratto.json [20:31:43] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [20:32:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T399728)', diff saved to https://phabricator.wikimedia.org/P79655 and previous config saved to /var/cache/conftool/dbconfig/20250722-203217-fceratto.json [20:33:27] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:35:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P79656 and previous config saved to /var/cache/conftool/dbconfig/20250722-203509-marostegui.json [20:38:15] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [20:47:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P79657 and previous config saved to /var/cache/conftool/dbconfig/20250722-204725-fceratto.json [20:50:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T399249)', diff saved to https://phabricator.wikimedia.org/P79658 and previous config saved to /var/cache/conftool/dbconfig/20250722-205017-marostegui.json [20:50:22] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [20:50:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2220.codfw.wmnet with reason: Maintenance [20:50:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T399249)', diff saved to https://phabricator.wikimedia.org/P79659 and previous config saved to /var/cache/conftool/dbconfig/20250722-205039-marostegui.json [20:56:17] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250722T2100) [21:02:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P79660 and previous 
config saved to /var/cache/conftool/dbconfig/20250722-210232-fceratto.json [21:03:54] (03CR) 10Subramanya Sastry: "We could backport that patch instead." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165094 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [21:17:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T399728)', diff saved to https://phabricator.wikimedia.org/P79661 and previous config saved to /var/cache/conftool/dbconfig/20250722-211739-fceratto.json [21:17:45] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [21:17:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2173.codfw.wmnet with reason: Maintenance [21:18:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T399728)', diff saved to https://phabricator.wikimedia.org/P79662 and previous config saved to /var/cache/conftool/dbconfig/20250722-211803-fceratto.json [21:21:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T399728)', diff saved to https://phabricator.wikimedia.org/P79664 and previous config saved to /var/cache/conftool/dbconfig/20250722-212144-fceratto.json [21:28:39] (03PS1) 10Cwhite: opensearch: curator instance config to follow $enable_curator [puppet] - 10https://gerrit.wikimedia.org/r/1171713 (https://phabricator.wikimedia.org/T353912) [21:29:10] (03CR) 10CI reject: [V:04-1] opensearch: curator instance config to follow $enable_curator [puppet] - 10https://gerrit.wikimedia.org/r/1171713 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite) [21:29:52] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye [21:31:19] (03PS2) 10Cwhite: opensearch: curator instance config to follow $enable_curator [puppet] - 
10https://gerrit.wikimedia.org/r/1171713 (https://phabricator.wikimedia.org/T353912) [21:31:58] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11026050 (10dancy) 05Resolved→03Open The `docker-registry.wikimedia.org/bullseye:latest@sha256:c9fff9943e3e3d42774f94b6ef07c1c53c417fc9fa964400d769b21a4f3ae28f` image (... [21:36:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P79665 and previous config saved to /var/cache/conftool/dbconfig/20250722-213652-fceratto.json [21:37:36] (03CR) 10Cwhite: "PCC OK:https://puppet-compiler.wmflabs.org/output/1171713/6397/" [puppet] - 10https://gerrit.wikimedia.org/r/1171713 (https://phabricator.wikimedia.org/T353912) (owner: 10Cwhite) [21:40:36] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11026072 (10Scott_French) I just ran into this as well. It appears that we're still configuring bullseye base image builds to include bullseye-backports in the sources list... 
[21:41:05] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2023.codfw.wmnet with OS bookworm [21:41:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:45:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:46:16] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:49:19] (03PS1) 10Dzahn: set some ARGs as literals [container/codesearch] - 10https://gerrit.wikimedia.org/r/1171715 [21:49:44] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [21:50:11] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2024.codfw.wmnet with OS bookworm [21:50:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T399249)', diff saved to https://phabricator.wikimedia.org/P79666 and previous config saved to /var/cache/conftool/dbconfig/20250722-215018-marostegui.json [21:50:24] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:51:06] jclark@cumin1002 provision (PID 1369861) is awaiting input [21:52:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P79667 and previous config saved to /var/cache/conftool/dbconfig/20250722-215200-fceratto.json [21:53:34] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [21:56:33] !log jclark@cumin1002 END 
(FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cirrussearch2089.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:58:37] (03CR) 10Dzahn: [C:03+2] set some ARGs as literals [container/codesearch] - 10https://gerrit.wikimedia.org/r/1171715 (owner: 10Dzahn) [21:58:51] (03Merged) 10jenkins-bot: set some ARGs as literals [container/codesearch] - 10https://gerrit.wikimedia.org/r/1171715 (owner: 10Dzahn) [21:59:09] (03CR) 10Cwhite: [C:03+2] Revert "logstash: remove event.duration when value is hyphen" [puppet] - 10https://gerrit.wikimedia.org/r/1168234 (owner: 10Cwhite) [22:00:18] (03PS1) 10Scott French: docker: remove bullseye-backports from sources.list [puppet] - 10https://gerrit.wikimedia.org/r/1171716 (https://phabricator.wikimedia.org/T383557) [22:02:16] (03CR) 10Scott French: "PCC diff looks like what I would expect (i.e., drops `bullseye-backports` from `/srv/images/base/sources/bullseye.sources.list`)." [puppet] - 10https://gerrit.wikimedia.org/r/1171716 (https://phabricator.wikimedia.org/T383557) (owner: 10Scott French) [22:04:11] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10Mail: Add DMarcian trial-account address to the dmarc-ruf@wikimedia.org postfix mailing list - https://phabricator.wikimedia.org/T396062#11026117 (10jhathaway) 05Open→03Resolved a:03jhathaway ysuu9wx7@ag.us.dmarcian.com has been ad... 
[22:05:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P79668 and previous config saved to /var/cache/conftool/dbconfig/20250722-220525-marostegui.json [22:07:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T399728)', diff saved to https://phabricator.wikimedia.org/P79669 and previous config saved to /var/cache/conftool/dbconfig/20250722-220707-fceratto.json [22:07:13] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [22:07:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2174.codfw.wmnet with reason: Maintenance [22:07:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T399728)', diff saved to https://phabricator.wikimedia.org/P79670 and previous config saved to /var/cache/conftool/dbconfig/20250722-220730-fceratto.json [22:08:03] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2024.codfw.wmnet with reason: host reimage [22:10:53] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#11026153 (10jhathaway) >>! In T394788#11017745, @nisrael wrote: > Hi SRE team, > > Checking in on this task. Do you have an approximate t... 
[22:11:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T399728)', diff saved to https://phabricator.wikimedia.org/P79671 and previous config saved to /var/cache/conftool/dbconfig/20250722-221111-fceratto.json [22:12:45] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2024.codfw.wmnet with reason: host reimage [22:17:31] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#11026158 (10Scott_French) I'm no longer seeing any references to bullseye-backports in puppet, so I believe Moritz took care of all of those. Once https://... [22:20:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P79672 and previous config saved to /var/cache/conftool/dbconfig/20250722-222033-marostegui.json [22:26:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P79673 and previous config saved to /var/cache/conftool/dbconfig/20250722-222619-fceratto.json [22:35:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T399249)', diff saved to https://phabricator.wikimedia.org/P79674 and previous config saved to /var/cache/conftool/dbconfig/20250722-223540-marostegui.json [22:35:46] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:35:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2221.codfw.wmnet with reason: Maintenance [22:36:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T399249)', diff saved to https://phabricator.wikimedia.org/P79675 and previous config saved to /var/cache/conftool/dbconfig/20250722-223603-marostegui.json [22:41:27] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P79676 and previous config saved to /var/cache/conftool/dbconfig/20250722-224126-fceratto.json [22:47:17] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2024.codfw.wmnet with OS bookworm [22:49:26] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2025.codfw.wmnet with OS bookworm [22:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:56:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T399728)', diff saved to https://phabricator.wikimedia.org/P79677 and previous config saved to /var/cache/conftool/dbconfig/20250722-225634-fceratto.json [22:56:40] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [22:56:45] andrew@cumin1003 reimage (PID 2990330) is awaiting input [22:56:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2176.codfw.wmnet with reason: Maintenance [22:56:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T399728)', diff saved to https://phabricator.wikimedia.org/P79678 and previous config saved to /var/cache/conftool/dbconfig/20250722-225657-fceratto.json [22:58:29] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11026237 (10wiki_willy) 05Resolved→03Open Re-opening. @Jhancock.wm - per @Marostegui's previous comment: > Never mind this, I was using the wrong DC. > Can you do the RAID10 for u... 
[23:00:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T399728)', diff saved to https://phabricator.wikimedia.org/P79679 and previous config saved to /var/cache/conftool/dbconfig/20250722-230039-fceratto.json [23:08:20] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2025.codfw.wmnet with reason: host reimage [23:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:14:18] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2025.codfw.wmnet with reason: host reimage [23:15:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P79680 and previous config saved to /var/cache/conftool/dbconfig/20250722-231547-fceratto.json [23:21:38] (03PS1) 10Aleksandar Mastilovic: Blunderbuss helm chart that works with the new Blunderbuss versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1171732 (https://phabricator.wikimedia.org/T392244) [23:24:59] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[23:24:59] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [23:30:44] (03Abandoned) 10Aleksandar Mastilovic: Rename to Blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091389 (owner: 10Aleksandar Mastilovic) [23:30:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P79681 and previous config saved to /var/cache/conftool/dbconfig/20250722-233055-fceratto.json [23:31:02] (03Abandoned) 10Aleksandar Mastilovic: All the necessary changes and missing files to make helm linter happy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092339 (owner: 10Aleksandar Mastilovic) [23:35:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T399249)', diff saved to https://phabricator.wikimedia.org/P79682 and previous config saved to /var/cache/conftool/dbconfig/20250722-233543-marostegui.json [23:35:48] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [23:38:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1171733 [23:38:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1171733 (owner: 10TrainBranchBot) [23:40:31] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 167889 MB (4% inode=99%): /var/lib/hadoop/data/h 146789 MB (3% inode=99%): /var/lib/hadoop/data/b 
160783 MB (4% inode=99%): /var/lib/hadoop/data/k 170430 MB (4% inode=99%): /var/lib/hadoop/data/m 155988 MB (4% inode=99%): /var/lib/hadoop/data/f 140071 MB (3% inode=99%): /var/lib/hadoop/data/j 158482 MB (4% inode=99%): /var/lib/hadoop/data [23:40:31] 5 MB (4% inode=99%): /var/lib/hadoop/data/l 165237 MB (4% inode=99%): /var/lib/hadoop/data/i 170135 MB (4% inode=99%): /var/lib/hadoop/data/g 175001 MB (4% inode=99%): /var/lib/hadoop/data/c 164807 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops [23:46:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T399728)', diff saved to https://phabricator.wikimedia.org/P79683 and previous config saved to /var/cache/conftool/dbconfig/20250722-234602-fceratto.json [23:46:07] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [23:46:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2188.codfw.wmnet with reason: Maintenance [23:46:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T399728)', diff saved to https://phabricator.wikimedia.org/P79684 and previous config saved to /var/cache/conftool/dbconfig/20250722-234625-fceratto.json [23:50:03] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2025.codfw.wmnet with OS bookworm [23:50:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T399728)', diff saved to https://phabricator.wikimedia.org/P79685 and previous config saved to /var/cache/conftool/dbconfig/20250722-235009-fceratto.json [23:50:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P79686 and previous 
config saved to /var/cache/conftool/dbconfig/20250722-235051-marostegui.json [23:53:34] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1171733 (owner: 10TrainBranchBot)