[00:10:28] (03PS1) 10Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) [00:11:03] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 656.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:38:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1117641 [00:38:16] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1117641 (owner: 10TrainBranchBot) [00:49:59] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1117641 (owner: 10TrainBranchBot) [01:08:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1117644 [01:08:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1117644 (owner: 10TrainBranchBot) [01:25:04] (03CR) 10Scott French: "One problem, but otherwise looks good. Thanks, @cwhite@wikimedia.org!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [01:26:35] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1117644 (owner: 10TrainBranchBot) [01:41:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [01:46:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [01:53:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:14:05] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:36:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T384592)', diff saved to https://phabricator.wikimedia.org/P73271 and previous config saved to /var/cache/conftool/dbconfig/20250206-023626-marostegui.json [02:36:30] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:19] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:51:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P73272 and previous config saved to /var/cache/conftool/dbconfig/20250206-025134-marostegui.json [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P73273 and previous config saved to /var/cache/conftool/dbconfig/20250206-030641-marostegui.json [03:09:24] (03CR) 10RLazarus: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1117603 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [03:11:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:21:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T384592)', diff saved to https://phabricator.wikimedia.org/P73274 and previous config saved to /var/cache/conftool/dbconfig/20250206-032148-marostegui.json [03:21:52] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [03:22:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:04:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:04:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:04:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117113 (https://phabricator.wikimedia.org/T383789) (owner: 10KartikMistry) [05:05:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117594 (https://phabricator.wikimedia.org/T385185) (owner: 10Pppery) [05:16:23] PROBLEM - BGP status on cr3-ulsfo 
is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:35:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:35:23] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10527641 (10HormigasAIS) cloudgw1003.eqiad.wmnet: C8 cloudgw1004.eqiad.wmnet: D5{F58366376} [05:38:41] RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (17955 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [05:49:05] (03CR) 10Kevin Bazira: [C:03+1] ml-services: increase cpu and memory for reference-quality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117585 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [05:49:49] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:53:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:12:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:54:21] (03PS1) 10Marostegui: Revert "db1237: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1117786 [06:54:33] (03PS1) 10Marostegui: Revert 
"db1179: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1117787 [06:54:56] (03CR) 10Marostegui: [C:03+2] Revert "db1237: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1117786 (owner: 10Marostegui) [06:55:05] (03CR) 10Marostegui: [C:03+2] Revert "db1179: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1117787 (owner: 10Marostegui) [06:58:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2208 db1194 T385550', diff saved to https://phabricator.wikimedia.org/P73275 and previous config saved to /var/cache/conftool/dbconfig/20250206-065759-marostegui.json [06:58:03] T385550: Upgrade and rebuild s7 - https://phabricator.wikimedia.org/T385550 [06:58:11] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1194.eqiad.wmnet [06:58:16] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2208.codfw.wmnet [06:59:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2192 with weight 0 T385148', diff saved to https://phabricator.wikimedia.org/P73276 and previous config saved to /var/cache/conftool/dbconfig/20250206-065925-root.json [06:59:29] T385148: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T385148 [06:59:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T385148 [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250206T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250206T0700). 
[07:00:07] (03PS2) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1115329 (https://phabricator.wikimedia.org/T385148) [07:00:28] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2192 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1115328 (https://phabricator.wikimedia.org/T385148) (owner: 10Gerrit maintenance bot) [07:02:52] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2208.codfw.wmnet [07:04:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:05:22] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1194.eqiad.wmnet [07:07:03] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2208.codfw.wmnet with reason: Index rebuild [07:07:24] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Index rebuild [07:11:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:18:24] !log Starting s5 codfw failover from db2213 to db2192 - T385148 [07:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:27] T385148: Switchover s5 master (db2213 -> db2192) - https://phabricator.wikimedia.org/T385148 [07:18:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s5 codfw as read-only for maintenance - T385148', diff saved to https://phabricator.wikimedia.org/P73277 and previous config saved to /var/cache/conftool/dbconfig/20250206-071836-root.json [07:19:03] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2192 to s5 primary and set section read-write T385148', diff saved to https://phabricator.wikimedia.org/P73278 and previous config saved to /var/cache/conftool/dbconfig/20250206-071902-root.json [07:19:39] (03CR) 10Marostegui: [C:03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1115329 (https://phabricator.wikimedia.org/T385148) (owner: 10Gerrit maintenance bot) [07:19:42] !log marostegui@dns1006 START - running authdns-update [07:20:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2213 T385148', diff saved to https://phabricator.wikimedia.org/P73279 and previous config saved to /var/cache/conftool/dbconfig/20250206-072020-marostegui.json [07:21:36] !log marostegui@dns1006 END - running authdns-update [07:23:57] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2213.codfw.wmnet [07:28:12] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2213.codfw.wmnet [07:28:45] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2213.codfw.wmnet with reason: Index rebuild [07:35:04] (03CR) 10Muehlenhoff: [C:03+2] logstash: Grant access to cn=ops [puppet] - 10https://gerrit.wikimedia.org/r/1115808 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [07:40:18] (03CR) 10Muehlenhoff: "Doh! Of course, fixed." [puppet] - 10https://gerrit.wikimedia.org/r/1115863 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [07:40:21] (03PS2) 10Muehlenhoff: Enable maps-test2003 to maps-test2006 as additional maps bookworm replicas [puppet] - 10https://gerrit.wikimedia.org/r/1115863 (https://phabricator.wikimedia.org/T381565) [07:44:28] (03CR) 10Alexandros Kosiaris: [C:03+1] "Nicely written commit message, thanks! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1117603 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [08:00:04] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250206T0800). [08:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:02:05] * kart_ is here.. [08:02:10] Self deploying.. [08:03:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117113 (https://phabricator.wikimedia.org/T383789) (owner: 10KartikMistry) [08:03:39] (03CR) 10Muehlenhoff: [C:03+1] "Both components are no longer used, decommenting or simply removing them entirely are both fine" [puppet] - 10https://gerrit.wikimedia.org/r/1117624 (owner: 10CDanis) [08:04:08] (03Merged) 10jenkins-bot: Make MT limit more strict by 10 Percentage Point in Bhojpuri Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117113 (https://phabricator.wikimedia.org/T383789) (owner: 10KartikMistry) [08:05:00] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1117113|Make MT limit more strict by 10 Percentage Point in Bhojpuri Wikipedia (T383789)]] [08:05:04] T383789: Make MT limit more strict by 10% in Bhojpuri Wikipedia - https://phabricator.wikimedia.org/T383789 [08:06:02] (03PS1) 10Muehlenhoff: Remove use of openstack-db repository component [puppet] - 10https://gerrit.wikimedia.org/r/1117838 [08:08:12] (03CR) 10Muehlenhoff: "Not yet, but soon: Once https://phabricator.wikimedia.org/T381576 is resolved by DC ops, we can test the patch when we add them." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1117554 (https://phabricator.wikimedia.org/T381175) (owner: 10Ayounsi) [08:09:36] !log kartik@deploy2002 kartik: Backport for [[gerrit:1117113|Make MT limit more strict by 10 Percentage Point in Bhojpuri Wikipedia (T383789)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:10:57] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1117568 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [08:11:55] !log kartik@deploy2002 kartik: Continuing with sync [08:12:33] (03PS2) 10Pppery: Enable section translation on Kanuri Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117594 (https://phabricator.wikimedia.org/T385185) [08:12:38] !log T385770 Ran mwscript-k8s extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=dawiki --logwiki=metawiki 'Sprucecopse' 'Renamed user 7cf752558fab818efdcacff8255d91ca' [08:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:41] T385770: Unblock stuck global rename of Renamed user 7cf752558fab818efdcacff8255d91ca - https://phabricator.wikimedia.org/T385770 [08:16:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet [08:16:59] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10527817 (10ops-monitoring-bot) Draining ganeti1038.eqiad.wmnet of running VMs [08:17:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet [08:18:35] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117113|Make MT limit more strict by 10 Percentage Point in Bhojpuri Wikipedia (T383789)]] (duration: 13m 34s) [08:18:38] T383789: Make MT limit more strict by 10% in Bhojpuri Wikipedia - https://phabricator.wikimedia.org/T383789 [08:21:04] (03CR) 
10Elukey: [C:03+1] Enable maps-test2003 to maps-test2006 as additional maps bookworm replicas [puppet] - 10https://gerrit.wikimedia.org/r/1115863 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:21:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73280 and previous config saved to /var/cache/conftool/dbconfig/20250206-082117-root.json [08:22:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117594 (https://phabricator.wikimedia.org/T385185) (owner: 10Pppery) [08:22:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet [08:22:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10527820 (10ops-monitoring-bot) Draining ganeti1038.eqiad.wmnet of running VMs [08:23:15] (03Merged) 10jenkins-bot: Enable section translation on Kanuri Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117594 (https://phabricator.wikimedia.org/T385185) (owner: 10Pppery) [08:23:44] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1117594|Enable section translation on Kanuri Wikipedia (T385185)]] [08:23:46] T385185: Post-creation work for kncwiki - https://phabricator.wikimedia.org/T385185 [08:24:39] !log rebalance codfw/B following OS updates T382508 [08:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:41] T382508: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508 [08:25:33] (03CR) 10Elukey: [C:03+2] conftool-data: add wikikube workers to kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1117568 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [08:26:42] !log 
kartik@deploy2002 kartik, pppery: Backport for [[gerrit:1117594|Enable section translation on Kanuri Wikipedia (T385185)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:27:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.712s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:28:49] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: name=wikikube-worker1002.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [08:29:35] !log kartik@deploy2002 kartik, pppery: Continuing with sync [08:30:28] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: name=wikikube-worker2001.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [08:31:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [08:31:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T384592)', diff saved to https://phabricator.wikimedia.org/P73281 and previous config saved to /var/cache/conftool/dbconfig/20250206-083145-marostegui.json [08:31:50] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [08:32:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.712s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:35:46] 
(03PS1) 10Muehlenhoff: Switch ganeti1038 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1117839 [08:36:09] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117594|Enable section translation on Kanuri Wikipedia (T385185)]] (duration: 12m 25s) [08:36:12] T385185: Post-creation work for kncwiki - https://phabricator.wikimedia.org/T385185 [08:36:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73282 and previous config saved to /var/cache/conftool/dbconfig/20250206-083623-root.json [08:36:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1040', diff saved to https://phabricator.wikimedia.org/P73283 and previous config saved to /var/cache/conftool/dbconfig/20250206-083654-marostegui.json [08:37:03] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for es1040.eqiad.wmnet [08:37:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2236', diff saved to https://phabricator.wikimedia.org/P73284 and previous config saved to /var/cache/conftool/dbconfig/20250206-083758-marostegui.json [08:38:10] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2236.codfw.wmnet [08:41:11] * kart_ done with config deployments.. 
[08:43:06] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for es1040.eqiad.wmnet [08:44:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: maintenance [08:47:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73285 and previous config saved to /var/cache/conftool/dbconfig/20250206-084703-root.json [08:47:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: maintenance [08:51:01] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2236.codfw.wmnet [08:51:15] PROBLEM - MariaDB Replica Lag: s4 #page on db2236 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 728.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:51:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73286 and previous config saved to /var/cache/conftool/dbconfig/20250206-085129-root.json [08:51:33] ^ I guess silence didn't go through again [08:52:54] sigh [08:53:19] (03CR) 10Gergő Tisza: "[Yes it is](https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/5353024b9fb2d4e9fd379545b0093dab691edb26/wmf-config/CommonSetting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: 10Lucas Werkmeister) [08:53:30] FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:54:15] RECOVERY - MariaDB Replica Lag: s4 
#page on db2236 is OK: OK slave_sql_lag Replication lag: 13.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:54:22] !incidents [08:54:22] 5662 (RESOLVED) db2236 (paged)/MariaDB Replica Lag: s4 (paged) [08:54:22] 5654 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [08:54:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73287 and previous config saved to /var/cache/conftool/dbconfig/20250206-085454-root.json [08:55:19] Amir1: should I blame marostegui ?? [08:55:29] :D [08:55:37] elukey: Nope, icinga :) [08:55:40] we always blame him :P [08:55:42] I have the downtime in front of me [08:55:43] yes yes suuuureee [08:56:01] we love you anyway even if you told us that you didn't downtime [08:56:09] <3 [08:56:18] elukey I have it in front of me in CUMIN host [08:56:39] The cookbook issued the downtime but I guess it wasn't processed (this is happening a lot lately) [08:57:02] :( [08:58:30] RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:59:09] (03PS1) 10Aklapper: idp-test: add Phabricator test instance client [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) [08:59:44] (03CR) 10Aklapper: "Note that this patch is merely a manifestation of my search and copy&paste skills, not of any understanding of configuration parameters." [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper) [09:00:04] jnuche and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250206T0900) [09:00:22] morning, deploying the train in a bit [09:00:35] marostegui: icinga alert or prometheus alert? [09:01:14] (03PS2) 10Arnaudb: rt: removing informations about moscovium [puppet] - 10https://gerrit.wikimedia.org/r/1117529 (https://phabricator.wikimedia.org/T384595) [09:02:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73288 and previous config saved to /var/cache/conftool/dbconfig/20250206-090208-root.json [09:02:12] volans: icinga [09:02:13] also do you have handy the silence ID if it was on AM? [09:03:05] (03CR) 10Btullis: [C:03+1] "Many thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1117187 (https://phabricator.wikimedia.org/T385565) (owner: 10Jcrespo) [09:04:07] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117843 (https://phabricator.wikimedia.org/T382366) [09:04:09] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117843 (https://phabricator.wikimedia.org/T382366) (owner: 10TrainBranchBot) [09:05:06] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117843 (https://phabricator.wikimedia.org/T382366) (owner: 10TrainBranchBot) [09:06:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73289 and previous config saved to /var/cache/conftool/dbconfig/20250206-090634-root.json [09:08:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:10:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73290 and previous config saved to /var/cache/conftool/dbconfig/20250206-090959-root.json [09:10:44] (03CR) 10Arnaudb: [C:03+1] site: remove requesttracker role from host moscovium [puppet] - 10https://gerrit.wikimedia.org/r/1117598 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [09:14:26] (03CR) 10Muehlenhoff: site: remove requesttracker role from host moscovium (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1117598 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [09:14:40] marostegui: the downtime was removed at 08:51:01 (EXTERNAL COMMAND: DEL_DOWNTIME_BY_HOST_NAME;db2236) and the alerts fired after that, the one that paged at 08:51:14 [09:17:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73291 and previous config saved to /var/cache/conftool/dbconfig/20250206-091713-root.json [09:17:26] the cookbook removes the downtime after issuing all the commands, it doesn't wait for the replica to catch up AFAICT [09:18:05] you could add a one-liner to do so with https://doc.wikimedia.org/spicerack/master/api/spicerack.mysql.html#spicerack.mysql.Instance.wait_for_replication [09:19:19] volans: I manually issued one at .47 [09:19:28] (the cookbook should be migrated to spicerack's mysql module as it currently doesn't use it) [09:19:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10527928 (10phaultfinder) [09:19:44] 09:47:56 <+ logmsgbot> !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: maintenance [09:19:49] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: 
name=wikikube-worker2001.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [09:19:54] marostegui: doesn't matter, the delete of a downtime in icinga removes all downtimes [09:19:54] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker1002.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [09:20:12] IIRC from the command file there is no way to remove just the one you issued [09:20:33] like in alertmanager where we have the ID of the silence [09:21:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:21:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73293 and previous config saved to /var/cache/conftool/dbconfig/20250206-092139-root.json [09:21:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1129:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:25:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73294 and previous config saved to /var/cache/conftool/dbconfig/20250206-092504-root.json [09:26:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:26:40] FIRING: [5x] KubernetesRsyslogDown: rsyslog on wikikube-worker1046:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:32:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73295 and previous config saved to /var/cache/conftool/dbconfig/20250206-093218-root.json [09:33:29] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.15 refs T382366 [09:33:33] T382366: 1.44.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T382366 [09:36:40] FIRING: [5x] KubernetesRsyslogDown: rsyslog on wikikube-worker1046:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:37:41] bacula clogging again at the start of the month, should clear soon [09:40:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73296 and previous config saved to /var/cache/conftool/dbconfig/20250206-094009-root.json [09:41:25] (03CR) 10AikoChou: [C:03+2] ml-services: increase cpu and memory for reference-quality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117585 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [09:41:38] (03CR) 10Arnaudb: [C:03+1] "this could be merged early as monitoring should not be needed once you're done exporting?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1117580 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [09:42:09] marostegui: sorry I stand corrected, there is a command to delete a downtime by ID but there is no easy way to get the downtime ID when setting it via the command file. We would have to parse the status.dat file to search for it. If it's important, it's something that can be explored [09:42:39] (03Merged) 10jenkins-bot: ml-services: increase cpu and memory for reference-quality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117585 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [09:47:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73297 and previous config saved to /var/cache/conftool/dbconfig/20250206-094724-root.json [09:48:57] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1117631 (https://phabricator.wikimedia.org/T384118) (owner: 10Andrew Bogott) [09:52:17] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
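Editor's note on the 09:42:09 remark about parsing status.dat: the following is only a rough sketch of what that exploration might look like, not anything the cookbook or spicerack does today. It assumes a Nagios-style status.dat with `hostdowntime { ... }` and `servicedowntime { ... }` blocks of `key=value` lines, and it assumes that "the downtime you issued" can be recognised by its `author` field, which may not hold in practice. The helper names, the sample text, the IDs and the author values are all made up for illustration.

```python
from typing import Dict, Iterator, List

# Hypothetical excerpt in the assumed Nagios-style status.dat layout.
SAMPLE = """\
hoststatus {
    host_name=db2236
    }
hostdowntime {
    host_name=db2236
    downtime_id=41
    author=cookbook
    }
hostdowntime {
    host_name=db2236
    downtime_id=42
    author=marostegui
    }
servicedowntime {
    host_name=db2236
    service_description=MariaDB Replica Lag: s1
    downtime_id=43
    author=marostegui
    }
"""


def iter_downtimes(status_text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per hostdowntime/servicedowntime block, ignoring other blocks."""
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line in ("hostdowntime {", "servicedowntime {"):
            current = {"block_type": line.split()[0]}
        elif line == "}" and current is not None:
            yield current
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value


def delete_commands(status_text: str, host: str, author: str, now: int) -> List[str]:
    """Build delete-by-ID external commands for one host and one author (assumed filter)."""
    commands = []
    for dt in iter_downtimes(status_text):
        if dt.get("host_name") != host or dt.get("author") != author:
            continue
        verb = "DEL_HOST_DOWNTIME" if dt["block_type"] == "hostdowntime" else "DEL_SVC_DOWNTIME"
        commands.append(f"[{now}] {verb};{dt['downtime_id']}")
    return commands


if __name__ == "__main__":
    for command in delete_commands(SAMPLE, "db2236", "marostegui", 1738835229):
        print(command)
```

The sketch only builds the command strings; writing them to the command file is left out on purpose. Filtering by author is one possible heuristic, and matching on the downtime comment or entry time could work equally well if those are more reliable.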
[09:53:00] (03PS1) 10Jelto: trafficserver: keep querybuilder path [puppet] - 10https://gerrit.wikimedia.org/r/1117851 (https://phabricator.wikimedia.org/T385728) [09:55:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73298 and previous config saved to /var/cache/conftool/dbconfig/20250206-095515-root.json [09:58:01] (03CR) 10Jelto: [C:03+2] trafficserver: keep querybuilder path [puppet] - 10https://gerrit.wikimedia.org/r/1117851 (https://phabricator.wikimedia.org/T385728) (owner: 10Jelto) [10:06:59] (03CR) 10JMeybohm: Add interative.ask_yesno (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1115767 (owner: 10JMeybohm) [10:12:25] (03CR) 10Clément Goubert: kube-state-metrics: export extra jobs labels (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117574 (https://phabricator.wikimedia.org/T385709) (owner: 10Kamila Součková) [10:15:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73299 and previous config saved to /var/cache/conftool/dbconfig/20250206-101538-root.json [10:23:29] (03PS1) 10Jgiannelos: changeprop: Add testwiki rule for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117856 [10:29:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10528111 (10phaultfinder) [10:29:51] (03PS2) 10Jgiannelos: changeprop: Add testwiki rule for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117856 (https://phabricator.wikimedia.org/T385719) [10:30:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73300 and previous config saved to /var/cache/conftool/dbconfig/20250206-103044-root.json [10:37:50] (03CR) 
10Clément Goubert: "I'm in the process of documenting it on wikitech, but I will probably change this CR to be only the include first, so that all no-op resou" [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [10:45:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73301 and previous config saved to /var/cache/conftool/dbconfig/20250206-104549-root.json [10:47:51] (03PS18) 10Clément Goubert: mediawiki: Prepare P:mediawiki::maintenance::growthexperiments [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) [10:47:52] (03PS1) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) [10:52:55] (03CR) 10Muehlenhoff: [C:03+2] Make maps-test2002 a bookworm maps replica [puppet] - 10https://gerrit.wikimedia.org/r/1115850 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:53:15] (03PS3) 10Muehlenhoff: Enable maps-test2003 to maps-test2006 as additional maps bookworm replicas [puppet] - 10https://gerrit.wikimedia.org/r/1115863 (https://phabricator.wikimedia.org/T381565) [10:53:31] (03CR) 10Hnowlan: [C:03+1] mobileapps: Fix typo in event stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117591 (https://phabricator.wikimedia.org/T385718) (owner: 10Jgiannelos) [10:55:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73303 and previous config saved to /var/cache/conftool/dbconfig/20250206-105536-root.json [10:57:52] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Fix typo in event stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117591 (https://phabricator.wikimedia.org/T385718) (owner: 
10Jgiannelos) [10:58:42] (03PS1) 10Hnowlan: changeprop: add support for multiple wikis in pcs prerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117865 (https://phabricator.wikimedia.org/T385719) [10:59:12] (03Merged) 10jenkins-bot: mobileapps: Fix typo in event stream name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117591 (https://phabricator.wikimedia.org/T385718) (owner: 10Jgiannelos) [11:00:17] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250206T1100) [11:00:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73306 and previous config saved to /var/cache/conftool/dbconfig/20250206-110054-root.json [11:04:51] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [11:04:54] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [11:04:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:06:13] (03CR) 10Hnowlan: "lgtm, one nit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117856 (https://phabricator.wikimedia.org/T385719) (owner: 10Jgiannelos) [11:07:24] (03Abandoned) 10Hnowlan: changeprop: add support for multiple wikis in pcs prerender [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117865 (https://phabricator.wikimedia.org/T385719) (owner: 10Hnowlan) [11:10:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 25%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P73307 and previous config saved to /var/cache/conftool/dbconfig/20250206-111041-root.json [11:11:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:16:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1194 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73308 and previous config saved to /var/cache/conftool/dbconfig/20250206-111559-root.json [11:17:54] (03PS1) 10Jgiannelos: changeprop: Add testwiki rule for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117871 [11:19:52] (03PS1) 10AikoChou: admin_ng: bump limitranges for ml-serve's revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117873 (https://phabricator.wikimedia.org/T384172) [11:20:44] (03PS3) 10Jgiannelos: changeprop: Add testwiki rule for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117856 (https://phabricator.wikimedia.org/T385719) [11:20:48] (03Abandoned) 10Jgiannelos: changeprop: Add testwiki rule for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117871 (owner: 10Jgiannelos) [11:21:18] (03CR) 10Jgiannelos: changeprop: Add testwiki rule for native PCS pregeneration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117856 (https://phabricator.wikimedia.org/T385719) (owner: 10Jgiannelos) [11:25:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73309 and previous config saved to /var/cache/conftool/dbconfig/20250206-112546-root.json [11:27:11] (03CR) 10Klausman: "A quick live check shows this would change:" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1117873 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [11:27:59] (03PS1) 10Brouberol: airflow: restore log upload to s3 by disabling botocore feature not implemented in radosgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117874 (https://phabricator.wikimedia.org/T385785) [11:29:56] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10528284 (10phaultfinder) [11:31:07] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117875 [11:32:41] 10ops-eqiad, 06SRE, 06DC-Ops: analytics1073 is unreachable since eight days - https://phabricator.wikimedia.org/T385786 (10MoritzMuehlenhoff) 03NEW [11:32:41] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s5 [11:32:52] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=s8 [11:34:00] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Rebooting clouddb1016 T384946 [11:35:05] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:38:10] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117875 (owner: 10PipelineBot) [11:39:18] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117875 (owner: 10PipelineBot) [11:40:27] !log installing iperf3 security updates [11:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73310 and previous config saved to 
/var/cache/conftool/dbconfig/20250206-114051-root.json [11:41:34] (03CR) 10Klausman: [V:03+2 C:03+2] admin_ng: bump limitranges for ml-serve's revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117873 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [11:43:21] (03CR) 10Hnowlan: [C:03+1] mediawiki: Prepare P:mediawiki::maintenance::growthexperiments [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [11:43:48] (03CR) 10Clément Goubert: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [11:45:07] (03CR) 10Hnowlan: [C:03+1] changeprop: Add testwiki rule for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117856 (https://phabricator.wikimedia.org/T385719) (owner: 10Jgiannelos) [11:45:39] (03Merged) 10jenkins-bot: admin_ng: bump limitranges for ml-serve's revision-models namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117873 (https://phabricator.wikimedia.org/T384172) (owner: 10AikoChou) [11:46:05] (03CR) 10Jgiannelos: [C:03+2] changeprop: Add testwiki rule for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117856 (https://phabricator.wikimedia.org/T385719) (owner: 10Jgiannelos) [11:46:40] (03CR) 10MVernon: [C:03+2] swift: remove ms-be205[1-6] from profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1117536 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon) [11:47:25] (03Merged) 10jenkins-bot: changeprop: Add testwiki rule for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117856 (https://phabricator.wikimedia.org/T385719) (owner: 10Jgiannelos) [11:48:31] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1016.eqiad.wmnet [11:49:10] !log jgiannelos@deploy2002 helmfile [staging] START 
helmfile.d/services/mobileapps: apply [11:49:39] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:49:56] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [11:50:43] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [11:50:54] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:51:29] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:51:41] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [11:51:47] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1016.eqiad.wmnet [11:51:49] PROBLEM - Host clouddb1016 is DOWN: PING CRITICAL - Packet loss = 100% [11:51:49] RECOVERY - Host clouddb1016 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [11:52:00] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[11:52:09] PROBLEM - MariaDB read only wikireplica-s8 on clouddb1016 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:52:15] PROBLEM - MariaDB read only wikireplica-s5 on clouddb1016 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:52:21] PROBLEM - MariaDB read only s5 on clouddb1016 is CRITICAL: Could not connect to localhost:3315 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:52:21] PROBLEM - MariaDB read only s8 on clouddb1016 is CRITICAL: Could not connect to localhost:3318 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:52:39] PROBLEM - mysqld processes on clouddb1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:52:49] PROBLEM - MariaDB Replica SQL: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:49] PROBLEM - MariaDB Replica IO: s8 on clouddb1016 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:49] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:49] PROBLEM - MariaDB Replica SQL: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:49] PROBLEM - MariaDB Replica IO: s5 on clouddb1016 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:50] 
PROBLEM - MariaDB Replica Lag: s8 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:53:12] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:53:26] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:53:58] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [11:54:39] RECOVERY - mysqld processes on clouddb1016 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:55:11] RECOVERY - MariaDB read only wikireplica-s8 on clouddb1016 is OK: Version 10.6.20-MariaDB, Uptime 32s, read_only: True, event_scheduler: False, 770.12 QPS, connection latency: 0.029189s, query latency: 0.000546s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:55:17] RECOVERY - MariaDB read only wikireplica-s5 on clouddb1016 is OK: Version 10.6.20-MariaDB, Uptime 41s, read_only: True, event_scheduler: False, 1472.50 QPS, connection latency: 0.022292s, query latency: 0.000522s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:55:21] RECOVERY - MariaDB read only s8 on clouddb1016 is OK: Version 10.6.20-MariaDB, Uptime 43s, read_only: True, event_scheduler: False, 1481.47 QPS, connection latency: 0.024401s, query latency: 0.000370s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:55:21] RECOVERY - MariaDB read only s5 on clouddb1016 is OK: Version 10.6.20-MariaDB, Uptime 46s, read_only: True, event_scheduler: False, 800.57 QPS, connection latency: 0.017513s, query latency: 0.000304s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:55:49] RECOVERY - MariaDB Replica SQL: s8 on clouddb1016 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:55:49] RECOVERY - MariaDB Replica IO: s5 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:55:49] RECOVERY - MariaDB Replica SQL: s5 on clouddb1016 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:55:49] RECOVERY - MariaDB Replica IO: s8 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:55:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2208 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73311 and previous config saved to /var/cache/conftool/dbconfig/20250206-115556-root.json [11:56:49] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:56:49] RECOVERY - MariaDB Replica Lag: s8 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:42] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:00:20] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Prepare P:mediawiki::maintenance::growthexperiments [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [12:04:16] (03PS2) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) [12:04:20] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément 
Goubert) [12:06:34] !log installing bind9 security updates (client-side libs/tools only) [12:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1046:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:11:24] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [12:11:33] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:16:08] jouncebot: nowandnext [12:16:08] No deployments scheduled for the next 0 hour(s) and 43 minute(s) [12:16:08] In 0 hour(s) and 43 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250206T1300) [12:16:14] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117884 [12:17:06] (03CR) 10Ladsgroup: [C:03+2] Set categorylinks to write both everywhere except commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117521 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [12:17:47] (03Merged) 10jenkins-bot: Set categorylinks to write both everywhere except commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117521 (https://phabricator.wikimedia.org/T385164) (owner: 10Ladsgroup) [12:18:14] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1117521|Set categorylinks to write both everywhere except commonswiki (T385164)]] [12:18:17] T385164: Set categorylinks to write both - https://phabricator.wikimedia.org/T385164 [12:21:06] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1117521|Set categorylinks to write both everywhere except commonswiki (T385164)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:21:40] !log klausman@deploy2002 helmfile 
[ml-serve-codfw] START helmfile.d/admin 'apply'. [12:22:39] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:23:25] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:27:29] (03PS3) 10Kamila Součková: kube-state-metrics: export extra jobs labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117574 (https://phabricator.wikimedia.org/T385709) [12:27:37] (03CR) 10Kamila Součková: kube-state-metrics: export extra jobs labels (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117574 (https://phabricator.wikimedia.org/T385709) (owner: 10Kamila Součková) [12:27:56] !log installing openjpeg2 security updates [12:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:05] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117521|Set categorylinks to write both everywhere except commonswiki (T385164)]] (duration: 11m 50s) [12:30:08] T385164: Set categorylinks to write both - https://phabricator.wikimedia.org/T385164 [12:30:26] 06SRE, 06Traffic, 07Wikimedia-production-error: 503 error when edit large size pages - https://phabricator.wikimedia.org/T385395#10528517 (10Ahonc) Are there any news about this case? I still can edit such pages only using API on remote server. It is difficult and uncomfortable. 
[12:33:57] (03CR) 10Clément Goubert: [C:03+1] kube-state-metrics: export extra jobs labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117574 (https://phabricator.wikimedia.org/T385709) (owner: 10Kamila Součková) [12:34:36] (03CR) 10Kamila Součková: [C:03+2] kube-state-metrics: export extra jobs labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117574 (https://phabricator.wikimedia.org/T385709) (owner: 10Kamila Součková) [12:38:01] (03PS1) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 [12:38:40] (03Merged) 10jenkins-bot: kube-state-metrics: export extra jobs labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117574 (https://phabricator.wikimedia.org/T385709) (owner: 10Kamila Součková) [12:38:59] (03PS2) 10Clément Goubert: mediawiki: Pass down CronJob description to Job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117895 (https://phabricator.wikimedia.org/T385709) [12:39:35] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [12:40:03] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [12:40:23] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:40:52] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[12:41:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet [12:43:44] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1038.eqiad.wmnet with reason: remove from cluster for reimage [12:43:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10528562 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ab4b79bc-3dbe-4d67-a421-882ba2ecce42) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... [12:44:11] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1038 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1117839 (owner: 10Muehlenhoff) [12:44:59] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:45:23] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:45:27] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [12:45:49] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[12:45:52] (03CR) 10Hnowlan: [C:03+1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 (owner: 10Muehlenhoff) [12:46:00] (03CR) 10Elukey: sysctl: Introduce base::sysctl::inotify helper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [12:48:52] (03CR) 10CI reject: [V:04-1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 (owner: 10Muehlenhoff) [12:55:58] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s8 [12:56:01] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=s5 [12:57:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2159 db1191 T385550', diff saved to https://phabricator.wikimedia.org/P73312 and previous config saved to /var/cache/conftool/dbconfig/20250206-125713-marostegui.json [12:57:17] T385550: Upgrade and rebuild s7 - https://phabricator.wikimedia.org/T385550 [12:57:24] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2159.codfw.wmnet [12:57:29] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1191.eqiad.wmnet [12:57:59] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:58:20] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:58:22] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10528624 (10cmooney) 05Open→03Resolved This link has had a reasonable amount of traffic since the move and still error free so I a... [12:58:36] (03CR) 10Btullis: [C:03+1] "Thanks for this." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1117874 (https://phabricator.wikimedia.org/T385785) (owner: Brouberol)
[12:58:51] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10528627 (cmooney) p:High→Low
[12:59:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1038.eqiad.wmnet with OS bookworm
[12:59:40] (CR) Kamila Součková: [C:+1] mediawiki: Pass down CronJob description to Job [deployment-charts] - https://gerrit.wikimedia.org/r/1117895 (https://phabricator.wikimedia.org/T385709) (owner: Clément Goubert)
[12:59:45] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10528628 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1038.eqiad.wmnet with OS bookworm
[13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250206T1300)
[13:00:52] (CR) Btullis: envoy: add the analytics-web service to the mesh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1116760 (https://phabricator.wikimedia.org/T384329) (owner: Brouberol)
[13:04:12] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1191.eqiad.wmnet
[13:04:30] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2159.codfw.wmnet
[13:04:43] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Index rebuild
[13:04:52] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Index rebuild
[13:06:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:17] (CR) Lucas Werkmeister: "I see, thanks – then I think the reason I couldn’t test it successfully on WikimediaDebug is that the preflight request still went to a re" [mediawiki-config] - https://gerrit.wikimedia.org/r/1116795 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[13:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:13:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1235 db2188 T385561', diff saved to https://phabricator.wikimedia.org/P73313 and previous config saved to /var/cache/conftool/dbconfig/20250206-131300-marostegui.json
[13:13:03] T385561: Upgrade and rebuild s1 - https://phabricator.wikimedia.org/T385561
[13:13:12] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1235.eqiad.wmnet
[13:13:19] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2188.codfw.wmnet
[13:17:14] (PS8) Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper [puppet] - https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: BryanDavis)
[13:17:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:17:41] (CR) Clément Goubert: [C:+2] mediawiki: Pass down CronJob description to Job [deployment-charts] - https://gerrit.wikimedia.org/r/1117895 (https://phabricator.wikimedia.org/T385709) (owner: Clément Goubert)
[13:17:46] (CR) Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: BryanDavis)
[13:17:52] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: BryanDavis)
[13:18:21] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2188.codfw.wmnet
[13:18:43] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2188.codfw.wmnet with reason: Index rebuild
[13:18:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1038.eqiad.wmnet with reason: host reimage
[13:18:46] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1235.eqiad.wmnet
[13:19:11] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1235.eqiad.wmnet with reason: Index rebuild
[13:19:44] (Merged) jenkins-bot: mediawiki: Pass down CronJob description to Job [deployment-charts] - https://gerrit.wikimedia.org/r/1117895 (https://phabricator.wikimedia.org/T385709) (owner: Clément Goubert)
[13:21:17] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: sync
[13:21:20] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: sync
[13:22:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1038.eqiad.wmnet with reason: host reimage
[13:24:18] (PS9) Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper [puppet] - https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: BryanDavis)
[13:24:30] (CR) Lucas Werkmeister (WMDE): "recheck (old diffConfig build expired)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1087584 (https://phabricator.wikimedia.org/T356294) (owner: Tchanders)
[13:24:41] (CR) CI reject: [V:-1] sysctl: Introduce base::sysctl::inotify helper [puppet] - https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: BryanDavis)
[13:26:26] (CR) Kamila Součková: [C:+1] mediawiki: Migrate one dry-run job to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1117862 (https://phabricator.wikimedia.org/T385782) (owner: Clément Goubert)
[13:29:21] (PS10) Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper [puppet] - https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: BryanDavis)
[13:31:48] (Abandoned) Sohom Datta: Fix regression with re-enabling button after error [extensions/PageTriage] (wmf/1.44.0-wmf.14) - https://gerrit.wikimedia.org/r/1116824 (https://phabricator.wikimedia.org/T385355) (owner: Sohom Datta)
[13:32:20] !log cgoubert@deploy2002 Started scap sync-world: no-op deploy to clean up diff
[13:34:24] !log cgoubert@deploy2002 Finished scap sync-world: no-op deploy to clean up diff (duration: 02m 59s)
[13:37:09] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:39:33] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:05]