[00:10:54] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 634.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:13:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T371742)', diff saved to https://phabricator.wikimedia.org/P73196 and previous config saved to /var/cache/conftool/dbconfig/20250205-001309-ladsgroup.json
[00:13:13] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[00:18:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1257:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1257 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:23:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1257:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1257 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:30:09] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1116889 (owner: TrainBranchBot)
[00:32:14] (PS1) Scott French: mw-api-int: serve 5% of traffic on PHP 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1117263 (https://phabricator.wikimedia.org/T383845)
[00:32:15] (PS1) Scott French: mw-(api-ext|web): scale next to 25% of main [deployment-charts] - https://gerrit.wikimedia.org/r/1117271 (https://phabricator.wikimedia.org/T383845)
[00:32:17] (PS1) Scott French: Enroll 50% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1117276 (https://phabricator.wikimedia.org/T383845)
[00:38:26] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1117289
[00:38:26] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1117289 (owner: TrainBranchBot)
[00:49:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1117289 (owner: TrainBranchBot)
[01:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1116890 (owner: TrainBranchBot)
[01:08:23] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1117295
[01:08:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1117295 (owner: TrainBranchBot)
[01:28:29] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Dyolf77 /tmp/uploads # T385642
[01:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:32] T385642: Server side upload for Dyolf77 - https://phabricator.wikimedia.org/T385642
[01:28:49] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1117295 (owner: TrainBranchBot)
[01:40:16] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:40:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:43:20] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:44:10] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:46:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/d754a861a3040321cd1fff53ffa354ec3fc7cde0db1a7c0f0e9b908053449561/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:47:20] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:47:50] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53515 bytes in 1.140 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:48:06] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:48:10] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:48:30] the releases1003 "disk space" issue isn't actually one. it's permissions to the docker overlay filesystem stuff.. as had to be fixed many times before
[01:49:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T384592)', diff saved to https://phabricator.wikimedia.org/P73197 and previous config saved to /var/cache/conftool/dbconfig/20250205-014907-marostegui.json
[01:49:11] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[02:04:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P73198 and previous config saved to /var/cache/conftool/dbconfig/20250205-020414-marostegui.json
[02:06:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:06:52] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 48.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:08:06] PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:09:06] RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:19:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P73199 and previous config saved to /var/cache/conftool/dbconfig/20250205-021921-marostegui.json
[02:34:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T384592)', diff saved to https://phabricator.wikimedia.org/P73200 and previous config saved to /var/cache/conftool/dbconfig/20250205-023428-marostegui.json
[02:34:32] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[02:34:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:07:16] PROBLEM - Disk space on ms-be2051 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sde1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2051&var-datasource=codfw+prometheus/ops
[04:11:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:16:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:40:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:12:41] (CR) Ecarg: [C:+2] wikifunctions: Upgrade function-orchestrator RAM request, given heap issues [deployment-charts] - https://gerrit.wikimedia.org/r/1117243 (https://phabricator.wikimedia.org/T384883) (owner: Jforrester)
[05:12:59] (CR) Ecarg: [C:+2] "thank youu" [deployment-charts] - https://gerrit.wikimedia.org/r/1117243 (https://phabricator.wikimedia.org/T384883) (owner: Jforrester)
[05:13:53] (Merged) jenkins-bot: wikifunctions: Upgrade function-orchestrator RAM request, given heap issues [deployment-charts] - https://gerrit.wikimedia.org/r/1117243 (https://phabricator.wikimedia.org/T384883) (owner: Jforrester)
[05:17:04] (PS2) KartikMistry: Update cxserver to 2025-02-03-095815-production [deployment-charts] - https://gerrit.wikimedia.org/r/1116912 (https://phabricator.wikimedia.org/T377966)
[05:17:42] Updating cxserver in a few minutes..
[05:18:34] (CR) KartikMistry: [C:+2] Update cxserver to 2025-02-03-095815-production [deployment-charts] - https://gerrit.wikimedia.org/r/1116912 (https://phabricator.wikimedia.org/T377966) (owner: KartikMistry)
[05:19:41] (Merged) jenkins-bot: Update cxserver to 2025-02-03-095815-production [deployment-charts] - https://gerrit.wikimedia.org/r/1116912 (https://phabricator.wikimedia.org/T377966) (owner: KartikMistry)
[05:19:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:20:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:31:06] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:31:34] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:41:58] (PS2) KartikMistry: Make MT limit more strict by 10 Percentage Point in Bhojpuri Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1117113 (https://phabricator.wikimedia.org/T383789)
[05:42:15] (CR) KartikMistry: Make MT limit more strict by 10 Percentage Point in Bhojpuri Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1117113 (https://phabricator.wikimedia.org/T383789) (owner: KartikMistry)
[05:43:42] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:44:13] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:49:23] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:49:57] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:50:28] !log Updated cxserver to 2025-02-03-095815-production (T377966, T385185)
[05:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:32] T377966: Make cxserver Logstash logs readable and reliable - https://phabricator.wikimedia.org/T377966
[05:50:33] T385185: Post-creation work for kncwiki - https://phabricator.wikimedia.org/T385185
[05:57:23] (PS1) Kevin Bazira: ml-services: update article-country prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1117318 (https://phabricator.wikimedia.org/T382295)
[06:15:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:23:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:39:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[06:39:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T384592)', diff saved to https://phabricator.wikimedia.org/P73201 and previous config saved to /var/cache/conftool/dbconfig/20250205-063911-marostegui.json
[06:39:15] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[06:40:42] (PS2) Anzx: kywiki: create draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1117321 (https://phabricator.wikimedia.org/T385593)
[06:50:03] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1117321 (https://phabricator.wikimedia.org/T385593) (owner: Anzx)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T0700)
[07:09:55] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:47:14] PROBLEM - BFD status on cr2-magru is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:47:18] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:48:14] RECOVERY - BFD status on cr2-magru is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:48:18] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:48:28] PROBLEM - MariaDB Replica Lag: s1 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86343.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:49:58] PROBLEM - MariaDB Replica Lag: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86217.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:49:58] PROBLEM - MariaDB Replica Lag: s7 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 85268.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:50:48] PROBLEM - MariaDB Replica Lag: s1 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 75018.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:55:48] (CR) Elukey: [C:+1] external_cloud_vendors: Added OpenAI IP lists [puppet] - https://gerrit.wikimedia.org/r/1117245 (https://phabricator.wikimedia.org/T385616) (owner: Fabfur)
[08:00:04] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T0800)
[08:00:05] Jhs and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:22] hiya, i'm here
[08:03:26] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10524037 (elukey) I double checked via Redfish and `P1_AIOMAOC_AG_i2LAN1OPROM` is set to `PXE` (as expected).
[08:09:19] (CR) Fabfur: [C:+2] external_cloud_vendors: Added OpenAI IP lists [puppet] - https://gerrit.wikimedia.org/r/1117245 (https://phabricator.wikimedia.org/T385616) (owner: Fabfur)
[08:09:43] o/
[08:12:58] (PS1) Filippo Giunchedi: hieradata: fix o11y wmcloud idp-test access [puppet] - https://gerrit.wikimedia.org/r/1117488
[08:14:45] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10524070 (ayounsi) Sure, as usual for power/console/mgmt. Regarding production ports : On the ssw1 side: `use `et-0/0/7` towards e8 and `et-0/0/15` tow...
[08:17:10] (PS1) Aklapper: Phabricator: Disable weekly 2fa mail [puppet] - https://gerrit.wikimedia.org/r/1117489 (https://phabricator.wikimedia.org/T304792)
[08:23:03] (CR) Elukey: [C:+1] hieradata: fix o11y wmcloud idp-test access [puppet] - https://gerrit.wikimedia.org/r/1117488 (owner: Filippo Giunchedi)
[08:27:03] (CR) Filippo Giunchedi: [C:+2] hieradata: fix o11y wmcloud idp-test access [puppet] - https://gerrit.wikimedia.org/r/1117488 (owner: Filippo Giunchedi)
[08:35:34] (PS1) Elukey: knative: backport https://github.com/knative/serving/pull/13402 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1117492 (https://phabricator.wikimedia.org/T369493)
[08:41:50] (CR) Jelto: [C:+2] "I'll merge this and monitor the metrics for query-main and query service gui closely." [puppet] - https://gerrit.wikimedia.org/r/1115766 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[08:55:07] (PS6) Fabfur: hiera: enable json logging for benthos [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392)
[08:55:38] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[08:59:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:00:05] jnuche and jeena: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T0900)
[09:00:30] hi there, rolling out the train in a few minutes
[09:01:58] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) (owner: Jon Harald Søby)
[09:02:58] (PS7) Fabfur: hiera: enable json logging for benthos [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392)
[09:03:00] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1117321 (https://phabricator.wikimedia.org/T385593) (owner: Anzx)
[09:03:32] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[09:03:46] (PS1) TrainBranchBot: group1 to 1.44.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1117494 (https://phabricator.wikimedia.org/T382366)
[09:03:47] (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1117494 (https://phabricator.wikimedia.org/T382366) (owner: TrainBranchBot)
[09:04:34] (Merged) jenkins-bot: group1 to 1.44.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/1117494 (https://phabricator.wikimedia.org/T382366) (owner: TrainBranchBot)
[09:09:28] (PS8) Fabfur: hiera: enable json logging for benthos [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392)
[09:12:01] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[09:13:54] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.15 refs T382366
[09:13:57] T382366: 1.44.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T382366
[09:19:44] (PS9) Fabfur: hiera: enable json logging for benthos [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392)
[09:21:11] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[09:22:36] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[09:31:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[1155-1156].eqiad.wmnet with reason: Rebuild tables
[09:31:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1156 for index rebuild', diff saved to https://phabricator.wikimedia.org/P73202 and previous config saved to /var/cache/conftool/dbconfig/20250205-093152-marostegui.json
[09:32:04] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1156.eqiad.wmnet
[09:32:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1014.eqiad.wmnet with reason: Rebuild tables
[09:32:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Rebuild tables
[09:34:10] (CR) Ilias Sarantopoulos: [C:+1] knative: backport https://github.com/knative/serving/pull/13402 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1117492 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[09:34:32] (CR) Ilias Sarantopoulos: [C:+1] ml-services: update article-country prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1117318 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[09:38:24] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1156.eqiad.wmnet
[09:39:02] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Index rebuild
[09:41:14] (PS1) Marostegui: installserver: Do not format db1250 [puppet] - https://gerrit.wikimedia.org/r/1117497
[09:42:44] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 544.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:43:37] (CR) Marostegui: [C:+2] installserver: Do not format db1250 [puppet] - https://gerrit.wikimedia.org/r/1117497 (owner: Marostegui)
[09:46:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1018.eqiad.wmnet with reason: Rebuild tables
[09:46:37] (CR) Kevin Bazira: [C:+2] "thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1117318 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[09:47:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling after cloning db1251', diff saved to https://phabricator.wikimedia.org/P73203 and previous config saved to /var/cache/conftool/dbconfig/20250205-094711-fceratto.json
[09:47:46] (Merged) jenkins-bot: ml-services: update article-country prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1117318 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[09:52:25] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[09:54:15] (PS2) Elukey: knative: backport https://github.com/knative/serving/pull/13402 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1117492 (https://phabricator.wikimedia.org/T369493)
[09:54:49] (CR) Elukey: [V:+2 C:+2] knative: backport https://github.com/knative/serving/pull/13402 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1117492 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[09:55:59] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[09:56:43] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ms-be2075.codfw.wmnet with reason: hardware broken awaiting vendor action
[09:56:51] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10524282 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a9517ffa-d053-4e3b-a7d0-6b08948ed456) set by mvernon@cumin2002 for 7 days, 0:00:00 on 1 host(s) and t...
[09:58:00] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ms-be2051.codfw.wmnet with reason: disk failed, due decom soon
[09:58:09] SRE, SRE-swift-storage, Patch-For-Review: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056#10524286 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=837a92b8-0555-4a3d-bd8e-9aefd3493691) set by mvernon@cumin2002 for 2 days, 0:00:00 on 1 host(s) and th...
[09:59:43] (PS1) Jelto: trafficserver: move /querybuilder before catch-all [puppet] - https://gerrit.wikimedia.org/r/1117498 (https://phabricator.wikimedia.org/T350793)
[09:59:52] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29349 bytes in 0.322 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:02:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repooling after cloning db1251', diff saved to https://phabricator.wikimedia.org/P73205 and previous config saved to /var/cache/conftool/dbconfig/20250205-100216-fceratto.json
[10:04:53] (PS1) Elukey: admin_ng: update Knative docker images [deployment-charts] - https://gerrit.wikimedia.org/r/1117500
[10:06:08] (PS1) Federico Ceratto: db1251.yaml: enable monitoring [puppet] - https://gerrit.wikimedia.org/r/1117501 (https://phabricator.wikimedia.org/T385141)
[10:12:27] (CR) Elukey: [C:+2] admin_ng: update Knative docker images [deployment-charts] - https://gerrit.wikimedia.org/r/1117500 (owner: Elukey)
[10:13:16] (CR) Elukey: [V:+2 C:+2] admin_ng: update Knative docker images [deployment-charts] - https://gerrit.wikimedia.org/r/1117500 (owner: Elukey)
[10:13:30] (CR) Marostegui: [C:+1] db1251.yaml: enable monitoring [puppet] - https://gerrit.wikimedia.org/r/1117501 (https://phabricator.wikimedia.org/T385141) (owner: Federico Ceratto)
[10:14:19] jnuche: we have a visual regression that is fairly visible (T385542). we have a fix already, OK to deploy it?
[10:14:20] T385542: [testwiki-wmf.15] Add link inspector elements are misaligned - https://phabricator.wikimedia.org/T385542
[10:14:40] !log restarting blazegraph on wdqs1012 (BlazegraphFreeAllocatorsDecreasingRapidly)
[10:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:18] urbanecm: yes please, go ahead
[10:15:37] proceeding, thanks!
[10:16:00] (PS1) Urbanecm: fix(AddLink): button should show after link preview [extensions/GrowthExperiments] (wmf/1.44.0-wmf.15) - https://gerrit.wikimedia.org/r/1117502 (https://phabricator.wikimedia.org/T385542)
[10:16:05] (CR) Urbanecm: [C:+2] fix(AddLink): button should show after link preview [extensions/GrowthExperiments] (wmf/1.44.0-wmf.15) - https://gerrit.wikimedia.org/r/1117502 (https://phabricator.wikimedia.org/T385542) (owner: Urbanecm)
[10:16:21] (CR) Federico Ceratto: [C:+2] db1251.yaml: enable monitoring [puppet] - https://gerrit.wikimedia.org/r/1117501 (https://phabricator.wikimedia.org/T385141) (owner: Federico Ceratto)
[10:17:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repooling after cloning db1251', diff saved to https://phabricator.wikimedia.org/P73207 and previous config saved to /var/cache/conftool/dbconfig/20250205-101721-fceratto.json
[10:17:25] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:18:54] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:20:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1202, db2221 for index rebuild', diff saved to https://phabricator.wikimedia.org/P73208 and previous config saved to /var/cache/conftool/dbconfig/20250205-102012-marostegui.json
[10:20:19] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2221.codfw.wmnet
[10:20:30] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db1202.eqiad.wmnet
[10:20:50] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for db1251.eqiad.wmnet
[10:20:51] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1251.eqiad.wmnet
[10:21:32] (CR) FNegri: [C:+1] "Adding a +1 after merge, this makes sense to me." [puppet] - https://gerrit.wikimedia.org/r/1116868 (https://phabricator.wikimedia.org/T383370) (owner: Andrew Bogott)
[10:23:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[10:25:51] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2221.codfw.wmnet
[10:26:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 1%: Pooling in new host', diff saved to https://phabricator.wikimedia.org/P73209 and previous config saved to /var/cache/conftool/dbconfig/20250205-102614-fceratto.json
[10:26:23] (PS21) Clément Goubert: mediawiki: Add kubernetes periodic job support [puppet] - https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596)
[10:26:23] (PS11) Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963)
[10:26:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[10:26:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1179 (T385645)', diff saved to https://phabricator.wikimedia.org/P73210 and previous config saved to /var/cache/conftool/dbconfig/20250205-102650-marostegui.json
[10:26:54] T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645
[10:27:05] !log pushing Changeprop patch (k8s values) https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1117063
[10:27:06] (PS1) Clément Goubert: mw-cron: Add puppet-defined periodic jobs file [deployment-charts] - https://gerrit.wikimedia.org/r/1117503 (https://phabricator.wikimedia.org/T385596)
[10:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:11] (CR) Effie Mouzeli: [C:+1] mw-api-int: serve 5% of traffic on PHP 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1117263 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[10:27:11] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1202.eqiad.wmnet
[10:27:28] (PS2) Jelto: trafficserver: move /querybuilder before catch-all [puppet] - https://gerrit.wikimedia.org/r/1117498 (https://phabricator.wikimedia.org/T350793)
[10:27:30] (CR) Effie Mouzeli: [C:+1] mw-(api-ext|web): scale next to 25% of main [deployment-charts] - https://gerrit.wikimedia.org/r/1117271 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[10:27:48] (CR) Effie Mouzeli: [C:+1] Enroll 50% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1117276 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[10:27:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T385645)', diff saved to https://phabricator.wikimedia.org/P73211 and previous config saved to /var/cache/conftool/dbconfig/20250205-102758-marostegui.json
[10:29:01] !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[10:30:22] !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[10:30:32] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[10:31:49] (Merged) jenkins-bot: fix(AddLink): button should show after link preview [extensions/GrowthExperiments] (wmf/1.44.0-wmf.15) - https://gerrit.wikimedia.org/r/1117502 (https://phabricator.wikimedia.org/T385542) (owner: Urbanecm)
[10:32:01] (CR) Clément Goubert: [C:+1] "Actually yeah I completely missed that none of the `Chart.yaml` were bumped, I'm actually surprised it produced a diff in prod for `kartot" [deployment-charts] - https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: Cwhite)
[10:32:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling after cloning db1251', diff saved to https://phabricator.wikimedia.org/P73212 and previous config saved to /var/cache/conftool/dbconfig/20250205-103227-fceratto.json
[10:33:19] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1117502|fix(AddLink): button should show after link preview (T385542)]]
[10:33:21] T385542: [testwiki-wmf.15] Add link inspector elements are misaligned - https://phabricator.wikimedia.org/T385542
[10:33:53] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596) (owner: Clément Goubert)
[10:33:57] (PS1) Effie Mouzeli: shellbox: all replicas on PHP 8.1 (score) [deployment-charts] - https://gerrit.wikimedia.org/r/1117506 (https://phabricator.wikimedia.org/T377038)
[10:34:00] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: Clément Goubert)
[10:35:29] (PS1) Hnowlan: trafficserver: remove restbase from hewiki mobile-html api [puppet] - https://gerrit.wikimedia.org/r/1117508 (https://phabricator.wikimedia.org/T372746)
[10:36:21] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1117502|fix(AddLink): button should show after link preview (T385542)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:37:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1179', diff saved to https://phabricator.wikimedia.org/P73213 and
previous config saved to /var/cache/conftool/dbconfig/20250205-103738-marostegui.json [10:39:00] !log urbanecm@deploy2002 urbanecm: Continuing with sync [10:39:58] RECOVERY - MariaDB Replica Lag: s7 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:43:25] (03PS1) 10Effie Mouzeli: mw-parsoid & mw-jobrunner serve 2% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117511 (https://phabricator.wikimedia.org/T383845) [10:43:46] !log Set x1 to SBR for a bit T385645 [10:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:49] T385645: Drop event_variant column from echo_event - https://phabricator.wikimedia.org/T385645 [10:44:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73214 and previous config saved to /var/cache/conftool/dbconfig/20250205-104423-root.json [10:45:34] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117502|fix(AddLink): button should show after link preview (T385542)]] (duration: 12m 15s) [10:45:37] T385542: [testwiki-wmf.15] Add link inspector elements are misaligned - https://phabricator.wikimedia.org/T385542 [10:45:43] fix should be deployed [10:45:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 5%: Pooling host to 5%', diff saved to https://phabricator.wikimedia.org/P73215 and previous config saved to /var/cache/conftool/dbconfig/20250205-104543-fceratto.json [10:45:54] jnuche: fyi, in case you want to do something other train related [10:47:20] (03CR) 10Clément Goubert: [C:03+1] mw-parsoid & mw-jobrunner serve 2% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117511 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:47:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1169 
(re)pooling @ 100%: Repooling after cloning db1251', diff saved to https://phabricator.wikimedia.org/P73216 and previous config saved to /var/cache/conftool/dbconfig/20250205-104732-fceratto.json [10:47:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1237', diff saved to https://phabricator.wikimedia.org/P73217 and previous config saved to /var/cache/conftool/dbconfig/20250205-104742-marostegui.json [10:47:49] (03CR) 10Hnowlan: [C:03+1] mw-parsoid & mw-jobrunner serve 2% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117511 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [10:48:04] (03CR) 10Clément Goubert: [C:03+1] shellbox: all replicas on PHP 8.1 (score) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117506 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [10:48:24] (03CR) 10Hnowlan: [C:03+1] shellbox: all replicas on PHP 8.1 (score) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117506 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [10:49:05] (03PS2) 10Effie Mouzeli: shellbox-media: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116838 (https://phabricator.wikimedia.org/T377038) [10:49:21] (03CR) 10Clément Goubert: [C:03+1] shellbox-media: 1 replica on 8.1 for each DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116838 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [10:51:02] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1237.eqiad.wmnet onto db1179.eqiad.wmnet [10:58:42] urbanecm: anything outstanding train wise? 
[10:59:05] urbanecm: ach, thx for the headsup [10:59:08] effie: not from my side, but i'm not the conductor [10:59:17] effie: nope, nothing from my side [10:59:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73218 and previous config saved to /var/cache/conftool/dbconfig/20250205-105928-root.json [11:00:04] effie and swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC mid-day). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T1100). [11:00:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:01:39] (03CR) 10Vgutierrez: "looks good but /querybuilder currently downgrades requests to http:// for 301s even if `X-Forwarded-Proto` is set to `https`, well-known U" [puppet] - 10https://gerrit.wikimedia.org/r/1117498 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [11:02:12] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:02:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:03:07] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2221.codfw.wmnet with reason: Index rebuild [11:03:09] (03CR) 
10Vgutierrez: [C:03+1] trafficserver: move /querybuilder before catch-all [puppet] - 10https://gerrit.wikimedia.org/r/1117498 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [11:03:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:03:16] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Index rebuild [11:04:02] (03CR) 10Hnowlan: "mostly lgtm, some style nice-to-haves" [puppet] - 10https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [11:05:03] (03CR) 10Jelto: [C:03+2] trafficserver: move /querybuilder before catch-all [puppet] - 10https://gerrit.wikimedia.org/r/1117498 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [11:05:55] (03CR) 10Hnowlan: [C:03+1] "lgtm once the puppet change is in" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117503 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [11:06:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1179', diff saved to https://phabricator.wikimedia.org/P73219 and previous config saved to /var/cache/conftool/dbconfig/20250205-110628-marostegui.json [11:07:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 7%: Pooling in', diff saved to https://phabricator.wikimedia.org/P73220 and previous config saved to /var/cache/conftool/dbconfig/20250205-110731-fceratto.json [11:07:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:11:00] (03PS22) 10Clément Goubert: mediawiki: Add kubernetes periodic job support [puppet] - 
10https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596) [11:11:00] (03PS12) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) [11:11:53] !log bounce thanos-query on titan1002 [11:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:15] (03PS23) 10Clément Goubert: mediawiki: Add kubernetes periodic job support [puppet] - 10https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596) [11:13:15] (03PS13) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) [11:14:04] (03CR) 10Clément Goubert: mediawiki: Add kubernetes periodic job support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [11:14:42] (03CR) 10Hnowlan: [C:03+1] mediawiki: Add kubernetes periodic job support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [11:14:43] (03CR) 10Clément Goubert: mediawiki: Add kubernetes periodic job support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [11:14:50] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [11:15:42] RESOLVED: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:22:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 10%: 
Pooling in', diff saved to https://phabricator.wikimedia.org/P73221 and previous config saved to /var/cache/conftool/dbconfig/20250205-112236-fceratto.json [11:22:41] (03CR) 10Effie Mouzeli: [C:03+2] shellbox: all replicas on PHP 8.1 (score) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117506 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:24:12] (03Merged) 10jenkins-bot: shellbox: all replicas on PHP 8.1 (score) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117506 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [11:24:34] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid & mw-jobrunner serve 2% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117511 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [11:25:09] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [11:25:50] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Add kubernetes periodic job support [puppet] - 10https://gerrit.wikimedia.org/r/1117222 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [11:25:54] (03Merged) 10jenkins-bot: mw-parsoid & mw-jobrunner serve 2% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117511 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [11:25:59] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [11:27:16] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [11:27:30] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [11:28:03] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [11:28:26] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [11:31:00] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [11:31:04] !log fnegri@cumin1002 conftool action : set/pooled=no; 
selector: name=clouddb1017.eqiad.wmnet,service=31 [11:31:10] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3 [11:31:44] PROBLEM - MariaDB Replica Lag: s2 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7085.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:31:54] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Rebooting clouddb1017 T384946 [11:32:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1155.eqiad.wmnet with reason: Rebuild tables [11:32:44] PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7144.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:33:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Rebuild tables [11:33:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1014.eqiad.wmnet with reason: Rebuild tables [11:33:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on clouddb1018.eqiad.wmnet with reason: Rebuild tables [11:34:06] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [11:34:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb1014.eqiad.wmnet with reason: Rebuild tables [11:34:27] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [11:36:05] (03PS2) 10Clément Goubert: mw-cron: Add puppet-defined periodic jobs file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117503 (https://phabricator.wikimedia.org/T385596) [11:37:42] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 15%: Pooling in', diff saved to https://phabricator.wikimedia.org/P73222 and previous config saved to /var/cache/conftool/dbconfig/20250205-113741-fceratto.json [11:37:56] (03PS14) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) [11:37:56] (03PS1) 10Clément Goubert: kubernetes_periodic_job: Fix title in job template [puppet] - 10https://gerrit.wikimedia.org/r/1117516 (https://phabricator.wikimedia.org/T385596) [11:38:06] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1017.eqiad.wmnet [11:39:38] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117516 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [11:39:45] 06SRE, 06Infrastructure-Foundations, 10netops: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - https://phabricator.wikimedia.org/T379549#10524854 (10cmooney) 05Open→03Resolved [11:39:56] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [11:41:29] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1017.eqiad.wmnet [11:41:42] PROBLEM - mysqld processes on clouddb1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:41:46] PROBLEM - MariaDB Replica IO: s1 on clouddb1017 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:41:46] PROBLEM - MariaDB Replica SQL: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:41:46] 
PROBLEM - MariaDB Replica SQL: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:41:46] PROBLEM - MariaDB Replica IO: s3 on clouddb1017 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:41:48] PROBLEM - MariaDB read only s3 on clouddb1017 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:41:48] PROBLEM - MariaDB read only wikireplica-s3 on clouddb1017 is CRITICAL: Could not connect to localhost:3313 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:41:48] PROBLEM - MariaDB read only s1 on clouddb1017 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:41:48] PROBLEM - MariaDB read only wikireplica-s1 on clouddb1017 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:41:59] dhinus: that's you right ^? 
[11:42:25] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [11:42:37] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [11:42:45] marostegui: yep [11:42:53] I thought I silenced it though [11:44:05] (03CR) 10Clément Goubert: [C:03+2] kubernetes_periodic_job: Fix title in job template [puppet] - 10https://gerrit.wikimedia.org/r/1117516 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [11:45:44] PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:46:44] (03PS1) 10Marostegui: x1: Change format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/1117517 (https://phabricator.wikimedia.org/T385645) [11:46:46] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:47:17] (03CR) 10Marostegui: [C:03+2] x1: Change format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/1117517 (https://phabricator.wikimedia.org/T385645) (owner: 10Marostegui) [11:48:28] dhinus: Lately my impression is that lots of downtimes get lost [11:49:19] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117518 [11:49:29] marostegui: there is definitely something odd: https://sal.toolforge.org/log/4zvh1ZQBffdvpiTrhsuR [11:49:43] it should be downtimed for 1 hour [11:49:59] "Created silence ID 266a2b12-14a7-4728-ad13-d4309d19dfd6" [11:50:31] 06SRE, 06Infrastructure-Foundations, 10netops: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes - https://phabricator.wikimedia.org/T381175#10524944 (10cmooney) 05Open→03Resolved >>! In T381175#10520327, @ayounsi wrote: > For (1) we can have the `sre.ganeti.addnode` cookbook call... 
[11:50:56] (03PS1) 10Filippo Giunchedi: statograph: update mw edit rate to use thanos [puppet] - 10https://gerrit.wikimedia.org/r/1117519 (https://phabricator.wikimedia.org/T383963) [11:51:04] dhinus: Yeah, I've had the same for a few days [11:51:07] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117520 [11:52:42] RECOVERY - mysqld processes on clouddb1017 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:52:44] RECOVERY - MariaDB Replica SQL: s1 on clouddb1017 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:44] RECOVERY - MariaDB Replica IO: s1 on clouddb1017 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:46] RECOVERY - MariaDB Replica SQL: s3 on clouddb1017 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:46] RECOVERY - MariaDB Replica IO: s3 on clouddb1017 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:52:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 20%: Pooling in', diff saved to https://phabricator.wikimedia.org/P73223 and previous config saved to /var/cache/conftool/dbconfig/20250205-115247-fceratto.json [11:52:50] RECOVERY - MariaDB read only s3 on clouddb1017 is OK: Version 10.6.20-MariaDB, Uptime 59s, read_only: True, event_scheduler: False, 380.72 QPS, connection latency: 0.015028s, query latency: 0.000355s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:52:50] RECOVERY - MariaDB read only wikireplica-s1 on clouddb1017 is OK: Version 10.6.20-MariaDB, Uptime 56s, read_only: True, event_scheduler: False, 
940.20 QPS, connection latency: 0.023746s, query latency: 0.000396s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:52:50] RECOVERY - MariaDB read only s1 on clouddb1017 is OK: Version 10.6.20-MariaDB, Uptime 56s, read_only: True, event_scheduler: False, 976.20 QPS, connection latency: 0.020599s, query latency: 0.000470s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:52:50] RECOVERY - MariaDB read only wikireplica-s3 on clouddb1017 is OK: Version 10.6.20-MariaDB, Uptime 59s, read_only: True, event_scheduler: False, 380.17 QPS, connection latency: 0.014857s, query latency: 0.000339s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [11:52:55] (03CR) 10Fabfur: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [11:52:57] (03PS1) 10Ladsgroup: Set categorylinks to write both everywhere except commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117521 (https://phabricator.wikimedia.org/T385164) [11:53:44] RECOVERY - MariaDB Replica Lag: s3 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:53:51] (03PS15) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) [11:54:06] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [11:54:10] dhinus: can I start rebuilding indexes on clouddb1017? [11:54:38] I've just restarted mariadb there, and restarted replication, so I think yes! [11:54:46] thank you! 
[11:55:46] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:56:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1017.eqiad.wmnet with reason: Rebuild tables [11:58:12] (03PS1) 10Clément Goubert: mediawiki::periodic_job: Fix kubernetes conditional [puppet] - 10https://gerrit.wikimedia.org/r/1117522 (https://phabricator.wikimedia.org/T385596) [11:58:14] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117522 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [12:00:05] mvolz: Your horoscope predicts another Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T1200). [12:00:34] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [12:00:38] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [12:00:47] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s3 [12:00:54] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [12:01:19] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117520 (owner: 10PipelineBot) [12:01:51] (03CR) 10Clément Goubert: [C:03+2] mediawiki::periodic_job: Fix kubernetes conditional [puppet] - 10https://gerrit.wikimedia.org/r/1117522 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [12:02:29] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117520 (owner: 10PipelineBot) [12:03:02] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:03:41] !log mvolz@deploy2002 
helmfile [staging] DONE helmfile.d/services/citoid: apply [12:04:04] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:04:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:05:00] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:06:30] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:07:16] 06SRE, 06serviceops, 10Wikidata, 10Wikidata Integration in Wikimedia projects, 10Wikimedia-Site-requests: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10525034 (10Marostegui) I am tagging #serviceops here to see if this is something they can help with. [12:07:25] (03PS1) 10Hnowlan: mediawiki: miscellaneous bits of jobrunner cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1117525 (https://phabricator.wikimedia.org/T354791) [12:07:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 25%: Pooling in', diff saved to https://phabricator.wikimedia.org/P73224 and previous config saved to /var/cache/conftool/dbconfig/20250205-120752-fceratto.json [12:08:38] (03CR) 10Clément Goubert: [C:03+2] mw-cron: Add puppet-defined periodic jobs file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117503 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [12:09:26] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:09:44] (03Merged) 10jenkins-bot: mw-cron: Add puppet-defined periodic jobs file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117503 (https://phabricator.wikimedia.org/T385596) (owner: 10Clément Goubert) [12:09:57] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: 
apply [12:10:05] (03CR) 10Cathal Mooney: [C:03+1] "I should have done this earlier good shout." [puppet] - 10https://gerrit.wikimedia.org/r/1117154 (https://phabricator.wikimedia.org/T382518) (owner: 10Ayounsi) [12:12:10] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:12:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:12:22] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [12:12:28] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [12:14:48] RECOVERY - MariaDB Replica Lag: s1 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:15:26] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117518 (owner: 10PipelineBot) [12:15:30] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:17:25] (03CR) 10Ayounsi: [C:03+2] Remove eqiad and eqsin ripe atlas from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1117154 (https://phabricator.wikimedia.org/T382518) (owner: 10Ayounsi) [12:17:42] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:20:01] (03CR) 10Nikerabbit: [C:03+1] Make MT limit more strict by 10 Percentage Point in Bhojpuri Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117113 (https://phabricator.wikimedia.org/T383789) (owner: 10KartikMistry) [12:22:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 30%: Pooling in', diff saved to https://phabricator.wikimedia.org/P73225 and previous config saved to 
/var/cache/conftool/dbconfig/20250205-122257-fceratto.json [12:24:10] (03CR) 10Clément Goubert: [C:03+1] mediawiki: miscellaneous bits of jobrunner cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1117525 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:38:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 35%: Pooling in', diff saved to https://phabricator.wikimedia.org/P73226 and previous config saved to /var/cache/conftool/dbconfig/20250205-123803-fceratto.json [12:41:57] (03PS1) 10Arnaudb: rt: removing email configurations [puppet] - 10https://gerrit.wikimedia.org/r/1117528 (https://phabricator.wikimedia.org/T384595) [12:42:10] (03PS1) 10Arnaudb: rt: removing informations about moscovium [puppet] - 10https://gerrit.wikimedia.org/r/1117529 (https://phabricator.wikimedia.org/T384595) [12:42:46] (03Abandoned) 10D3r1ck01: SUL3: Allow temp users to authenticate (login/signup) via the API [extensions/CentralAuth] (wmf/1.44.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1115106 (https://phabricator.wikimedia.org/T384523) (owner: 10D3r1ck01) [12:42:50] (03CR) 10Arnaudb: "this should be merged after the decommission cookbook is run on moscovium" [puppet] - 10https://gerrit.wikimedia.org/r/1117529 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [12:46:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1237.eqiad.wmnet onto db1179.eqiad.wmnet [12:48:04] (03PS1) 10Marostegui: db1179: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1117532 (https://phabricator.wikimedia.org/T385645) [12:48:44] (03CR) 10Marostegui: [C:03+2] db1179: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1117532 (https://phabricator.wikimedia.org/T385645) (owner: 10Marostegui) [12:50:30] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1237.eqiad.wmnet onto db1179.eqiad.wmnet [12:51:32] (03PS1) 10Marostegui: db1237: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1117533 (https://phabricator.wikimedia.org/T385645) [12:52:07] (03CR) 10Marostegui: [C:03+2] db1237: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1117533 (https://phabricator.wikimedia.org/T385645) (owner: 10Marostegui) [12:52:47] RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:52:47] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:52:47] RECOVERY - MariaDB Replica Lag: s2 on db1155 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:52:59] RECOVERY - MariaDB Replica Lag: s2 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:53:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73227 and previous config saved to /var/cache/conftool/dbconfig/20250205-125259-root.json [12:53:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 50%: Pooling in', diff saved to https://phabricator.wikimedia.org/P73228 and previous config saved to /var/cache/conftool/dbconfig/20250205-125308-fceratto.json [12:54:00] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10525215 (10cmooney) >>! In T384731#10516013, @ayounsi wrote: > An alternative (or short term solution until the ab... 
[12:56:10] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Migrate port utilisation alert from LibreNMS to alertmanager - https://phabricator.wikimedia.org/T384052#10525220 (10cmooney) >>! In T384052#10516521, @ayounsi wrote: > I'm wondering if we could re-write the "instance" in Prometheus t... [12:57:08] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:59:32] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:59:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73230 and previous config saved to /var/cache/conftool/dbconfig/20250205-130804-root.json [13:08:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'db1251 (re)pooling @ 75%: Pooling in', diff saved to https://phabricator.wikimedia.org/P73231 and previous config saved to /var/cache/conftool/dbconfig/20250205-130813-fceratto.json [13:09:16] (03PS1) 10MVernon: swift: remove drained codfw nodes from the rings [puppet] - 10https://gerrit.wikimedia.org/r/1117535 (https://phabricator.wikimedia.org/T382056) 
[13:09:18] (03PS1) 10MVernon: swift: remove ms-be205[1-6] from profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1117536 (https://phabricator.wikimedia.org/T382056) [13:10:51] (03CR) 10Cathal Mooney: [C:03+2] Modifications to CR BGP policy for eqiad cloud-private IPv6 aggregate (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1112268 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [13:10:54] (03CR) 10Marostegui: [C:03+1] swift: remove ms-be205[1-6] from profile::swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/1117536 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon) [13:11:07] (03CR) 10Marostegui: [C:03+1] swift: remove drained codfw nodes from the rings [puppet] - 10https://gerrit.wikimedia.org/r/1117535 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon) [13:11:15] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:11:37] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:13:07] (03PS1) 10Cathal Mooney: Add semicolon to end of prefix in cloud6 prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/1117538 (https://phabricator.wikimedia.org/T37947) [13:13:29] (03PS16) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) [13:14:53] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Use FIDO2 ssh keys for production access - https://phabricator.wikimedia.org/T385229#10525301 (10cmooney) >>! In T385229#10520528, @taavi wrote: > FWIW, this is possible as of today, my account for example is exclusively using them for Bullseye+ hosts.... 
[13:14:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T384592)', diff saved to https://phabricator.wikimedia.org/P73232 and previous config saved to /var/cache/conftool/dbconfig/20250205-131456-marostegui.json [13:15:00] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [13:15:43] (03CR) 10CI reject: [V:04-1] mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [13:17:29] RECOVERY - MariaDB Replica Lag: s1 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:17:49] (03PS17) 10Clément Goubert: mediawiki: Migrate one dry-run job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) [13:18:12] (03CR) 10MVernon: [C:03+2] swift: remove drained codfw nodes from the rings [puppet] - 10https://gerrit.wikimedia.org/r/1117535 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon) [13:21:16] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [13:21:29] (03PS1) 10Filippo Giunchedi: prometheus: add per user breakdown to mw edit rates [puppet] - 10https://gerrit.wikimedia.org/r/1117539 (https://phabricator.wikimedia.org/T383963) [13:22:19] (03CR) 10Ladsgroup: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1116846 (https://phabricator.wikimedia.org/T383902) (owner: 10Jcrespo) [13:23:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73233 and previous config saved to /var/cache/conftool/dbconfig/20250205-132309-root.json [13:23:19] !log fceratto@cumin1002 dbctl commit 
(dc=all): 'db1251 (re)pooling @ 100%: Pooling in', diff saved to https://phabricator.wikimedia.org/P73234 and previous config saved to /var/cache/conftool/dbconfig/20250205-132319-fceratto.json [13:24:23] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [13:24:27] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [13:24:47] !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:25:25] !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [13:27:54] (03CR) 10Clément Goubert: [C:03+1] prometheus: add per user breakdown to mw edit rates [puppet] - 10https://gerrit.wikimedia.org/r/1117539 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [13:29:59] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: add per user breakdown to mw edit rates [puppet] - 10https://gerrit.wikimedia.org/r/1117539 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [13:30:03] (03PS2) 10Filippo Giunchedi: prometheus: add per user breakdown to mw edit rates [puppet] - 10https://gerrit.wikimedia.org/r/1117539 (https://phabricator.wikimedia.org/T383963) [13:30:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P73235 and previous config saved to /var/cache/conftool/dbconfig/20250205-133003-marostegui.json [13:30:15] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] prometheus: add per user breakdown to mw edit rates [puppet] - 10https://gerrit.wikimedia.org/r/1117539 (https://phabricator.wikimedia.org/T383963) (owner: 10Filippo Giunchedi) [13:31:29] (03CR) 10Ayounsi: [C:03+1] Add semicolon to end of prefix in cloud6 prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/1117538 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [13:32:08] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on 
cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:32:39] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10525385 (10Andrew) This is flapping like crazy, I ack'd it before bed last night but have another 15 alert messages this morning. [13:34:27] (03CR) 10Elukey: Add interative.ask_yesno (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1115767 (owner: 10JMeybohm) [13:34:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:37:08] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73236 and previous config saved to /var/cache/conftool/dbconfig/20250205-133815-root.json [13:39:32] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:45:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to 
https://phabricator.wikimedia.org/P73237 and previous config saved to /var/cache/conftool/dbconfig/20250205-134510-marostegui.json [13:45:48] (03CR) 10Lucas Werkmeister (WMDE): Add sourceswiki to $wgImportSources for all Wikisources (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) (owner: 10Jon Harald Søby) [13:47:50] (03CR) 10Lucas Werkmeister (WMDE): "Yesterday’s change for a draft namespace, I4ebe6927ae, also added the namespace to `wmgExemptFromUserRobotsControlExtra` – would that make" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117321 (https://phabricator.wikimedia.org/T385593) (owner: 10Anzx) [13:48:55] (03CR) 10Kamila Součková: "I have questions!" [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [13:49:37] !log deploy removal of old hosts for the m1 dbbackups backup user T383871 [13:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:40] T383871: decommission dbprov1001, dbprov1002 - https://phabricator.wikimedia.org/T383871 [13:49:44] (03CR) 10Kamila Součková: "[marking as not resolved]" [puppet] - 10https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: 10Clément Goubert) [13:52:54] (03PS5) 10Jcrespo: dbbackups: Remove last references to dbprov[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/1116846 (https://phabricator.wikimedia.org/T383902) [13:53:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73238 and previous config saved to /var/cache/conftool/dbconfig/20250205-135320-root.json [13:57:16] (03CR) 10Anzx: "i think it should be done if community request, good to it as it is for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117321 (https://phabricator.wikimedia.org/T385593) (owner: 10Anzx) 
[13:57:39] (03CR) 10Cathal Mooney: [C:03+2] Add semicolon to end of prefix in cloud6 prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/1117538 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [13:58:16] (03Merged) 10jenkins-bot: Add semicolon to end of prefix in cloud6 prefix list [homer/public] - 10https://gerrit.wikimedia.org/r/1117538 (https://phabricator.wikimedia.org/T37947) (owner: 10Cathal Mooney) [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T1400). [14:00:05] Jhs and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T384592)', diff saved to https://phabricator.wikimedia.org/P73240 and previous config saved to /var/cache/conftool/dbconfig/20250205-140017-marostegui.json [14:00:20] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592 [14:00:32] o/ [14:00:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1248.eqiad.wmnet with reason: Maintenance [14:00:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T384592)', diff saved to https://phabricator.wikimedia.org/P73241 and previous config saved to /var/cache/conftool/dbconfig/20250205-140039-marostegui.json [14:02:31] o/ [14:02:54] (03CR) 10Lucas Werkmeister (WMDE): "Well, the community did request “web indexing: not indexed” according to the task." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117321 (https://phabricator.wikimedia.org/T385593) (owner: 10Anzx) [14:04:12] (03PS2) 10Jon Harald Søby: Add sourceswiki to $wgImportSources for all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) [14:05:01] Lucas_WMDE: i will update my patch [14:05:46] ok [14:05:51] * Lucas_WMDE looks at Jhs PS2 [14:05:51] (03CR) 10Jon Harald Søby: Add sourceswiki to $wgImportSources for all Wikisources (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) (owner: 10Jon Harald Søby) [14:07:24] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Add sourceswiki to $wgImportSources for all Wikisources (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) (owner: 10Jon Harald Søby) [14:08:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10525488 (10elukey) @Neobeta61 Hi! I just followed up on the email threads, I didn't get any response so far, I tried to summarize my un... [14:08:49] Jhs: are you ready for the deployment window? 
[14:08:56] Lucas_WMDE, yup [14:09:00] ok, then let’s start [14:09:11] (03CR) 10Jon Harald Søby: Add sourceswiki to $wgImportSources for all Wikisources (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) (owner: 10Jon Harald Søby) [14:09:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) (owner: 10Jon Harald Søby) [14:10:33] (03Merged) 10jenkins-bot: Add sourceswiki to $wgImportSources for all Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) (owner: 10Jon Harald Søby) [14:11:00] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1117204|Add sourceswiki to $wgImportSources for all Wikisources (T385591)]] [14:11:02] T385591: $wgImportSources for Wikisources should include the multilingual Wikisource by default - https://phabricator.wikimedia.org/T385591 [14:11:04] (03CR) 10Lucas Werkmeister (WMDE): Add sourceswiki to $wgImportSources for all Wikisources (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117204 (https://phabricator.wikimedia.org/T385591) (owner: 10Jon Harald Søby) [14:11:12] (03PS3) 10Anzx: kywiki: create draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117321 (https://phabricator.wikimedia.org/T385593) [14:11:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73243 and previous config saved to /var/cache/conftool/dbconfig/20250205-141131-root.json [14:12:56] (03CR) 10Anzx: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117321 (https://phabricator.wikimedia.org/T385593) (owner: 10Anzx) [14:14:22] one of the checks failed, 
https://movementroles.wikimedia.org/wiki/Main_Page gave 503 [14:14:27] retrying [14:14:59] !log lucaswerkmeister-wmde@deploy2002 jhsoby, lucaswerkmeister-wmde: Backport for [[gerrit:1117204|Add sourceswiki to $wgImportSources for all Wikisources (T385591)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:15:08] now it worked 🤷 [14:15:10] (03PS4) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [14:15:10] (03PS2) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [14:15:11] (03PS1) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [14:15:11] (03PS1) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [14:15:41] why everything had a private wiki [14:16:08] It was 2008, it was the cool thing. [14:16:13] Amir1, cause when you were in a private wiki, you were *the shit* [14:16:32] :D [14:16:33] i remember getting access to internalwiki back in 2006, and i was indeed the shit [14:16:41] lol [14:17:50] can’t find the error in logstash [14:17:58] do private wikis not send errors to logstash? 
[14:18:11] they should, I've seen some from officewiki [14:18:16] (it’s probably safe to ignore but I’d like to know what’s going on) [14:18:20] Jhs: please test, by the way ^^ [14:18:41] Lucas_WMDE, already on it, works like a charm so far [14:18:49] \o/ [14:20:35] would be nice if httpbb dropped the test output somewhere in /tmp, I think [14:20:37] “Body: expected to contain 'Movement Roles', got '\n\n (03PS5) 10Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:41:06] (03CR) 10Elukey: [C:03+1] webrequest-live: new X-Analytics Authorization subkey [puppet] - 10https://gerrit.wikimedia.org/r/1117550 (owner: 10CDanis) [14:41:28] (03Merged) 10jenkins-bot: kywiki: create draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1117321 (https://phabricator.wikimedia.org/T385593) (owner: 10Anzx) [14:41:39] (03CR) 10Andrew Bogott: "done" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:41:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73245 and previous config saved to /var/cache/conftool/dbconfig/20250205-144141-root.json [14:41:43] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:41:57] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1117321|kywiki: create draft namespace (T385593)]] [14:42:00] T385593: New namespace ("Макала долбоору") for the Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T385593 [14:43:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1237.eqiad.wmnet onto db1179.eqiad.wmnet [14:43:50] !log deploy new grants to analytics_meta 
T385565 [14:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:53] T385565: Some analytics_meta databases are not being backed up - https://phabricator.wikimedia.org/T385565 [14:44:04] (03CR) 10Elukey: "Can you run pcc again to confirm :) ?" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:44:59] !log lucaswerkmeister-wmde@deploy2002 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1117321|kywiki: create draft namespace (T385593)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:45:14] Lucas_WMDE: checking [14:45:51] (03PS6) 10Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:45:59] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:46:01] thanks! 
[14:46:06] Lucas_WMDE: looks good [14:46:21] !log lucaswerkmeister-wmde@deploy2002 anzx, lucaswerkmeister-wmde: Continuing with sync [14:46:22] \o/ [14:46:23] (03CR) 10CDanis: [C:03+2] webrequest-live: new X-Analytics Authorization subkey [puppet] - 10https://gerrit.wikimedia.org/r/1117550 (owner: 10CDanis) [14:46:24] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cr2-magru with reason: IBGP instability from cr1 to cr2 in magru causing ping failures from alert1002 [14:46:51] (03CR) 10Elukey: sysctl: Introduce base::sysctl::inotify helper (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:48:56] (03CR) 10Fabfur: [C:03+1] webrequest-live: new X-Analytics Authorization subkey [puppet] - 10https://gerrit.wikimedia.org/r/1117550 (owner: 10CDanis) [14:49:11] 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10525590 (10cmooney) I've added BFD to this particular session now. Not that it will fix things but it should give us more datapoints for the (likely) case with Ju... 
[14:49:14] (03PS7) 10Andrew Bogott: sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:52:20] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:52:52] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1117321|kywiki: create draft namespace (T385593)]] (duration: 10m 54s) [14:52:55] T385593: New namespace ("Макала долбоору") for the Kyrgyz Wikipedia - https://phabricator.wikimedia.org/T385593 [14:53:33] jouncebot: nowandnext [14:53:33] For the next 0 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T1400) [14:53:33] In 0 hour(s) and 6 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T1500) [14:53:36] !log UTC afternoon backport+config window done [14:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:38] (03CR) 10Elukey: sysctl: Introduce base::sysctl::inotify helper (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [14:53:55] I’d still love to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1116812 at some point if I can get a +1, but no need to overrun into the wikifunctions window for that ^^ [14:54:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73246 and previous config saved to /var/cache/conftool/dbconfig/20250205-145434-root.json [14:54:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:56:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73247 and previous config saved to /var/cache/conftool/dbconfig/20250205-145647-root.json [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T1500) [15:00:08] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-01-28-144249 to 2025-02-03-215824 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117551 (https://phabricator.wikimedia.org/T379977) [15:00:15] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-01-29-140344 to 2025-01-30-011236 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117552 [15:01:24] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:01:54] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:04:27] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:05:13] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:05:24] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:06:15] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:06:21] (03CR) 10Elukey: [C:03+1] sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [15:06:39] (03CR) 10Elukey: sysctl: Introduce base::sysctl::inotify helper [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [15:06:41] (03PS2) 10Giuseppe Lavagetto: Add a mediawiki-common release to 
mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [15:06:51] (03CR) 10Elukey: "need to check one thing first :)" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [15:07:14] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-01-28-144249 to 2025-02-03-215824 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117551 (https://phabricator.wikimedia.org/T379977) (owner: 10Jforrester) [15:07:57] (03PS1) 10Ayounsi: ganeti.addnode: run ImportPuppetDB script after node addition [cookbooks] - 10https://gerrit.wikimedia.org/r/1117554 (https://phabricator.wikimedia.org/T381175) [15:08:35] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-01-28-144249 to 2025-02-03-215824 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117551 (https://phabricator.wikimedia.org/T379977) (owner: 10Jforrester) [15:08:48] (03CR) 10Elukey: "+1 for the kubernetes part but https://puppet-compiler.wmflabs.org/output/1116888/2921/prometheus1005.eqiad.wmnet/index.html shows a chang" [puppet] - 10https://gerrit.wikimedia.org/r/1116888 (https://phabricator.wikimedia.org/T385530) (owner: 10BryanDavis) [15:09:00] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [15:09:22] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:09:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73248 and previous config saved to /var/cache/conftool/dbconfig/20250205-150940-root.json [15:09:43] !log reprepro included conftool 5.0.1-1 - T383324 [15:09:51] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:11:10] !log jforrester@deploy2002 helmfile [codfw] START 
helmfile.d/services/wikifunctions: apply
[15:11:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:11:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73249 and previous config saved to /var/cache/conftool/dbconfig/20250205-151152-root.json
[15:11:57] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:12:00] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:12:48] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:13:33] (CR) Jforrester: [C:+2] wikifunctions: Upgrade evaluators from 2025-01-29-140344 to 2025-01-30-011236 [deployment-charts] - https://gerrit.wikimedia.org/r/1117552 (owner: Jforrester)
[15:14:28] (CR) CI reject: [V:-1] ganeti.addnode: run ImportPuppetDB script after node addition [cookbooks] - https://gerrit.wikimedia.org/r/1117554 (https://phabricator.wikimedia.org/T381175) (owner: Ayounsi)
[15:14:51] (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2025-01-29-140344 to 2025-01-30-011236 [deployment-charts] - https://gerrit.wikimedia.org/r/1117552 (owner: Jforrester)
[15:15:15] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:15:56] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:18:15] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:19:01] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:19:05] !log jforrester@deploy2002 helmfile [eqiad] START
helmfile.d/services/wikifunctions: apply
[15:20:06] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:21:23] (PS2) Ayounsi: ganeti.addnode: run ImportPuppetDB script after node addition [cookbooks] - https://gerrit.wikimedia.org/r/1117554 (https://phabricator.wikimedia.org/T381175)
[15:24:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73250 and previous config saved to /var/cache/conftool/dbconfig/20250205-152445-root.json
[15:27:17] jouncebot: nowandnext
[15:27:17] For the next 0 hour(s) and 32 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T1500)
[15:27:17] In 2 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T1800)
[15:28:33] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10525803 (Jhancock.wm) I reset a what they asked me to inside the server yesterday. When you get a chance, @MatthewVernon can you see if that fixed the errors.? Thanks
[15:29:57] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10525807 (MatthewVernon) Hi, I'm afraid the answer is "no": ` Feb 5 15:23:01 ms-be2075 kernel: [71988.739632] sd 0:0:25:0: Power-on or device reset occurred Feb 5 15:23:02 ms...
[15:33:47] (CR) Ayounsi: "Moritz, is there a host I can test that change with ?"
[cookbooks] - https://gerrit.wikimedia.org/r/1117554 (https://phabricator.wikimedia.org/T381175) (owner: Ayounsi)
[15:39:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73251 and previous config saved to /var/cache/conftool/dbconfig/20250205-153951-root.json
[15:41:02] ops-magru, Infrastructure-Foundations, netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10525894 (ayounsi) Good idea regarding BFD. From https://supportportal.juniper.net/s/article/Observing-BGP-IO-ERROR-CLOSE-SESSION-error-logs-when-BGP-protocolgoes...
[15:51:19] !log finished deploying conftool 5.0.1-1 - T383324
[15:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:22] T383324: Prevent too many parsercache sections from being depooled - https://phabricator.wikimedia.org/T383324
[15:51:31] (PS1) Hnowlan: mobileapps: use correct port for eventgate [deployment-charts] - https://gerrit.wikimedia.org/r/1117560 (https://phabricator.wikimedia.org/T385718)
[15:54:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2221 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73252 and previous config saved to /var/cache/conftool/dbconfig/20250205-155456-root.json
[15:56:25] (CR) Jgiannelos: [C:+1] mobileapps: use correct port for eventgate [deployment-charts] - https://gerrit.wikimedia.org/r/1117560 (https://phabricator.wikimedia.org/T385718) (owner: Hnowlan)
[15:59:52] !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[16:00:14] !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[16:03:28] (CR) Hnowlan: [C:+2] mobileapps: use correct port for eventgate [deployment-charts] - https://gerrit.wikimedia.org/r/1117560
(https://phabricator.wikimedia.org/T385718) (owner: Hnowlan)
[16:04:35] (Merged) jenkins-bot: mobileapps: use correct port for eventgate [deployment-charts] - https://gerrit.wikimedia.org/r/1117560 (https://phabricator.wikimedia.org/T385718) (owner: Hnowlan)
[16:05:39] jouncebot: nowandnext
[16:05:39] No deployments scheduled for the next 1 hour(s) and 54 minute(s)
[16:05:39] In 1 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250205T1800)
[16:07:27] (PS3) Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1117225
[16:07:27] (PS2) Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - https://gerrit.wikimedia.org/r/1117547
[16:07:27] (PS3) Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548
[16:10:09] (CR) CI reject: [V:-1] Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548 (owner: Giuseppe Lavagetto)
[16:10:40] SRE, SRE Observability (FY2024/2025-Q3): etcd: adapt etcd-backup.py for etcd 3.4 - https://phabricator.wikimedia.org/T385727 (herron) NEW
[16:11:31] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 48113784 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:12:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 32848 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:15:54] (CR) Clément Goubert: "1. It will get applied on an `helmfile apply`, we can work around the potential gap on a case-by-case basis."
[puppet] - https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: Clément Goubert)
[16:17:39] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[16:19:55] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:21:22] SRE, SRE Observability (FY2024/2025-Q3): etcd: adapt etcd-backup.py for etcd 3.4 - https://phabricator.wikimedia.org/T385727#10526002 (herron) setting environment `ETCDCTL_API=2` for the backup script may be an option as well
[16:22:13] (CR) CDanis: [C:+1] hiera: enable json logging for benthos [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[16:22:24] (CR) Clément Goubert: "To be completely accurate, it will get applied on a subsequent `puppet` run, then an `helmfile apply`" [puppet] - https://gerrit.wikimedia.org/r/1117234 (https://phabricator.wikimedia.org/T377963) (owner: Clément Goubert)
[16:22:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:23:58] (CR) Fabfur: [C:+2] hiera: enable json logging for benthos [puppet] - https://gerrit.wikimedia.org/r/1116763 (https://phabricator.wikimedia.org/T383392) (owner: Fabfur)
[16:30:29] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[16:31:00] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[16:32:42] !log hnowlan@deploy2002 helmfile [codfw] START
helmfile.d/services/mobileapps: apply
[16:33:08] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[16:38:47] (PS1) Elukey: conftool-data: add wikikube workers to kartotherian-k8s-ssl [puppet] - https://gerrit.wikimedia.org/r/1117568 (https://phabricator.wikimedia.org/T216826)
[16:39:13] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10526052 (Jhancock.wm) big sigh. can i get another smartctl report to send to dell?
[16:41:35] (CR) Hnowlan: [C:+1] conftool-data: add wikikube workers to kartotherian-k8s-ssl [puppet] - https://gerrit.wikimedia.org/r/1117568 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[16:45:05] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10526061 (MatthewVernon) OK; same commands as before: ` mvernon@ms-be2075:~$ sudo smartctl --scan /dev/sda -d scsi # /dev/sda, SCSI device /dev/sdb -d scsi # /dev/sdb, SCSI dev...
[16:50:30] (CR) Dzahn: "removing from preseed.yaml and hierdata/requesttracker can go right away, but please keep the host in site.pp for now so we can apply the " [puppet] - https://gerrit.wikimedia.org/r/1117529 (https://phabricator.wikimedia.org/T384595) (owner: Arnaudb)
[16:50:46] (CR) Dzahn: [C:+1] rt: removing email configurations [puppet] - https://gerrit.wikimedia.org/r/1117528 (https://phabricator.wikimedia.org/T384595) (owner: Arnaudb)
[16:53:39] (CR) JHathaway: [C:+1] rt: removing email configurations [puppet] - https://gerrit.wikimedia.org/r/1117528 (https://phabricator.wikimedia.org/T384595) (owner: Arnaudb)
[16:58:35] (PS1) Dzahn: installserver: remove moscovium [puppet] - https://gerrit.wikimedia.org/r/1117572 (https://phabricator.wikimedia.org/T384595)