[00:01:16] (03PS2) 10Dzahn: tcpproxy: add config template [puppet] - 10https://gerrit.wikimedia.org/r/1200190 (https://phabricator.wikimedia.org/T408532) [00:08:54] (03PS1) 10Ahoelzl: Adding terms of use for download-index.html [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T40888) [00:10:33] (03CR) 10Ahoelzl: "Thanks for reviewing!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T40888) (owner: 10Ahoelzl) [00:13:16] (03PS3) 10Dzahn: tcpproxy: add config template and parameters [puppet] - 10https://gerrit.wikimedia.org/r/1200190 (https://phabricator.wikimedia.org/T408532) [00:15:26] (03CR) 10Ottomata: EventBus: Enable TYPE_EVENT for loginwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701) (owner: 10Kosta Harlan) [00:17:39] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on tcp-proxy1001.eqiad.wmnet with reason: in setup [00:18:46] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on tcp-proxy1002.eqiad.wmnet with reason: in setup [00:26:51] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:31:41] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:34:41] (03PS1) 10Bartosz Dziewoński: upload: Remove stashed file in UploadFromStash when upload completed [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200194 (https://phabricator.wikimedia.org/T408610) [00:35:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.45.0-wmf.25) - 
10https://gerrit.wikimedia.org/r/1200194 (https://phabricator.wikimedia.org/T408610) (owner: 10Bartosz Dziewoński) [00:36:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) (owner: 10Func) [00:38:01] (03PS5) 10Əkrəm: azwiktionary: use new wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) [00:38:30] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1200190/7522/tcp-proxy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1200190 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [00:38:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1200196 [00:38:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1200196 (owner: 10TrainBranchBot) [00:39:49] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [00:40:12] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [00:43:25] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [00:50:28] (03CR) 10Dzahn: [V:03+1 C:03+2] tcpproxy: add config template and parameters [puppet] - 10https://gerrit.wikimedia.org/r/1200190 (https://phabricator.wikimedia.org/T408532) (owner: 
10Dzahn) [00:56:47] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1200196 (owner: 10TrainBranchBot) [01:00:57] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:08:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1200197 [01:08:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1200197 (owner: 10TrainBranchBot) [01:09:00] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:15:06] !log upgraded envoyproxy on lists2001, aphlict1002, aphlict2001 T405808 [01:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:11] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [01:19:10] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 18m 13s) [01:22:51] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1200197 (owner: 10TrainBranchBot) [01:49:00] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199892 (https://phabricator.wikimedia.org/T403798) (owner: 10Tim Starling) [02:07:37] (03Merged) 10jenkins-bot: Enable ChangesListQuery partitioning on all wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199892 (https://phabricator.wikimedia.org/T403798) (owner: 10Tim Starling) [02:08:07] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1199892|Enable ChangesListQuery partitioning on all wikis (T403798)]] [02:08:12] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798 [02:12:27] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1199892|Enable ChangesListQuery partitioning on all wikis (T403798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [02:34:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:43:20] !log tstarling@deploy2002 tstarling: Continuing with sync [02:48:08] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199892|Enable ChangesListQuery partitioning on all wikis (T403798)]] (duration: 40m 01s) [02:48:13] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798 [02:48:51] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:18:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [04:26:55] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration 
- https://phabricator.wikimedia.org/T408892 (10Papaul) 03NEW [04:28:21] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11330353 (10Papaul) [04:28:22] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11330355 (10Papaul) [05:00:02] (03PS1) 10Ori: admin: add FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1200217 [05:00:54] (03CR) 10CI reject: [V:04-1] admin: add FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1200217 (owner: 10Ori) [05:03:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:04:00] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11330375 (10Papaul) p:05Triage→03Medium [05:04:04] (03PS2) 10Ori: admin: add FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1200217 [05:04:24] (03PS3) 10Ori: admin: add FIDO key for ori [puppet] - 10https://gerrit.wikimedia.org/r/1200217 [05:04:37] (03CR) 10Ori: "Signed using my currently-configured SSH key:" [puppet] - 10https://gerrit.wikimedia.org/r/1200217 (owner: 10Ori) [05:08:51] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:00] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:33:51] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:37:46] FIRING: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [05:39:53] 10ops-codfw, 06DC-Ops: Alert for device ps1-b5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T408893 (10phaultfinder) 03NEW [05:47:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [05:49:00] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:54:58] (03PS1) 10Marostegui: installserver: Do not format sretest2003 [puppet] - 10https://gerrit.wikimedia.org/r/1200221 [05:57:46] RESOLVED: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [05:57:47] (03CR) 10Marostegui: [C:03+2] installserver: Do not format sretest2003 [puppet] - 10https://gerrit.wikimedia.org/r/1200221 (owner: 10Marostegui) [06:00:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251031T0600) [06:00:12] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'Depooling db2152 (T407997)', diff saved to https://phabricator.wikimedia.org/P84503 and previous config saved to /var/cache/conftool/dbconfig/20251031-060012-marostegui.json [06:00:17] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [06:01:28] (03PS1) 10Marostegui: db2173: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200222 (https://phabricator.wikimedia.org/T407463) [06:02:57] (03CR) 10Marostegui: [C:03+2] db2173: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200222 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [06:04:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2173.codfw.wmnet with reason: Maintenance [06:04:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2173 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84504 and previous config saved to /var/cache/conftool/dbconfig/20251031-060405-marostegui.json [06:07:18] (03PS1) 10Marostegui: db1226: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200223 [06:09:30] (03CR) 10Marostegui: [C:03+2] db1226: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200223 (owner: 10Marostegui) [06:11:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1226.eqiad.wmnet with reason: Maintenance [06:11:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1226 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84505 and previous config saved to /var/cache/conftool/dbconfig/20251031-061110-marostegui.json [06:12:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2173 (re)pooling @ 10%: After upgrading', diff saved to https://phabricator.wikimedia.org/P84506 and previous config saved to 
/var/cache/conftool/dbconfig/20251031-061219-root.json [06:14:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T407997)', diff saved to https://phabricator.wikimedia.org/P84507 and previous config saved to /var/cache/conftool/dbconfig/20251031-061406-marostegui.json [06:14:12] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [06:19:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1226 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84508 and previous config saved to /var/cache/conftool/dbconfig/20251031-061904-root.json [06:27:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2173 (re)pooling @ 25%: After upgrading', diff saved to https://phabricator.wikimedia.org/P84509 and previous config saved to /var/cache/conftool/dbconfig/20251031-062725-root.json [06:29:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P84510 and previous config saved to /var/cache/conftool/dbconfig/20251031-062914-marostegui.json [06:34:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:34:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1226 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84511 and previous config saved to /var/cache/conftool/dbconfig/20251031-063410-root.json [06:42:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2173 (re)pooling @ 50%: After upgrading', diff saved to https://phabricator.wikimedia.org/P84512 and previous config saved to /var/cache/conftool/dbconfig/20251031-064231-root.json [06:44:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db2152', diff saved to https://phabricator.wikimedia.org/P84513 and previous config saved to /var/cache/conftool/dbconfig/20251031-064422-marostegui.json [06:47:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:48:53] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:49:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1226 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84514 and previous config saved to /var/cache/conftool/dbconfig/20251031-064916-root.json [06:57:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2173 (re)pooling @ 75%: After upgrading', diff saved to https://phabricator.wikimedia.org/P84515 and previous config saved to /var/cache/conftool/dbconfig/20251031-065737-root.json [06:59:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T407997)', diff saved to https://phabricator.wikimedia.org/P84516 and previous config saved to /var/cache/conftool/dbconfig/20251031-065929-marostegui.json [06:59:35] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [06:59:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [06:59:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2154 (T407997)', diff saved to https://phabricator.wikimedia.org/P84517 and previous config saved to /var/cache/conftool/dbconfig/20251031-065953-marostegui.json [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251031T0700) [07:04:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1226 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84518 and previous config saved to /var/cache/conftool/dbconfig/20251031-070422-root.json [07:12:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2173 (re)pooling @ 100%: After upgrading', diff saved to https://phabricator.wikimedia.org/P84519 and previous config saved to /var/cache/conftool/dbconfig/20251031-071243-root.json [07:13:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T407997)', diff saved to https://phabricator.wikimedia.org/P84520 and previous config saved to /var/cache/conftool/dbconfig/20251031-071344-marostegui.json [07:13:49] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:27:19] (03CR) 10Brouberol: [C:03+1] opensearch-cluster: fix chart typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200171 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [07:28:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P84521 and previous config saved to /var/cache/conftool/dbconfig/20251031-072852-marostegui.json [07:44:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P84522 and previous config saved to /var/cache/conftool/dbconfig/20251031-074359-marostegui.json [07:59:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T407997)', diff saved to https://phabricator.wikimedia.org/P84523 and previous config saved to /var/cache/conftool/dbconfig/20251031-075907-marostegui.json [07:59:13] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log 
table - https://phabricator.wikimedia.org/T407997 [07:59:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [07:59:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2161 (T407997)', diff saved to https://phabricator.wikimedia.org/P84524 and previous config saved to /var/cache/conftool/dbconfig/20251031-075931-marostegui.json [08:05:01] (03PS1) 10Marostegui: db1214: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200227 [08:05:39] (03CR) 10Marostegui: [C:03+2] db1214: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200227 (owner: 10Marostegui) [08:06:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1214.eqiad.wmnet with reason: Maintenance [08:06:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1214 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84525 and previous config saved to /var/cache/conftool/dbconfig/20251031-080633-marostegui.json [08:13:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T407997)', diff saved to https://phabricator.wikimedia.org/P84526 and previous config saved to /var/cache/conftool/dbconfig/20251031-081304-marostegui.json [08:13:10] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:14:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1214 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84527 and previous config saved to /var/cache/conftool/dbconfig/20251031-081417-root.json [08:22:59] (03PS3) 10Clément Goubert: trafficserver: action api to rest-gateway group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198931 (https://phabricator.wikimedia.org/T408223) [08:28:13] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P84528 and previous config saved to /var/cache/conftool/dbconfig/20251031-082812-marostegui.json
[08:29:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1214 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84529 and previous config saved to /var/cache/conftool/dbconfig/20251031-082923-root.json
[08:33:16] SRE, SRE-Access-Requests, Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11330697 (elukey) a:mark→None
[08:33:38] SRE, SRE-Access-Requests, Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11330699 (elukey) @calbon please review and approve when you have a moment :)
[08:43:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P84530 and previous config saved to /var/cache/conftool/dbconfig/20251031-084320-marostegui.json
[08:44:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1214 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84531 and previous config saved to /var/cache/conftool/dbconfig/20251031-084428-root.json
[08:52:33] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host sretest2010
[08:54:15] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2010
[08:56:11] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11330711 (elukey) It seems to work now! I powercycled it and I now see the console displaying some data, including Trixie booting. I have no idea what went wrong before but now...
[08:58:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T407997)', diff saved to https://phabricator.wikimedia.org/P84532 and previous config saved to /var/cache/conftool/dbconfig/20251031-085827-marostegui.json
[08:58:33] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[08:58:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance
[08:58:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2163 (T407997)', diff saved to https://phabricator.wikimedia.org/P84533 and previous config saved to /var/cache/conftool/dbconfig/20251031-085852-marostegui.json
[08:59:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1214 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84534 and previous config saved to /var/cache/conftool/dbconfig/20251031-085934-root.json
[09:00:42] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11330727 (cmooney) So in general we have tried to keep the subnetting of our IPv4 /24 consistent at POPs, following the template first set in drmrs (and now...
[09:00:50] (03PS1) 10Elukey: Revert "sretest2010: set to be installed like a new ms-be* node" [puppet] - 10https://gerrit.wikimedia.org/r/1200284 [09:01:09] (03CR) 10CI reject: [V:04-1] Revert "sretest2010: set to be installed like a new ms-be* node" [puppet] - 10https://gerrit.wikimedia.org/r/1200284 (owner: 10Elukey) [09:01:27] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1090.eqiad.wmnet with OS bullseye [09:01:37] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11330728 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1090.eqiad.wmnet with OS bullseye [09:02:39] (03PS2) 10Elukey: Revert "sretest2010: set to be installed like a new ms-be* node" [puppet] - 10https://gerrit.wikimedia.org/r/1200284 [09:07:14] (03PS3) 10Elukey: Revert "sretest2010: set to be installed like a new ms-be* node" [puppet] - 10https://gerrit.wikimedia.org/r/1200284 [09:08:21] (03PS4) 10Elukey: Revert "sretest2010: set to be installed like a new ms-be* node" [puppet] - 10https://gerrit.wikimedia.org/r/1200284 [09:09:00] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:11:39] (03CR) 10MVernon: [C:03+1] "Have fun :)" [puppet] - 10https://gerrit.wikimedia.org/r/1200284 (owner: 10Elukey) [09:12:03] (03CR) 10Elukey: [C:03+2] Revert "sretest2010: set to be installed like a new ms-be* node" [puppet] - 10https://gerrit.wikimedia.org/r/1200284 (owner: 10Elukey) [09:12:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T407997)', diff saved to https://phabricator.wikimedia.org/P84535 and previous config saved to 
/var/cache/conftool/dbconfig/20251031-091230-marostegui.json [09:12:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:14:32] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1090.eqiad.wmnet with reason: host reimage [09:17:36] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1090.eqiad.wmnet with reason: host reimage [09:17:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [09:18:12] looking ^ [09:27:22] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:27:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P84536 and previous config saved to /var/cache/conftool/dbconfig/20251031-092738-marostegui.json [09:32:09] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [09:32:33] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [09:32:55] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1090.eqiad.wmnet with OS bullseye [09:33:10] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11330795 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1090.eqiad.wmnet with OS 
bullseye complete... [09:33:24] (03CR) 10Marostegui: [C:03+1] site.pp, es2027.yaml: Decommission es2027 [puppet] - 10https://gerrit.wikimedia.org/r/1199821 (https://phabricator.wikimedia.org/T408406) (owner: 10Federico Ceratto) [09:33:38] (03CR) 10Federico Ceratto: [C:03+2] site.pp, es2027.yaml: Decommission es2027 [puppet] - 10https://gerrit.wikimedia.org/r/1199821 (https://phabricator.wikimedia.org/T408406) (owner: 10Federico Ceratto) [09:34:00] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:35:13] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [09:36:08] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2027.codfw.wmnet [09:37:54] (03PS1) 10MVernon: Return ms-be10{89,90} to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1200288 (https://phabricator.wikimedia.org/T400877) [09:39:20] (03PS1) 10Brouberol: Enable normal caching for growthbook.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1200289 (https://phabricator.wikimedia.org/T408415) [09:39:23] (03PS1) 10Brouberol: Expose the growthbook service publicly [puppet] - 10https://gerrit.wikimedia.org/r/1200290 (https://phabricator.wikimedia.org/T408415) [09:39:36] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:39:50] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:40:06] (03CR) 10Brouberol: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1200290 (https://phabricator.wikimedia.org/T408415) (owner: 10Brouberol) [09:40:37] (03CR) 10Brouberol: [C:03+1] Add 
OpenSearch cluster configs for net-new clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: Bking)
[09:41:05] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[09:42:39] (PS1) DCausse: cirrus: temporarily exclude loginwiki [deployment-charts] - https://gerrit.wikimedia.org/r/1200294
[09:42:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P84537 and previous config saved to /var/cache/conftool/dbconfig/20251031-094246-marostegui.json
[09:43:22] ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11330818 (MatthewVernon) Open→Resolved
[09:43:45] ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11330821 (MatthewVernon) @VRiley-WMF looks good now, thanks!
[09:43:48] (03CR) 10Superpes15: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [09:45:03] (03CR) 10DCausse: [C:03+2] cirrus: temporarily exclude loginwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200294 (owner: 10DCausse) [09:46:46] fceratto@cumin1003 decommission (PID 1490222) is awaiting input [09:46:49] (03Merged) 10jenkins-bot: cirrus: temporarily exclude loginwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200294 (owner: 10DCausse) [09:49:00] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:51] (03PS6) 10Əkrəm: azwiktionary: use new wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) [09:50:14] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [09:50:49] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:50:59] (03CR) 10Stevemunene: [C:03+1] opensearch-cluster: fix chart typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200171 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [09:51:35] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:53:11] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2027.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [09:53:31] I have a puppet change
pending, I'll merge it shortly [09:53:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [09:53:48] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2027.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [09:53:48] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:53:49] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2027.codfw.wmnet [09:54:41] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [09:54:47] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... 
[09:54:50] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [09:56:24] (03CR) 10Federico Ceratto: [C:03+2] Return ms-be10{89,90} to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1200288 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [09:56:51] (03CR) 10Federico Ceratto: [C:03+1] "The hostnames match the related task where they were reimaged" [puppet] - 10https://gerrit.wikimedia.org/r/1200288 (https://phabricator.wikimedia.org/T400877) (owner: 10MVernon) [09:57:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T407997)', diff saved to https://phabricator.wikimedia.org/P84538 and previous config saved to /var/cache/conftool/dbconfig/20251031-095754-marostegui.json [09:57:59] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:58:00] (03CR) 10Superpes15: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [09:58:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [09:58:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2164 (T407997)', diff saved to https://phabricator.wikimedia.org/P84539 and previous config saved to /var/cache/conftool/dbconfig/20251031-095818-marostegui.json [09:58:45] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [09:59:39] RESOLVED: 
CirrusSearchUpdaterKafkaMessagesInTooLow: ... [09:59:39] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... [09:59:39] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [09:59:50] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [10:02:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [10:03:11] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:05:10] elukey@cumin1003 reimage (PID 1514667) is awaiting input [10:05:29] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [10:06:42] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11330872 (10MatthewVernon) [10:07:00] cmooney@cumin1003 provision (PID 1518151) is awaiting input [10:07:14] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11330875 (10MatthewVernon) 
Updated in the light of review from Android and iOS folks - only change to our list of sizes is the addition o... [10:11:56] (03PS7) 10Əkrəm: azwiktionary: use new wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) [10:12:42] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [10:12:53] PROBLEM - Host sretest2006 is DOWN: PING CRITICAL - Packet loss = 100% [10:14:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T407997)', diff saved to https://phabricator.wikimedia.org/P84540 and previous config saved to /var/cache/conftool/dbconfig/20251031-101409-marostegui.json [10:14:14] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:15:06] (03CR) 10Superpes15: [C:03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [10:15:25] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es2027 - https://phabricator.wikimedia.org/T408406#11330928 (10FCeratto-WMF) [10:15:56] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission es2027 - https://phabricator.wikimedia.org/T408406#11330931 (10FCeratto-WMF) [10:16:31] (03PS1) 10Majavah: P:openstack::designate: Remove check_dns_query [puppet] - 10https://gerrit.wikimedia.org/r/1200306 [10:16:37] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [10:20:25] RECOVERY - Host sretest2006 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [10:23:08] (03CR) 10Superpes15: [C:03+1] azwiktionary: use new wordmark and 
tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [10:24:57] PROBLEM - Host sretest2006 is DOWN: PING CRITICAL - Packet loss = 100% [10:26:32] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 45014 [10:27:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 45014 [10:27:22] !log taavi@deploy2002 mwscript-k8s job started: namespaceDupes.php --wiki=crhwiki '--add-prefix=BROKEN ' --fix # T408284 [10:27:27] T408284: Request to create a namespace for Crimean Tatar Wikipedia - https://phabricator.wikimedia.org/T408284 [10:28:27] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [10:29:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P84541 and previous config saved to /var/cache/conftool/dbconfig/20251031-102916-marostegui.json [10:32:18] (03CR) 10Superpes15: [C:03+1] azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [10:32:25] RECOVERY - Host sretest2006 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [10:32:27] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1200290 (https://phabricator.wikimedia.org/T408415) (owner: 10Brouberol) [10:33:24] (03PS2) 10Federico Ceratto: site.pp, es2028.yaml: Decommission es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1199825 (https://phabricator.wikimedia.org/T408407) [10:33:24] (03PS6) 10Federico Ceratto: instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) 
[10:33:24] (03PS6) 10Federico Ceratto: instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) [10:33:24] (03PS6) 10Federico Ceratto: instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) [10:33:49] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [10:34:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:34:43] (03PS2) 10Federico Ceratto: site.pp, es2032.yaml: Decommission es2032 [puppet] - 10https://gerrit.wikimedia.org/r/1200310 (https://phabricator.wikimedia.org/T408411) [10:34:43] (03PS3) 10Federico Ceratto: site.pp, es2028.yaml: Decommission es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1199825 (https://phabricator.wikimedia.org/T408407) [10:34:43] (03PS7) 10Federico Ceratto: instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) [10:34:43] (03PS7) 10Federico Ceratto: instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) [10:34:44] (03PS7) 10Federico Ceratto: instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) [10:35:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:37:36] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts 
es2032.codfw.wmnet [10:39:24] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:41:32] (03PS1) 10Federico Ceratto: site.pp, es2033.yaml: Decommission es2033 [puppet] - 10https://gerrit.wikimedia.org/r/1200312 (https://phabricator.wikimedia.org/T408412) [10:41:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [10:42:12] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [10:42:36] (03PS1) 10Federico Ceratto: site.pp, es2034.yaml: Decommission es2034 [puppet] - 10https://gerrit.wikimedia.org/r/1200313 (https://phabricator.wikimedia.org/T408414) [10:44:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P84542 and previous config saved to /var/cache/conftool/dbconfig/20251031-104424-marostegui.json [10:46:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:46:53] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:47:32] (03PS2) 10Brouberol: Enable normal caching for growthbook.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1200289 (https://phabricator.wikimedia.org/T408903) [10:47:34] (03PS2) 10Brouberol: Expose the growthbook service publicly [puppet] - 10https://gerrit.wikimedia.org/r/1200290 (https://phabricator.wikimedia.org/T408903) [10:47:55] fceratto@cumin1003 decommission (PID 1552362) is awaiting input [10:47:55] (03PS1) 10Brouberol: Create the growthbook.wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1200317 
(https://phabricator.wikimedia.org/T408903) [10:50:05] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2032.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [10:52:42] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:52:58] (03CR) 10Stevemunene: [C:03+1] Create the growthbook.wikimedia.org subdomain [dns] - 10https://gerrit.wikimedia.org/r/1200317 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [10:53:09] fceratto@cumin1003 decommission (PID 1552362) is awaiting input [10:53:25] (03CR) 10Stevemunene: [C:03+1] Enable normal caching for growthbook.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1200289 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [10:53:41] (03CR) 10Stevemunene: [C:03+1] Expose the growthbook service publicly [puppet] - 10https://gerrit.wikimedia.org/r/1200290 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [10:55:37] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2032.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [10:55:37] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:55:38] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2032.codfw.wmnet [10:58:26] (03PS1) 10Majavah: P:toolforge: Improve tool overloaded error message [puppet] - 10https://gerrit.wikimedia.org/r/1200320 [10:58:55] (03CR) 10CI reject: [V:04-1] P:toolforge: Improve tool overloaded error message [puppet] - 10https://gerrit.wikimedia.org/r/1200320 (owner: 10Majavah) [10:59:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2164 (T407997)', diff saved to https://phabricator.wikimedia.org/P84543 and previous config saved to /var/cache/conftool/dbconfig/20251031-105932-marostegui.json [10:59:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:59:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [10:59:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2166 (T407997)', diff saved to https://phabricator.wikimedia.org/P84544 and previous config saved to /var/cache/conftool/dbconfig/20251031-105956-marostegui.json [10:59:57] (03PS2) 10Majavah: P:toolforge: Improve tool overloaded error message [puppet] - 10https://gerrit.wikimedia.org/r/1200320 [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251031T0700) [11:00:05] jelto, arnoldokoth, and mutante: Time to do the GitLab version upgrades deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251031T1100). 
[11:00:20] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2033.codfw.wmnet [11:00:26] (03CR) 10CI reject: [V:04-1] P:toolforge: Improve tool overloaded error message [puppet] - 10https://gerrit.wikimedia.org/r/1200320 (owner: 10Majavah) [11:01:29] (03PS3) 10Majavah: P:toolforge: Improve tool overloaded error message [puppet] - 10https://gerrit.wikimedia.org/r/1200320 [11:06:19] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [11:10:16] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [11:10:21] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS trixie [11:10:33] 10ops-codfw, 06SRE, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11330980 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest2009.codfw.wmnet with OS trixie [11:12:03] fceratto@cumin1003 decommission (PID 1575923) is awaiting input [11:14:43] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2033.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [11:15:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T407997)', diff saved to https://phabricator.wikimedia.org/P84545 and previous config saved to /var/cache/conftool/dbconfig/20251031-111544-marostegui.json [11:15:50] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:15:50] elukey@cumin1003 reimage (PID 1585267) is awaiting input [11:17:47] fceratto@cumin1003 decommission (PID 1575923) is awaiting input [11:21:21] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: es2033.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [11:21:21] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:21:22] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2033.codfw.wmnet [11:23:44] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2034.codfw.wmnet [11:28:50] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [11:30:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P84546 and previous config saved to /var/cache/conftool/dbconfig/20251031-113052-marostegui.json [11:30:56] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [11:32:14] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2034.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [11:32:33] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2034.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [11:32:34] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:32:35] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2034.codfw.wmnet [11:33:59] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mwdebug[1001-1002].eqiad.wmnet [11:37:45] (03PS1) 10Jcrespo: Transferer: Fix issue due to escaping where filenames with space failed [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) [11:39:09] (03CR) 10CI reject: [V:04-1] Transferer: Fix issue due to 
escaping where filenames with space failed [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 (https://phabricator.wikimedia.org/T393692) (owner: 10Jcrespo) [11:41:02] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [11:45:13] 07sre-alert-triage, 06SRE Observability (FY2025/2026-Q2): Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) - https://phabricator.wikimedia.org/T407484#11331030 (10tappof) 05Open→03Resolved [11:45:44] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mwdebug[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [11:46:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P84547 and previous config saved to /var/cache/conftool/dbconfig/20251031-114600-marostegui.json [11:46:19] (03PS1) 10Effie Mouzeli: api-gateway: removed mwdebug* hosts from networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200331 (https://phabricator.wikimedia.org/T397498) [11:46:56] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mwdebug[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [11:46:56] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:58] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwdebug[1001-1002].eqiad.wmnet [11:47:30] !log jiji@cumin1003 START - Cookbook sre.hosts.decommission for hosts mwdebug[2001-2002].codfw.wmnet [11:49:39] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [11:53:09] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS 
trixie [11:54:12] !log jiji@cumin1003 START - Cookbook sre.dns.netbox [11:54:37] (03PS1) 10Effie Mouzeli: site.pp: remove decommed mwdebug hosts [puppet] - 10https://gerrit.wikimedia.org/r/1200332 (https://phabricator.wikimedia.org/T397498) [11:55:00] (03PS2) 10Effie Mouzeli: api-gateway: remove mwdebug* hosts from networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200331 (https://phabricator.wikimedia.org/T397498) [11:56:26] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11331054 (10elukey) I spent a lot of time this morning trying to make a reimage working, but there seems to be something pathologically wrong about this host. I have the feeling t... [11:59:23] (03PS1) 10Vgutierrez: haproxy: Add python-httpx to ua_library_default ACL [puppet] - 10https://gerrit.wikimedia.org/r/1200334 [11:59:37] cmooney@cumin1003 reimage (PID 1583791) is awaiting input [11:59:58] jiji@cumin1003 decommission (PID 1623239) is awaiting input [12:01:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T407997)', diff saved to https://phabricator.wikimedia.org/P84548 and previous config saved to /var/cache/conftool/dbconfig/20251031-120108-marostegui.json [12:01:14] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:01:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [12:01:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2167 (T407997)', diff saved to https://phabricator.wikimedia.org/P84549 and previous config saved to /var/cache/conftool/dbconfig/20251031-120132-marostegui.json [12:02:38] (03PS2) 10Jcrespo: Transferer: Fix issue due to escaping where filenames with space failed [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200330 
(https://phabricator.wikimedia.org/T393692) [12:02:51] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1200334 (owner: 10Vgutierrez) [12:08:57] (03CR) 10Fabfur: [C:03+1] haproxy: Add python-httpx to ua_library_default ACL [puppet] - 10https://gerrit.wikimedia.org/r/1200334 (owner: 10Vgutierrez) [12:15:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T407997)', diff saved to https://phabricator.wikimedia.org/P84550 and previous config saved to /var/cache/conftool/dbconfig/20251031-121522-marostegui.json [12:15:28] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:16:07] !log jiji@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mwdebug[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [12:17:33] !log jiji@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mwdebug[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1003" [12:17:33] !log jiji@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:17:34] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwdebug[2001-2002].codfw.wmnet [12:19:00] (03PS1) 10Stevemunene: airflow: Update the pythonpath [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200339 (https://phabricator.wikimedia.org/T408711) [12:29:21] (03CR) 10Brouberol: [C:04-1] "You also need to bump the chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200339 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [12:30:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to 
https://phabricator.wikimedia.org/P84551 and previous config saved to /var/cache/conftool/dbconfig/20251031-123030-marostegui.json [12:37:02] (03PS2) 10Stevemunene: airflow: Update the pythonpath [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200339 (https://phabricator.wikimedia.org/T408711) [12:40:12] (03CR) 10Brouberol: [C:03+1] airflow: Update the pythonpath (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200339 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [12:45:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P84552 and previous config saved to /var/cache/conftool/dbconfig/20251031-124537-marostegui.json [12:49:39] (03CR) 10Jgiannelos: Allow proofread page to use parsoid when parsoid render is requested [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198537 (https://phabricator.wikimedia.org/T278481) (owner: 10Jgiannelos) [12:58:58] (03CR) 10Stevemunene: [C:03+2] airflow: Update the pythonpath [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200339 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [13:00:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T407997)', diff saved to https://phabricator.wikimedia.org/P84553 and previous config saved to /var/cache/conftool/dbconfig/20251031-130046-marostegui.json [13:00:53] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:01:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance [13:01:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2181 (T407997)', diff saved to https://phabricator.wikimedia.org/P84554 and previous config saved to /var/cache/conftool/dbconfig/20251031-130110-marostegui.json [13:01:15] 
(03Merged) 10jenkins-bot: airflow: Update the pythonpath [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200339 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [13:09:00] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:15:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T407997)', diff saved to https://phabricator.wikimedia.org/P84555 and previous config saved to /var/cache/conftool/dbconfig/20251031-131459-marostegui.json [13:15:05] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:15:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:12] (03CR) 10Bking: Adding terms of use for download-index.html (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T40888) (owner: 10Ahoelzl) [13:20:27] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:56] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:25:09] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: 
apply [13:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P84556 and previous config saved to /var/cache/conftool/dbconfig/20251031-133007-marostegui.json [13:30:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:00] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:59] (03PS1) 10Brouberol: airflow-platform-eng: allow task pods to egress to the urldownloader hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200354 (https://phabricator.wikimedia.org/T408238) [13:40:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:41:00] (03CR) 10Marostegui: [C:03+1] site.pp, es2032.yaml: Decommission es2032 [puppet] - 10https://gerrit.wikimedia.org/r/1200310 (https://phabricator.wikimedia.org/T408411) (owner: 10Federico Ceratto) [13:41:15] 
(03CR) 10Marostegui: [C:03+1] site.pp, es2034.yaml: Decommission es2034 [puppet] - 10https://gerrit.wikimedia.org/r/1200313 (https://phabricator.wikimedia.org/T408414) (owner: 10Federico Ceratto) [13:41:27] (03CR) 10Marostegui: [C:03+1] site.pp, es2033.yaml: Decommission es2033 [puppet] - 10https://gerrit.wikimedia.org/r/1200312 (https://phabricator.wikimedia.org/T408412) (owner: 10Federico Ceratto) [13:41:54] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) (owner: 10Federico Ceratto) [13:42:57] (03CR) 10Bking: [C:03+1] airflow-platform-eng: allow task pods to egress to the urldownloader hosts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200354 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [13:45:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P84557 and previous config saved to /var/cache/conftool/dbconfig/20251031-134514-marostegui.json [13:45:27] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:46:48] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11331273 (10calbon) I approve. 
[13:49:00] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11331290 (10Gehel) [14:00:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T407997)', diff saved to https://phabricator.wikimedia.org/P84558 and previous config saved to /var/cache/conftool/dbconfig/20251031-140022-marostegui.json [14:00:27] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:00:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2195.codfw.wmnet with reason: Maintenance [14:00:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2195 (T407997)', diff saved to https://phabricator.wikimedia.org/P84559 and previous config saved to /var/cache/conftool/dbconfig/20251031-140046-marostegui.json [14:01:30] (03CR) 10Gehel: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1200317 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [14:03:04] (03CR) 10Gehel: [C:04-1] "We should probably keep this file sorted alphabetically." 
[puppet] - 10https://gerrit.wikimedia.org/r/1200289 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [14:04:09] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1200290 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [14:07:44] (03CR) 10Federico Ceratto: [C:03+2] site.pp, es2034.yaml: Decommission es2034 [puppet] - 10https://gerrit.wikimedia.org/r/1200313 (https://phabricator.wikimedia.org/T408414) (owner: 10Federico Ceratto) [14:07:49] (03CR) 10Federico Ceratto: [C:03+2] site.pp, es2033.yaml: Decommission es2033 [puppet] - 10https://gerrit.wikimedia.org/r/1200312 (https://phabricator.wikimedia.org/T408412) (owner: 10Federico Ceratto) [14:07:56] (03CR) 10Federico Ceratto: [C:03+2] site.pp, es2032.yaml: Decommission es2032 [puppet] - 10https://gerrit.wikimedia.org/r/1200310 (https://phabricator.wikimedia.org/T408411) (owner: 10Federico Ceratto) [14:08:20] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) (owner: 10Federico Ceratto) [14:08:22] (03CR) 10Fabfur: [C:03+1] P:cache::varnish::frontend: render known-client rate limit VCL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [14:09:51] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T408893#11331322 (10phaultfinder) [14:11:24] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for vicaplet-wmde - https://phabricator.wikimedia.org/T408920 (10Virginie.caplet) 03NEW [14:13:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T407997)', diff saved to https://phabricator.wikimedia.org/P84560 and previous config saved to /var/cache/conftool/dbconfig/20251031-141309-marostegui.json [14:13:14] T407997: Drop the afl_ip column and 
the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:14:02] (03PS1) 10Clément Goubert: README: pre-commit hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 [14:15:05] (03PS2) 10Clément Goubert: README: pre-commit hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 [14:15:58] (03CR) 10Clément Goubert: [C:03+1] api-gateway: remove mwdebug* hosts from networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200331 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [14:16:15] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission es2033 - https://phabricator.wikimedia.org/T408412#11331348 (10FCeratto-WMF) [14:16:20] (03PS1) 10Brouberol: airflow-platform-eng: define a connection to the spur.us API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200357 (https://phabricator.wikimedia.org/T408238) [14:16:42] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission es2034 - https://phabricator.wikimedia.org/T408414#11331353 (10FCeratto-WMF) [14:16:59] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission es2032 - https://phabricator.wikimedia.org/T408411#11331356 (10FCeratto-WMF) [14:17:55] (03CR) 10Clément Goubert: [C:03+1] site.pp: remove decommed mwdebug hosts [puppet] - 10https://gerrit.wikimedia.org/r/1200332 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [14:17:55] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission es2027 - https://phabricator.wikimedia.org/T408406#11331362 (10FCeratto-WMF) [14:18:00] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission es2032 - https://phabricator.wikimedia.org/T408411#11331363 (10FCeratto-WMF) [14:18:08] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission es2033 - https://phabricator.wikimedia.org/T408412#11331364 (10FCeratto-WMF) [14:18:10] (03PS3) 10Brouberol: Enable normal caching for growthbook.wikimedia.org [puppet] - 
10https://gerrit.wikimedia.org/r/1200289 (https://phabricator.wikimedia.org/T408903) [14:18:11] (03PS3) 10Brouberol: Expose the growthbook service publicly [puppet] - 10https://gerrit.wikimedia.org/r/1200290 (https://phabricator.wikimedia.org/T408903) [14:18:18] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission es2034 - https://phabricator.wikimedia.org/T408414#11331365 (10FCeratto-WMF) [14:18:19] (03CR) 10Brouberol: Enable normal caching for growthbook.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1200289 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [14:18:58] (03PS1) 10Tiziano Fogli: dns: enable nrpe2nodexp wrapper on authdns_update_run check [puppet] - 10https://gerrit.wikimedia.org/r/1200359 (https://phabricator.wikimedia.org/T384425) [14:20:54] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - 10https://gerrit.wikimedia.org/r/1200359 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [14:22:26] (03PS2) 10Brouberol: airflow-platform-eng: define a connection to the spur.us API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200357 (https://phabricator.wikimedia.org/T408238) [14:23:41] (03PS2) 10Brouberol: airflow-platform-eng: allow task pods to egress to the urldownloader hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200354 (https://phabricator.wikimedia.org/T408238) [14:23:41] (03PS3) 10Brouberol: airflow-platform-eng: define a connection to the spur.us API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200357 (https://phabricator.wikimedia.org/T408238) [14:23:44] (03CR) 10Brouberol: airflow-platform-eng: allow task pods to egress to the urldownloader hosts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200354 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [14:25:52] 06SRE, 10SRE-Access-Requests: Requesting access to 
ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11331387 (10elukey) To keep archives happy: the user was added with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197672 [14:28:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P84561 and previous config saved to /var/cache/conftool/dbconfig/20251031-142816-marostegui.json [14:29:21] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T408893#11331395 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm missed this one getting the new alert limits set. fixed. [14:34:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:34:37] (03PS1) 10Tiziano Fogli: lvs: enable nrpe2nodexp wrapper on check_rp_filter_disabled check [puppet] - 10https://gerrit.wikimedia.org/r/1200362 (https://phabricator.wikimedia.org/T407330) [14:40:11] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - 10https://gerrit.wikimedia.org/r/1200362 (https://phabricator.wikimedia.org/T407330) (owner: 10Tiziano Fogli) [14:43:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P84562 and previous config saved to /var/cache/conftool/dbconfig/20251031-144324-marostegui.json [14:51:17] (03CR) 10Brouberol: "I have a small suggestion that should allow you to remove the `command` function and replace it with some `subprocess` builtin." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [14:51:22] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [14:54:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:55:25] (03PS1) 10Tiziano Fogli: netbox: enable nrpe2nodexp wrapper on check_uncommitted_dns_changes check [puppet] - 10https://gerrit.wikimedia.org/r/1200365 (https://phabricator.wikimedia.org/T350694) [14:55:25] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - 10https://gerrit.wikimedia.org/r/1200365 (https://phabricator.wikimedia.org/T350694) (owner: 10Tiziano Fogli) [14:55:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:25] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [14:57:08] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2009.codfw.wmnet with OS trixie [14:57:17] 10ops-codfw, 06SRE, 06DC-Ops: sretest2009 test in nokia rack - https://phabricator.wikimedia.org/T404115#11331472 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest2009.codfw.wmnet with OS trixie executed with errors: - sretest2009 (**FAIL**) - Removed... 
[14:58:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T407997)', diff saved to https://phabricator.wikimedia.org/P84563 and previous config saved to /var/cache/conftool/dbconfig/20251031-145834-marostegui.json [14:58:40] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:58:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance [14:59:20] (03CR) 10Bking: [C:03+1] airflow-platform-eng: define a connection to the spur.us API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200357 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol) [15:00:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:32] (03PS3) 10Clément Goubert: README: pre-commit hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 [15:00:38] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [15:03:20] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2010.codfw.wmnet with OS trixie [15:05:04] (03CR) 10Brouberol: "You need to bump the version in `charts/mediawiki-dumps-legacy/Chart.yaml` as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T40888) (owner: 10Ahoelzl) [15:06:54] (03CR) 10Clément Goubert: README: pre-commit hook (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [15:06:59] (03CR) 10Xcollazo: Adding terms of use for download-index.html (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T40888) (owner: 10Ahoelzl) [15:08:51] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:17] (03CR) 10Herron: [C:03+1] alertmanager: Add dashboard and runbook for Slack alerts [puppet] - 10https://gerrit.wikimedia.org/r/1200124 (https://phabricator.wikimedia.org/T408145) (owner: 10Andrea Denisse) [15:12:29] (03CR) 10Bking: [C:03+2] opensearch-cluster: fix chart typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200171 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [15:13:40] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Apply JVM upgrade to 11.0.29 - eevans@cumin1003 [15:14:48] (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1200289 (https://phabricator.wikimedia.org/T408903) (owner: 10Brouberol) [15:16:41] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [15:17:35] (03PS7) 10Xcollazo: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. 
[puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) [15:18:31] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924 (10ItamarWMDE) 03NEW [15:19:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:22:01] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2010.codfw.wmnet with OS trixie [15:22:25] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host sretest2010.codfw.wmnet with OS trixie [15:27:47] (03CR) 10CDanis: [C:03+1] haproxy: Add python-httpx to ua_library_default ACL [puppet] - 10https://gerrit.wikimedia.org/r/1200334 (owner: 10Vgutierrez) [15:27:57] (03CR) 10Hnowlan: "There is an update to this script that makes it possible to amend commits (which was a major version with this one) https://gist.github.co" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [15:28:25] (03CR) 10Hnowlan: "s/version/issue/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [15:28:38] (03CR) 10Subramanya Sastry: [C:03+1] Allow proofread page to use parsoid when parsoid render is requested [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198537 (https://phabricator.wikimedia.org/T278481) (owner: 10Jgiannelos) [15:29:41] (03CR) 10Clément Goubert: "Ah cool, I couldn't remember if I'd wrote it or found it x)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [15:30:00] (03CR) 10Vgutierrez: [C:03+2] haproxy: Add python-httpx to ua_library_default ACL [puppet] - 
10https://gerrit.wikimedia.org/r/1200334 (owner: 10Vgutierrez) [15:30:26] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11331570 (10elukey) I may have found something interesting: ` >>> pprint(r.request("GET", "/redfish/v1/Systems/1/Oem/Supermicro/FixedBootOrder").json()) {'@odata.etag': '"c9fb8f9... [15:31:29] (03PS1) 10Stevemunene: Revert "airflow: Update the pythonpath" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200368 [15:33:36] (03CR) 10Clément Goubert: "Hmm that would work on linux, but I'm not sure it would work on Windows. People that use it will have to try it out and see." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [15:33:51] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:24] (03Abandoned) 10Stevemunene: Revert "airflow: Update the pythonpath" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200368 (owner: 10Stevemunene) [15:37:41] (03PS1) 10Stevemunene: Revert "Deploy airflow images from airflow-dags repository build" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200373 [15:38:33] (03CR) 10Brouberol: [C:03+1] Revert "Deploy airflow images from airflow-dags repository build" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200373 (owner: 10Stevemunene) [15:41:48] (03CR) 10Stevemunene: [C:03+2] Revert "Deploy airflow images from airflow-dags repository build" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200373 (owner: 10Stevemunene) [15:41:49] (03PS4) 10Clément Goubert: README: pre-commit hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 [15:41:53] (03CR) 10Scott French: P:cache::varnish::frontend: render known-client 
rate limit VCL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [15:44:08] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11331601 (10WMDE-leszek) I approve this request on WMDE's behalf. To my knowledge @ItamarWMDE has fulfilled most of formal requirements as they got access to `analytics-privatedata-users`. (... [15:44:13] (03Merged) 10jenkins-bot: Revert "Deploy airflow images from airflow-dags repository build" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200373 (owner: 10Stevemunene) [15:46:34] (03CR) 10Brouberol: [C:03+1] README: pre-commit hook [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200356 (owner: 10Clément Goubert) [15:47:37] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:48:22] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:50:10] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [15:52:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: host unresponsive for wikikube-worker2203.codfw.wmnet - https://phabricator.wikimedia.org/T408004#11331624 (10Jhancock.wm) finally got dell to ship me the parts. should have time to take care of the replacement monday or tuesday next week. 
[15:53:54] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2010.codfw.wmnet with reason: host reimage [16:01:30] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T408924#11331631 (10ItamarWMDE) [16:08:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:46] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [16:10:26] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [16:11:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:11:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:13:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:22:36] (03PS1) 10Xcollazo: dumps: Add Terms of use to the index page. 
[puppet] - 10https://gerrit.wikimedia.org/r/1200379 (https://phabricator.wikimedia.org/T408881) [16:22:46] (03PS2) 10Scott French: haproxy: inject stub lua.request_check in tests [puppet] - 10https://gerrit.wikimedia.org/r/1200378 [16:23:53] (03CR) 10Ahoelzl: [V:03+1] dumps: Add Terms of use to the index page. [puppet] - 10https://gerrit.wikimedia.org/r/1200379 (https://phabricator.wikimedia.org/T408881) (owner: 10Xcollazo) [16:24:52] (03PS1) 10CDanis: xmldumps web: nginx: add blocked_cidrs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1200380 (https://phabricator.wikimedia.org/T408929) [16:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:26:09] (03CR) 10CDanis: [C:03+2] dumps: Add Terms of use to the index page. 
[puppet] - 10https://gerrit.wikimedia.org/r/1200379 (https://phabricator.wikimedia.org/T408881) (owner: 10Xcollazo) [16:29:48] (03PS2) 10CDanis: xmldumps web: nginx: add blocked_cidrs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1200380 (https://phabricator.wikimedia.org/T408929) [16:30:22] (03PS3) 10CDanis: xmldumps web: nginx: add blocked_cidrs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1200380 (https://phabricator.wikimedia.org/T408929) [16:30:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:34] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1200380 (https://phabricator.wikimedia.org/T408929) (owner: 10CDanis) [16:32:31] (03CR) 10Scott French: [C:03+1] xmldumps web: nginx: add blocked_cidrs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1200380 (https://phabricator.wikimedia.org/T408929) (owner: 10CDanis) [16:32:37] (03PS1) 10Effie Mouzeli: mw-experimental: remove update lock if older than 6hrs [puppet] - 10https://gerrit.wikimedia.org/r/1200381 [16:34:19] (03CR) 10CDanis: [C:03+2] xmldumps web: nginx: add blocked_cidrs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1200380 (https://phabricator.wikimedia.org/T408929) (owner: 10CDanis) [16:39:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:39:29] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
sretest2010.codfw.wmnet with OS trixie [16:40:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:14] (03PS2) 10Xcollazo: Adding terms of use for download-index.html [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T408881) (owner: 10Ahoelzl) [16:48:11] (03CR) 10Xcollazo: [C:03+1] "Fixed patch issues. Should be good to go." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T408881) (owner: 10Ahoelzl) [16:49:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:50:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:59:26] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11331773 (10MatthewVernon) A further complication - some wikis (I've found at least fr and de) add a lang{fr,de,...} prefix to the thumb... 
[17:09:00] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:18:12] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Apply JVM upgrade to 11.0.29 - eevans@cumin1003 [17:18:51] FIRING: [2x] CertAlmostExpired: Certificate for service fasw2-c1a-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:27:32] (03CR) 10Bking: "It doesn't look like this comment has been addressed yet." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T408881) (owner: 10Ahoelzl) [17:29:10] (03PS9) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) [17:30:25] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:37:28] (03CR) 10CI reject: [V:04-1] svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) (owner: 10Hnowlan) [17:38:44] (03CR) 10CDanis: [C:03+1] haproxy: inject stub lua.request_check in tests [puppet] - 10https://gerrit.wikimedia.org/r/1200378 (owner: 10Scott French) [17:40:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:40:29] (03CR) 10Scott French: [C:03+2] haproxy: inject stub lua.request_check in tests [puppet] - 10https://gerrit.wikimedia.org/r/1200378 (owner: 10Scott French) [17:45:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:00] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:52:53] (03CR) 10Xcollazo: "Ah missed that. Fixing..." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T408881) (owner: Ahoelzl) [17:53:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:48] (PS3) Xcollazo: Adding terms of use for download-index.html [deployment-charts] - https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T408881) (owner: Ahoelzl) [17:58:19] (CR) Bking: [C:+2] Add OpenSearch cluster configs for net-new clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: Bking) [18:02:03] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): sessionstorage namespacing - https://phabricator.wikimedia.org/T392170#11331931 (Eevans) p:Triage→Medium [18:12:39] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): Provision anonymous session storage - https://phabricator.wikimedia.org/T408935 (Eevans) NEW [18:13:16] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): Provision anonymous session storage - https://phabricator.wikimedia.org/T408935#11331965 (Eevans) [18:18:39] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): Provision anonymous session storage - https://phabricator.wikimedia.org/T408935#11331986 (Eevans) [18:19:03] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): Provision anonymous session storage - https://phabricator.wikimedia.org/T408935#11331987 (Eevans) p:Triage→Medium [18:22:10] SRE-OnFire,
Cassandra, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): Provision anonymous session storage - https://phabricator.wikimedia.org/T408935#11331988 (Eevans) @Tgr what portion of the overall workload is anon? Is there a dashboard for this? [18:24:56] (CR) Bking: [C:+2] "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/1200192 (https://phabricator.wikimedia.org/T408881) (owner: Ahoelzl) [18:27:23] (PS1) ZhaoFJx: zhwiki: Add SecurePoll Rights to CheckUser [mediawiki-config] - https://gerrit.wikimedia.org/r/1200400 (https://phabricator.wikimedia.org/T408902) [18:28:20] SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): sessionstorage namespacing - https://phabricator.wikimedia.org/T392170#11331996 (Eevans) @Tgr at this point, is there any obstacle and/or objections to separating storage of central auth sessions? Using a... [18:29:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1200400 (https://phabricator.wikimedia.org/T408902) (owner: ZhaoFJx) [18:31:28] SRE, SRE-swift-storage, Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11332007 (LSobanski) To avoid confusion I believe the above statement should say "now available" instead of "not available" and link to {T408776} [18:50:49] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission es2034 - https://phabricator.wikimedia.org/T408414#11332051 (Jhancock.wm) Open→Resolved [18:51:15] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission es2033 - https://phabricator.wikimedia.org/T408412#11332056 (Jhancock.wm) [18:51:26] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission es2033 -
https://phabricator.wikimedia.org/T408412#11332057 (Jhancock.wm) Open→Resolved a:Jhancock.wm [18:51:56] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission es2032 - https://phabricator.wikimedia.org/T408411#11332059 (Jhancock.wm) Open→Resolved [18:52:18] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission es2027 - https://phabricator.wikimedia.org/T408406#11332061 (Jhancock.wm) Open→Resolved [18:59:28] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission es2026 - https://phabricator.wikimedia.org/T408385#11332083 (Jhancock.wm) a:Jhancock.wm [19:00:14] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission es2034 - https://phabricator.wikimedia.org/T408414#11332087 (Jhancock.wm) a:Jhancock.wm [19:00:44] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission es2032 - https://phabricator.wikimedia.org/T408411#11332088 (Jhancock.wm) a:Jhancock.wm [19:01:14] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission es2027 - https://phabricator.wikimedia.org/T408406#11332089 (Jhancock.wm) a:Jhancock.wm [19:07:51] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406272#11332095 (Jhancock.wm) a:VRiley-WMF [19:12:13] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:12:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [19:20:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes
(http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:31:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:34:00] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:35:56] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:36:04] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:36:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:36:28] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:38:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:40:41] (CR) Superpes15: zhwiki: Add SecurePoll Rights to CheckUser (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1200400 (https://phabricator.wikimedia.org/T408902) (owner: ZhaoFJx) [19:42:03] !log
andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1001-dev.eqiad.wmnet with OS trixie [19:52:49] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [19:59:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [20:06:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:08:14] (CR) Reedy: CommonSettings: Remove some OATHAuth config overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1198180 (https://phabricator.wikimedia.org/T404806) (owner: Reedy) [20:19:32] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:32:26] (PS2) Cparle: Enable pagination on Special:EditWatchlist everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1200105 (https://phabricator.wikimedia.org/T41510) [20:32:48] (CR) Cparle: Enable pagination on Special:EditWatchlist everywhere (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1200105 (https://phabricator.wikimedia.org/T41510) (owner: Cparle) [20:57:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:07:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:00] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:16:44] SRE-SLO, Experimentation Lab (Experiment Platform Sprint 14), OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11332520 (dr0ptp4kt) I've been using this last couple days. Looking forward to turning on alerting the week after next. I'll close this task and file a separa...
[21:19:00] FIRING: [2x] CertAlmostExpired: Certificate for service fasw2-c1a-eqiad.mgmt.eqiad.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:20:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:33:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:36:12] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11332556 (AntiCompositeNumber) The regex in {https://phabricator.wikimedia.org/diffusion/THMBREXT/browse/master/wikimedia_thumbor/handl...
[21:43:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:46:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:49:00] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:52:08] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [21:53:58] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30032 bytes in 0.440 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:04:08] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [22:05:00] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 2.602 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [22:09:35] SRE-SLO, Experimentation Lab (Experiment Platform Sprint 14), OKR-Work: Create Pyrra SLOs for xLab -
https://phabricator.wikimedia.org/T398869#11332592 (dr0ptp4kt) Open→Resolved [22:14:29] SRE-SLO, Experimentation Lab (Experiment Platform Sprint 14), OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11332611 (dr0ptp4kt) Resolved→Open Unresolving so it stays in the Done column for any sprint close-out activities next week. [22:38:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:48:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:58:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:57] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:12:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:13:57] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:18:57] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:22:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:34:00] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable