[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T0000) [00:05:46] RoanKattouw: alas, nothing is jumping out in the k8s event logs. I would just go ahead and retry. [00:06:31] the image has already been built and pushed, and since it rolled nearly completely out, it's now cached on the worker nodes [00:07:00] which is to say this _should_ be quick [00:10:59] swfrench-wmf: OK I'll kick off a retry [00:11:51] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1203097|i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery (T399749)]], [[gerrit:1203126|OATHManage: Don't always set the page title to "Create new recovery codes"]], [[gerrit:1203535|OATHAuth: Increase 2FA opt-in to 70% of users (T399664)]] [00:11:56] T399749: Link to Zendesk form from EmailAuth failure message - https://phabricator.wikimedia.org/T399749 [00:11:56] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [00:12:14] (PS1) Superpes15: [arwikibooks] Add an alias for project namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1203566 (https://phabricator.wikimedia.org/T40978) [00:12:28] (PS1) Dzahn: codesearch: add logrotate snippet for /var/log/account/ [puppet] - https://gerrit.wikimedia.org/r/1203567 (https://phabricator.wikimedia.org/T408234) [00:13:57] (CR) Dzahn: [C:+2] codesearch: add logrotate snippet for /var/log/account/ [puppet] - https://gerrit.wikimedia.org/r/1203567 (https://phabricator.wikimedia.org/T408234) (owner: Dzahn) [00:13:59] (PS2) Superpes15: [arwikibooks] Add an alias for project namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1203566 (https://phabricator.wikimedia.org/T409789) [00:14:32] (CR) CI reject: [V:-1] codesearch: add logrotate snippet for /var/log/account/ [puppet] - https://gerrit.wikimedia.org/r/1203567 (https://phabricator.wikimedia.org/T408234) (owner: 
Dzahn) [00:15:28] !log catrope@deploy2002 catrope, mstyles: Backport for [[gerrit:1203097|i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery (T399749)]], [[gerrit:1203126|OATHManage: Don't always set the page title to "Create new recovery codes"]], [[gerrit:1203535|OATHAuth: Increase 2FA opt-in to 70% of users (T399664)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:15:49] !log catrope@deploy2002 catrope, mstyles: Continuing with sync [00:18:12] (PS2) Dzahn: codesearch: add logrotate snippet for /var/log/account/ [puppet] - https://gerrit.wikimedia.org/r/1203567 (https://phabricator.wikimedia.org/T408234) [00:19:11] RESOLVED: [2x] PfwCoreBGPDown: Fundraising Firewall core BGP session down between pfw1-codfw and (null) (10.195.0.248) - group VPN - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [00:19:16] RESOLVED: [2x] PfwCoreBGPDown: Fundraising Firewall core BGP session down between pfw1-codfw and (null) (10.195.0.248) - group VPN - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DPfwCoreBGPDown [00:20:52] (PS3) Dzahn: codesearch: add logrotate snippet for /var/log/account/ [puppet] - https://gerrit.wikimedia.org/r/1203567 (https://phabricator.wikimedia.org/T408234) [00:21:43] (PS4) Dzahn: codesearch: add logrotate snippet for /var/log/account/ [puppet] - https://gerrit.wikimedia.org/r/1203567 (https://phabricator.wikimedia.org/T408234) [00:22:47] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203097|i18n: Update wikimedia-emailauth-login-help to link to Special:AccountRecovery (T399749)]], [[gerrit:1203126|OATHManage: Don't always set the page title to "Create new recovery codes"]], [[gerrit:1203535|OATHAuth: Increase 2FA opt-in to 70% of users (T399664)]] (duration: 
10m 56s) [00:22:52] T399749: Link to Zendesk form from EmailAuth failure message - https://phabricator.wikimedia.org/T399749 [00:22:53] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [00:22:55] \o/ [00:23:48] (CR) CI reject: [V:-1] codesearch: add logrotate snippet for /var/log/account/ [puppet] - https://gerrit.wikimedia.org/r/1203567 (https://phabricator.wikimedia.org/T408234) (owner: Dzahn) [00:23:50] I wish I had a satisfying explanation for what was going on with that mw-wikifunctions/group1 update ... [00:25:44] (PS5) Dzahn: codesearch: add logrotate snippet for /var/log/account/ [puppet] - https://gerrit.wikimedia.org/r/1203567 (https://phabricator.wikimedia.org/T408234) [00:30:41] Well I'm just glad the deployment is done [00:30:42] (CR) Dzahn: [C:+2] codesearch: add logrotate snippet for /var/log/account/ [puppet] - https://gerrit.wikimedia.org/r/1203567 (https://phabricator.wikimedia.org/T408234) (owner: Dzahn) [00:31:22] I wish I was done though, I have a security patch which I'll put through next, starting in about 5 minutes [00:37:50] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1203570 [00:37:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1203570 (owner: TrainBranchBot) [00:52:25] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1203570 (owner: TrainBranchBot) [00:57:54] !log catrope Deployed security patch for T409743 [01:00:39] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:01:52] (PS1) Scott French: hiera: temporarily point codfw LVS at conf2006 [puppet] - https://gerrit.wikimedia.org/r/1203556 (https://phabricator.wikimedia.org/T352245) [01:01:54] (PS1) Scott French: 
hiera: switch codfw etcd-main cluster to cfssl/pki [puppet] - https://gerrit.wikimedia.org/r/1203557 (https://phabricator.wikimedia.org/T352245) [01:01:57] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 18s) [01:04:57] (CR) Scott French: "Thanks for taking a look at [0] @ssingh@wikimedia.org. If I could ask you to review this as well, that would be greatly appreciated." [puppet] - https://gerrit.wikimedia.org/r/1203556 (https://phabricator.wikimedia.org/T352245) (owner: Scott French) [01:06:08] (CR) Scott French: "If you have time review this before Wednesday, that would be greatly appreciated." [puppet] - https://gerrit.wikimedia.org/r/1203557 (https://phabricator.wikimedia.org/T352245) (owner: Scott French) [01:07:51] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1203576 [01:07:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1203576 (owner: TrainBranchBot) [01:09:06] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:13:56] (PS1) Scott French: mw-(api-ext|web): return capacity from migration to main [deployment-charts] - https://gerrit.wikimedia.org/r/1203571 (https://phabricator.wikimedia.org/T405955) [01:18:38] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [01:19:12] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [01:28:30] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1203576 (owner: TrainBranchBot) [01:33:22] FIRING: [5x] 
SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:40:21] (PS1) Scott French: deployment_server: migrate mediawiki-dumps-legacy to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1203578 (https://phabricator.wikimedia.org/T405955) [01:48:22] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:07:53] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.2 [core] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1203580 (https://phabricator.wikimedia.org/T408272) [02:07:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.2 [core] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1203580 (https://phabricator.wikimedia.org/T408272) (owner: TrainBranchBot) [02:11:35] andrew@cumin2002 reimage (PID 2211005) is awaiting input [02:18:23] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.2 [core] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1203580 (https://phabricator.wikimedia.org/T408272) (owner: TrainBranchBot) [02:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T0300) [03:04:01] PROBLEM - Host db1230 #page is DOWN: PING CRITICAL - 
Packet loss = 100% [03:04:02] PROBLEM - Host db1242 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:02] PROBLEM - Host db1252 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:09] PROBLEM - Host db1169 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:11] PROBLEM - Host db1219 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:12] PROBLEM - Host db1167 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:16] PROBLEM - Host db1189 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:19] PROBLEM - Host db1166 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:25] PROBLEM - Host cirrussearch1118 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:31] PROBLEM - Host db1168 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:32] PROBLEM - Host db1220 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:36] PROBLEM - Host es1057 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:39] PROBLEM - Host mc1047 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:39] PROBLEM - Host mc-gp1005 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:39] PROBLEM - Host ml-serve1003 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:39] PROBLEM - Host ml-cache1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:40] PROBLEM - Host ms-fe1011 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:41] PROBLEM - Host ms-backup1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:44] PROBLEM - Host pc1016 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:04:45] PROBLEM - Host poolcounter1007 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:45] PROBLEM - Host ping1004 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:45] PROBLEM - Host clouddb1017 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:45] PROBLEM - Host cloudelastic1009 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:46] PROBLEM - Host dse-k8s-worker1003 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:46] PROBLEM - Host cp1111 is DOWN: PING CRITICAL - Packet loss 
= 100% [03:04:47] PROBLEM - Host ganeti1028 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:47] PROBLEM - Host sretest1006 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:48] PROBLEM - Host maps1013 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:48] PROBLEM - Host ldap-maint1001 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:49] PROBLEM - Host rpki1001 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:49] PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:50] PROBLEM - Host mwlog1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:50] PROBLEM - Host kafkamon1003 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:51] PROBLEM - Host tcp-proxy1001 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:51] PROBLEM - Host ms-be1082 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:52] PROBLEM - Host moss-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:52] PROBLEM - Host wikikube-worker1263 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:53] PROBLEM - Host wikikube-worker1055 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:53] PROBLEM - Host wikikube-worker1154 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:53] oh boy. 
[03:04:54] PROBLEM - Host wikikube-worker1135 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:54] PROBLEM - Host wikikube-worker1063 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:55] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:55] PROBLEM - Host wikikube-worker1265 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:56] PROBLEM - Host wikikube-ctrl1003 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:56] PROBLEM - Host wikikube-worker1136 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:57] PROBLEM - Host wikikube-worker1054 is DOWN: PING CRITICAL - Packet loss = 100% [03:04:57] PROBLEM - Host webperf1003 is DOWN: PING CRITICAL - Packet loss = 100% [03:05:02] Oh [03:05:49] PROBLEM - Host wikikube-worker1155 is DOWN: PING CRITICAL - Packet loss = 100% [03:05:49] PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [03:05:59] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:06:00] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:06:00] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:06:00] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:06:03] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:06:03] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:06:03] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [03:06:05] PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [03:06:05] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [03:06:23] PROBLEM - MariaDB Replica IO: s3 #page on db1212 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1189.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1189.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:06:35] PROBLEM - MariaDB Replica IO: s3 #page on db1157 is CRITICAL: CRITICAL 
slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1189.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1189.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:06:36] PROBLEM - MariaDB Replica IO: s7 #page on db1158 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1181.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1181.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:06:36] PROBLEM - MariaDB Replica IO: matomo on db1208 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@matomo1003.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on matomo1003.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:06:55] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 1224 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 570, active_shards: 643, relocating_shards: 0, initializing_shards: 16, unassigned_shards: 1208, delayed_unassigned_shards: [03:06:55] r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 34.440278521692555 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:06:55] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 1015 threshold =0.15 breach: cluster_name: 
cloudelastic-psi-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 488, active_shards: 600, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 1011, delayed_unassigned_shards: 0 [03:06:55] _of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 146, active_shards_percent_as_number: 37.15170278637771 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:07:02] FIRING: [6x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:07:03] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:07:03] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [03:07:05] PROBLEM - MariaDB Replica IO: s3 on db1240 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1189.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1189.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:07:05] PROBLEM - MariaDB Replica IO: s7 #page on db1227 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1181.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1181.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:07:06] PROBLEM - MariaDB Replica IO: s7 #page on db1236 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1181.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1181.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:07:11] FIRING: [2x] ProbeDown: Service eventgate-main:4492 has failed probes (http_eventgate-main_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:07:25] FIRING: ProbeDown: Service people1005:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1005:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:07:36] PROBLEM - MariaDB Replica IO: s8 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: 
No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1167.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1167.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:07:53] FIRING: SLOMetricAbsent: charts-client-side-availability-v1 - https://slo.wikimedia.org/?search=charts-client-side-availability-v1 - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:08:22] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:08:22] FIRING: [38x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:08:45] FIRING: CirrusStreamingUpdaterUnknownErrors: CirrusSearch consumer-cloudelastic@eqiad is failing write requests because of unknown errors - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterUnknownErrors [03:09:06] FIRING: [43x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:09:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.23% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:09:15] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-debug releases routed via 
pinkunicorn (k8s) 1.875s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:09:24] FIRING: SLOMetricAbsent: edit-check-pre-save-checks-ratio - https://slo.wikimedia.org/?search=edit-check-pre-save-checks-ratio - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:09:51] FIRING: [7x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:10:25] FIRING: SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:11:57] FIRING: [14x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:11:57] FIRING: [7x] ProbeDown: Service wikikube-ctrl1003:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:12:23] RECOVERY - MariaDB Replica IO: s3 #page on db1212 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:12:25] RECOVERY - Host wikikube-worker1269 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [03:12:25] RECOVERY - Host sessionstore1005 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [03:12:25] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [03:12:25] RECOVERY - Host ncredir1001 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [03:12:25] RECOVERY - Host kafka-main1008 is UP: 
PING OK - Packet loss = 0%, RTA = 0.28 ms [03:12:25] RECOVERY - Host ganeti1045 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [03:12:25] RECOVERY - Host clouddb1018 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [03:12:27] RECOVERY - Host clouddb1017 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [03:12:28] RECOVERY - Host db1169 #page is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [03:12:29] RECOVERY - Host db1218 #page is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [03:12:29] RECOVERY - Host db1242 #page is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [03:12:30] RECOVERY - Host db1219 #page is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [03:12:30] RECOVERY - Host wdqs1014 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [03:12:31] RECOVERY - Host db1252 #page is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [03:12:32] RECOVERY - Host db1220 #page is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [03:12:32] RECOVERY - Host es1057 #page is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [03:13:38] RECOVERY - MariaDB Replica IO: s8 on db1154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:13:54] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1587, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 28, delayed_unassigned_shards: 0, number_of_pending_tas [03:13:54] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.14471243042672 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:13:54] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: 
cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 916, active_shards: 1682, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 175, delayed_unassigned_shards: 0, number_of_pending_task [03:13:54] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 322, active_shards_percent_as_number: 89.94652406417111 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:13:54] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1587, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 28, delayed_unassigned_shards: 0, number_of_pending_tas [03:13:54] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.14471243042672 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:13:56] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 916, active_shards: 1682, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 175, delayed_unassigned_shards: 0, number_of_pending_task [03:13:56] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 362, active_shards_percent_as_number: 89.94652406417111 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:13:56] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, 
number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1591, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 25, delayed_unassigned_shards: 0, number_of_pending_tas [03:13:56] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.39208410636982 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:13:57] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 808, active_shards: 1591, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 25, delayed_unassigned_shards: 0, number_of_pending_tas [03:13:57] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.39208410636982 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:13:58] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 922, active_shards: 1693, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 175, delayed_unassigned_shards: 0, number_of_pending_tasks [03:13:58] ber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 414, active_shards_percent_as_number: 90.53475935828878 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:13:59] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, 
active_primary_shards: 808, active_shards: 1591, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 25, delayed_unassigned_shards: 0, number_of_pending_tas [03:14:00] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.39208410636982 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:14:00] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 922, active_shards: 1693, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 175, delayed_unassigned_shards: 0, number_of_pending_tasks [03:14:00] ber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 430, active_shards_percent_as_number: 90.53475935828878 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:14:01] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: red, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, discovered_master: True, active_primary_shards: 922, active_shards: 1693, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 175, delayed_unassigned_shards: 0, number_of_pending_tasks [03:14:02] ber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 476, active_shards_percent_as_number: 90.53475935828878 https://wikitech.wikimedia.org/wiki/Search%23Administration [03:14:38] RECOVERY - MariaDB Replica IO: matomo on db1208 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:14:48] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [03:15:06] 
PROBLEM - haproxy failover on dbproxy1029 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [03:16:57] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver2001 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:17:56] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1003 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:19:31] FIRING: [44x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:19:35] FIRING: [122x] KubernetesCalicoDown: aux-k8s-worker1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:20:41] FIRING: [43x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:20:45] FIRING: [126x] KubernetesCalicoDown: aux-k8s-ctrl1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:20:56] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.77% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:21:00] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-debug releases routed via pinkunicorn (k8s) 1.938s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded 
- https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:21:15] RESOLVED: SLOMetricAbsent: edit-check-pre-save-checks-ratio - https://slo.wikimedia.org/?search=edit-check-pre-save-checks-ratio - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:21:43] RESOLVED: [10x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:23:10] FIRING: [14x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:23:21] RESOLVED: [7x] ProbeDown: Service wikikube-ctrl1003:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:23:34] RESOLVED: ProbeDown: Service people1005:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1005:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:24:21] RESOLVED: SLOMetricAbsent: charts-client-side-availability-v1 - https://slo.wikimedia.org/?search=charts-client-side-availability-v1 - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:30:58] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:31:14] RESOLVED: [41x] ProbeDown: Service aqs1018-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:31:18] RESOLVED: [149x] KubernetesCalicoDown: aux-k8s-ctrl1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:32:06] RESOLVED: CirrusStreamingUpdaterUnknownErrors: CirrusSearch consumer-cloudelastic@eqiad is failing write requests because of unknown errors - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterUnknownErrors [03:32:32] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1135:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:32:36] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:32:50] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:33:09] FIRING: [16x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:34:46] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1135:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:34:50] RESOLVED: 
[2x] KubernetesRsyslogDown: rsyslog on dse-k8s-worker1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:34:55] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:03] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:35:08] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:39] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:36:56] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver2001 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:37:56] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:38:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:40:25] RESOLVED: [6x] SystemdUnitFailed: netbox_ganeti_eqiad_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:41:45] (03PS3) 10Aaron Schulz: Route /page/lint(.*) to the gateway on group1 [puppet] - 10https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) [03:50:08] (03PS4) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1194996 (https://phabricator.wikimedia.org/T385066) [03:58:49] (03PS4) 10Aaron Schulz: Route /page/lint(.*) to the gateway on group1 [puppet] - 10https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T0400) [04:01:57] FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:02:03] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203592 (https://phabricator.wikimedia.org/T408272) [04:02:05] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203592 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [04:02:54] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203592 (https://phabricator.wikimedia.org/T408272) 
(owner: 10TrainBranchBot) [04:03:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:03:26] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.2 refs T408272 [04:03:30] T408272: 1.46.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T408272 [04:05:52] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-presto1013 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [04:08:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:18:22] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:28:22] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:28:34] FIRING: DiskSpace: Disk space serpens:9100:/ 3.503% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - 
https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [04:33:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:49:53] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.2 refs T408272 (duration: 46m 27s) [04:49:57] T408272: 1.46.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T408272 [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T0500) [05:02:33] !log mwpresync@deploy2002 Pruned MediaWiki: 1.45.0-wmf.24 (duration: 02m 30s) [05:08:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:08:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:06] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:11:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 
minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:13:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:16:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:23:46] 06SRE, 06Infrastructure-Foundations, 10netops: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800 (10cmooney) 03NEW p:05Triage→03High [05:26:57] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:32:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:33:22] RESOLVED: [2x] 
JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:37:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:45:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:50:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:51:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:53:02] 06SRE, 06Infrastructure-Foundations, 10netops: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11361879 (10cmooney) [05:55:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:58:32] 06SRE, 06Infrastructure-Foundations, 10netops: Row C traffic outage Nov 11 2025 - https://phabricator.wikimedia.org/T409800#11361886 (10cmooney) [05:58:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:01:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:08:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:09:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:14:08] (03CR) 10KartikMistry: [C:03+2] apertium: staging: Update to 2025-11-10-034557-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203296 (https://phabricator.wikimedia.org/T408515) (owner: 10KartikMistry) [06:14:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:15:53] (03Merged) 10jenkins-bot: apertium: staging: Update to 2025-11-10-034557-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203296 (https://phabricator.wikimedia.org/T408515) (owner: 10KartikMistry) [06:17:09] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11361889 (10Marostegui) >>! 
In T409374#11359455, @Jclark-ctr wrote: > @marostegui I did finally get confirmation on tracking on replacement memory It should be onsite by e... [06:20:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:21:01] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance [06:23:28] Deploying Apertium, staging only. [06:24:29] !log Deploy schema change on x1 codfw master with replication T409101 [06:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:32] T409101: Apply ce_address cleanup schema changes in production (x1) - https://phabricator.wikimedia.org/T409101 [06:25:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:29:34] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [06:29:59] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [06:30:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:31:14] !log apertium: staging: Update to 2025-11-10-034557-production (T408515) [06:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:17] T408515: Update Apertium service to Trixie - https://phabricator.wikimedia.org/T408515 [06:31:57] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:32:56] (03PS4) 10KartikMistry: machinetranslation: Increase replica and memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) [06:33:47] !log Deploy schema change on x1 codfw master with replication T409733 [06:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:51] T409733: Add the sis_trigger_id and sis_trigger_type columns to the cusi_signal table on WMF wikis - https://phabricator.wikimedia.org/T409733 [06:35:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:37:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:41:51] (03PS1) 10Marostegui: clouddb1026-clouddb1033: New hosts [puppet] - 10https://gerrit.wikimedia.org/r/1203598 (https://phabricator.wikimedia.org/T409557) [06:42:16] (03CR) 10CI reject: [V:04-1] clouddb1026-clouddb1033: New hosts [puppet] - 10https://gerrit.wikimedia.org/r/1203598 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [06:42:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [06:42:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2147 (T407997)', diff saved to https://phabricator.wikimedia.org/P85131 and previous config saved to 
/var/cache/conftool/dbconfig/20251111-064257-marostegui.json [06:43:01] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [06:44:27] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1203598 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [06:45:41] (03CR) 10Marostegui: [C:03+2] clouddb1026-clouddb1033: New hosts [puppet] - 10https://gerrit.wikimedia.org/r/1203598 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [06:47:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:48:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:51:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T407997)', diff saved to https://phabricator.wikimedia.org/P85132 and previous config saved to /var/cache/conftool/dbconfig/20251111-065112-marostegui.json [06:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:51:17] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [06:51:27] (03PS1) 10Marostegui: mariadb: Productionize db1264 [puppet] - 10https://gerrit.wikimedia.org/r/1203600 (https://phabricator.wikimedia.org/T407941) [06:52:26] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db1264 
[puppet] - 10https://gerrit.wikimedia.org/r/1203600 (https://phabricator.wikimedia.org/T407941) (owner: 10Marostegui) [06:53:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Cloning another host [06:55:46] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1220.eqiad.wmnet with reason: Cloning another host [06:55:59] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db1220.eqiad.wmnet onto db1264.eqiad.wmnet [06:56:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1220 - Depool db1220.eqiad.wmnet to then clone it to db1264.eqiad.wmnet - marostegui@cumin1003 [06:56:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1220 - Depool db1220.eqiad.wmnet to then clone it to db1264.eqiad.wmnet - marostegui@cumin1003 [06:57:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:59:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T0700) [07:00:05] marostegui, Amir1, and federico3: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T0700). 
[07:02:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:03:34] RESOLVED: DiskSpace: Disk space serpens:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [07:06:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P85134 and previous config saved to /var/cache/conftool/dbconfig/20251111-070620-marostegui.json [07:06:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:11:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:13:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:15:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [07:16:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[07:21:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P85135 and previous config saved to /var/cache/conftool/dbconfig/20251111-072127-marostegui.json [07:23:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:33:59] (03PS1) 10Muehlenhoff: partman: db-trixie.cfg: Flip value for additional test [puppet] - 10https://gerrit.wikimedia.org/r/1203603 [07:35:36] (03CR) 10Muehlenhoff: [C:03+2] partman: db-trixie.cfg: Flip value for additional test [puppet] - 10https://gerrit.wikimedia.org/r/1203603 (owner: 10Muehlenhoff) [07:36:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T407997)', diff saved to https://phabricator.wikimedia.org/P85136 and previous config saved to /var/cache/conftool/dbconfig/20251111-073635-marostegui.json [07:36:39] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:36:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [07:36:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2155 (T407997)', diff saved to https://phabricator.wikimedia.org/P85137 and previous config saved to /var/cache/conftool/dbconfig/20251111-073659-marostegui.json [07:37:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:38:29] (03PS1) 10KartikMistry: Apertium: Update to 2025-11-10-034557-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1203644 (https://phabricator.wikimedia.org/T408515) [07:42:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:44:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T407997)', diff saved to https://phabricator.wikimedia.org/P85138 and previous config saved to /var/cache/conftool/dbconfig/20251111-074404-marostegui.json [07:44:08] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:44:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:52:05] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [07:52:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [07:59:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P85139 and previous config saved to /var/cache/conftool/dbconfig/20251111-075911-marostegui.json [08:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T0800) [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:04:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:08:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [08:09:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:13:43] !log installing intel-microcode security updates [08:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [08:14:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P85140 and previous config saved to /var/cache/conftool/dbconfig/20251111-081419-marostegui.json [08:14:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:15:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:19:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:21:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:26:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:29:06] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:29:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T407997)', diff saved to https://phabricator.wikimedia.org/P85141 and previous config saved to /var/cache/conftool/dbconfig/20251111-082927-marostegui.json [08:29:31] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:29:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [08:29:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2172 (T407997)', diff saved to https://phabricator.wikimedia.org/P85142 and previous config saved to /var/cache/conftool/dbconfig/20251111-082950-marostegui.json [08:30:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:31:57] FIRING: [4x] 
ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:35:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:36:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T407997)', diff saved to https://phabricator.wikimedia.org/P85143 and previous config saved to /var/cache/conftool/dbconfig/20251111-083657-marostegui.json [08:37:01] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:37:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS trixie [08:41:17] (03CR) 10Dpogorzelski: [C:03+1] containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [08:46:27] !log ryankemper@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin1002 - T390860 [08:46:31] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [08:52:03] (03PS1) 10Slyngshede: data.yaml: retire non-FIDO2 ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203759 [08:52:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P85144 and previous config saved to 
/var/cache/conftool/dbconfig/20251111-085204-marostegui.json [08:55:07] (03PS1) 10Majavah: hieradata: Enable jumbo frames on cloudvirt1062 [puppet] - 10https://gerrit.wikimedia.org/r/1203760 (https://phabricator.wikimedia.org/T330075) [08:58:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:58:17] (03CR) 10Filippo Giunchedi: [C:03+1] hieradata: Enable jumbo frames on cloudvirt1062 [puppet] - 10https://gerrit.wikimedia.org/r/1203760 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [08:58:56] (03CR) 10Majavah: [C:03+2] hieradata: Enable jumbo frames on cloudvirt1062 [puppet] - 10https://gerrit.wikimedia.org/r/1203760 (https://phabricator.wikimedia.org/T330075) (owner: 10Majavah) [09:00:05] andre and jeena: MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T0900). Please do the needful. [09:01:21] (03CR) 10Elukey: "sigh ./cookbooks/sre/elasticsearch/ban.py uses Elasticsearch from the dependency, we should remove it." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff) [09:03:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:07:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P85145 and previous config saved to /var/cache/conftool/dbconfig/20251111-090712-marostegui.json [09:08:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:12:06] (03CR) 10Itamar Givon: [C:03+1] Report integrity metric from wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [09:18:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11362108 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2091.codfw.wmnet with OS bullseye execute... 
[09:20:08] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki: Update location of startupregistrystats script [puppet] - 10https://gerrit.wikimedia.org/r/1202872 (https://phabricator.wikimedia.org/T409212) (owner: 10Zabe) [09:21:13] (03CR) 10Daniel Kinzler: [C:03+1] "Let's go ahead with this, we can change the descriptor name later" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [09:21:54] (03PS3) 10Daniel Kinzler: Note that per-route rate limits require Envoy 1.33 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200090 [09:22:08] (03CR) 10Clément Goubert: [C:03+2] api-geteway: rename symbols used in restgw ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [09:23:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:23:28] (03CR) 10CI reject: [V:04-1] Note that per-route rate limits require Envoy 1.33 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200090 (owner: 10Daniel Kinzler) [09:23:59] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203761 (https://phabricator.wikimedia.org/T408272) [09:24:00] (03Merged) 10jenkins-bot: api-geteway: rename symbols used in restgw ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201736 (https://phabricator.wikimedia.org/T409155) (owner: 10Pmiazga) [09:24:01] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203761 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [09:24:14] 
10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11362128 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2093.codfw.wmnet with OS bullseye execute... [09:24:56] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203761 (https://phabricator.wikimedia.org/T408272) (owner: 10TrainBranchBot) [09:31:02] (03CR) 10Muehlenhoff: "Yes, I already reopened https://phabricator.wikimedia.org/T390860#11359914 yesterday. I wasn't sure whether it's obsolete and to be remove" [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff) [09:31:25] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:31:46] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:32:49] (03Abandoned) 10Elukey: admin_ng: allow Istio gateways to customize histogram buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143584 (https://phabricator.wikimedia.org/T392886) (owner: 10Elukey) [09:33:23] (03CR) 10Daniel Kinzler: Note that per-route rate limits require Envoy 1.33 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200090 (owner: 10Daniel Kinzler) [09:34:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T407997)', diff saved to https://phabricator.wikimedia.org/P85148 and previous config saved to /var/cache/conftool/dbconfig/20251111-093410-marostegui.json [09:34:15] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:35:36] (03PS6) 10Pmiazga: api-gateway: Make x-ratelimit response header configurable. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) [09:35:53] (03CR) 10Daniel Kinzler: api-gateway: Make x-ratelimit response header configurable. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga) [09:36:47] (03PS7) 10Clément Goubert: api-gateway: Make x-ratelimit response header configurable. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga) [09:36:54] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.2 refs T408272 [09:36:58] T408272: 1.46.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T408272 [09:38:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1203759 (owner: 10Slyngshede) [09:41:14] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Make x-ratelimit response header configurable. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga) [09:43:10] (03Merged) 10jenkins-bot: api-gateway: Make x-ratelimit response header configurable. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1201729 (https://phabricator.wikimedia.org/T408839) (owner: 10Pmiazga) [09:43:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [09:44:20] (03PS1) 10Muehlenhoff: partman: Test Partman recipe for DB reuse on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203770 (https://phabricator.wikimedia.org/T408777) [09:44:56] (03CR) 10CI reject: [V:04-1] partman: Test Partman recipe for DB reuse on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203770 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [09:45:26] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:45:37] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:45:43] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:45:51] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:46:21] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:46:32] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:47:15] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [09:47:35] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [09:47:44] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [09:48:18] (03PS2) 10Muehlenhoff: partman: Test Partman recipe for DB reuse on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203770 (https://phabricator.wikimedia.org/T408777) [09:48:20] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: 
apply [09:49:02] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [09:49:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P85149 and previous config saved to /var/cache/conftool/dbconfig/20251111-094918-marostegui.json [09:49:47] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [09:50:01] (03CR) 10Muehlenhoff: [C:03+2] partman: Test Partman recipe for DB reuse on trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203770 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [09:54:16] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [09:54:24] (03CR) 10CI reject: [V:04-1] rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [09:54:33] (03PS4) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) [09:55:13] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [09:58:13] (03PS1) 10Muehlenhoff: Test Partman workaround to also cover the reuse workflow [puppet] - 10https://gerrit.wikimedia.org/r/1203778 (https://phabricator.wikimedia.org/T408777) [10:01:05] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2090.codfw.wmnet with OS bullseye [10:01:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 
06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11362275 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2090.codfw.wmnet with OS bullseye execute... [10:02:58] (03CR) 10Marostegui: [C:03+1] "You can erase es2028 as needed." [puppet] - 10https://gerrit.wikimedia.org/r/1203778 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [10:03:11] (03CR) 10Marostegui: [C:03+1] "Or non erase, anything you need to do can be done with that host." [puppet] - 10https://gerrit.wikimedia.org/r/1203778 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [10:04:19] (03PS5) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) [10:04:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P85150 and previous config saved to /var/cache/conftool/dbconfig/20251111-100425-marostegui.json [10:10:16] (03PS6) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) [10:13:10] FIRING: [2x] BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:16:52] (03PS1) 10Filippo Giunchedi: profile: fix etcd::v3 firewall [puppet] - 10https://gerrit.wikimedia.org/r/1203782 [10:18:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:19:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2206 
(T407997)', diff saved to https://phabricator.wikimedia.org/P85152 and previous config saved to /var/cache/conftool/dbconfig/20251111-101933-marostegui.json [10:19:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:19:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2210.codfw.wmnet with reason: Maintenance [10:19:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2210 (T407997)', diff saved to https://phabricator.wikimedia.org/P85153 and previous config saved to /var/cache/conftool/dbconfig/20251111-101956-marostegui.json [10:20:07] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:22:08] (03Merged) 10jenkins-bot: rest-gateway: enable rate limit infrastructure, enforce no limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202647 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:23:06] (03CR) 10Filippo Giunchedi: "Deployment plan is outlined in https://phabricator.wikimedia.org/T399180#11310845" [puppet] - 10https://gerrit.wikimedia.org/r/1203383 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [10:23:48] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:23:51] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:25:03] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:26:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T407997)', diff saved to https://phabricator.wikimedia.org/P85154 and previous config saved to /var/cache/conftool/dbconfig/20251111-102637-marostegui.json [10:26:41] T407997: 
Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:32:10] (03PS1) 10Clément Goubert: api-gateway: Set sane default for ratelimiter.log_level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203785 [10:33:10] (03PS1) 10Daniel Kinzler: rest_gateway: Rename the user_class descriptor key to ratelimit_class. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203786 (https://phabricator.wikimedia.org/T409155) [10:33:41] (03CR) 10Muehlenhoff: "For clarification: The purpose of this reimage is that test that the installation is both unattended and retains /srv" [puppet] - 10https://gerrit.wikimedia.org/r/1203778 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [10:33:43] (03CR) 10Muehlenhoff: [C:03+2] Test Partman workaround to also cover the reuse workflow [puppet] - 10https://gerrit.wikimedia.org/r/1203778 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [10:36:00] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:37:08] (03CR) 10JMeybohm: containerd: add cni bin directory config on Trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [10:37:23] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Set sane default for ratelimiter.log_level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203785 (owner: 10Clément Goubert) [10:38:29] (03CR) 10JMeybohm: mw-web: Remove the hard-coded k8s version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1201804 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [10:39:18] (03Merged) 10jenkins-bot: api-gateway: Set sane default for ratelimiter.log_level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203785 (owner: 10Clément Goubert) [10:39:56] (03CR) 10Elukey: containerd: add cni bin directory config on Trixie (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [10:41:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P85155 and previous config saved to /var/cache/conftool/dbconfig/20251111-104145-marostegui.json [10:42:04] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:42:04] (03PS5) 10Elukey: containerd: add cni bin directory config on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) [10:42:13] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:42:17] (03CR) 10Elukey: containerd: add cni bin directory config on Trixie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1203500 (https://phabricator.wikimedia.org/T405891) (owner: 10Elukey) [10:44:35] (03PS1) 10Dpogorzelski: ml-services: fix resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203787 (https://phabricator.wikimedia.org/T409414) [10:44:44] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:45:00] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:45:09] (03CR) 10Dpogorzelski: [C:03+2] ml-services: fix resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203787 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski) [10:46:56] (03Merged) 10jenkins-bot: ml-services: fix resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203787 (https://phabricator.wikimedia.org/T409414) (owner: 10Dpogorzelski) [10:48:43] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[10:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:51:55] (03PS6) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) [10:52:42] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:52:56] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:54:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [10:54:33] (03PS7) 10Daniel Kinzler: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) [10:56:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P85156 and previous config saved to /var/cache/conftool/dbconfig/20251111-105652-marostegui.json [10:57:02] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [10:58:08] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [10:58:47] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [10:59:28] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1100) [11:00:43] (03Merged) 10jenkins-bot: rest-gateway: enable rate limit infrastructure, allow manual testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202654 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [11:03:03] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [11:04:53] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:04:56] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:05:49] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:06:05] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:07:00] (03CR) 10Slyngshede: [C:03+2] data.yaml: retire non-FIDO2 ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1203759 (owner: 10Slyngshede) [11:10:23] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:10:35] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:11:59] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:12:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T407997)', diff saved to https://phabricator.wikimedia.org/P85157 and previous config saved to /var/cache/conftool/dbconfig/20251111-111200-marostegui.json [11:12:04] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:12:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [11:12:19] !log cgoubert@deploy2002 helmfile [staging] DONE 
helmfile.d/services/api-gateway: apply [11:12:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2219 (T407997)', diff saved to https://phabricator.wikimedia.org/P85158 and previous config saved to /var/cache/conftool/dbconfig/20251111-111225-marostegui.json [11:12:29] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:12:52] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:12:56] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance registry1005:9100) - https://phabricator.wikimedia.org/T409817 (10LSobanski) 03NEW [11:12:57] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [11:13:18] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance registry1005:9100) - https://phabricator.wikimedia.org/T409817#11362530 (10LSobanski) Same alert is also active for registry2004 and registry2005. [11:13:20] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [11:14:07] (03PS1) 10Muehlenhoff: es2028: Test standard reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1203788 (https://phabricator.wikimedia.org/T408777) [11:15:06] (03PS3) 10Daniel Kinzler: rest-gateway: enable rate limits on some routes in shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202658 (https://phabricator.wikimedia.org/T406498) [11:15:16] (03PS1) 10Btullis: Standardize the group ownership of the keytab files on stat servers [puppet] - 10https://gerrit.wikimedia.org/r/1203789 (https://phabricator.wikimedia.org/T409770) [11:16:01] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:16:09] RECOVERY - haproxy failover on dbproxy1029 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:16:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS 
(CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7602/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203789 (https://phabricator.wikimedia.org/T409770) (owner: 10Btullis) [11:16:28] !log reload haproxy on dbprox1024, dbproxy1029 [11:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:05] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: enable rate limits on some routes in shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202658 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [11:18:19] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1203790 (https://phabricator.wikimedia.org/T409818) [11:18:24] (03PS1) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1203791 (https://phabricator.wikimedia.org/T409818) [11:19:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T407997)', diff saved to https://phabricator.wikimedia.org/P85159 and previous config saved to /var/cache/conftool/dbconfig/20251111-111906-marostegui.json [11:19:11] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [11:19:53] (03Merged) 10jenkins-bot: rest-gateway: enable rate limits on some routes in shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202658 (https://phabricator.wikimedia.org/T406498) (owner: 10Daniel Kinzler) [11:20:38] 07sre-alert-triage, 06serviceops: Alert in need of triage: SystemdUnitFailed (instance registry1005:9100) - https://phabricator.wikimedia.org/T409817#11362573 (10MoritzMuehlenhoff) That's for the unused stub service, the actual registries are served by docker_registry::instance. I think replacing the service... 
[11:20:40] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [11:20:55] (03CR) 10Muehlenhoff: [C:03+2] es2028: Test standard reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1203788 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [11:21:10] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [11:23:00] (03CR) 10Gehel: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1203789 (https://phabricator.wikimedia.org/T409770) (owner: 10Btullis) [11:26:39] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:26:42] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:31:11] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:31:24] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:33:57] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:34:09] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:34:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P85160 and previous config saved to /var/cache/conftool/dbconfig/20251111-113414-marostegui.json [11:36:00] (03PS2) 10Daniel Kinzler: rest-gateway: define catch-all rate limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202998 (https://phabricator.wikimedia.org/T409543) [11:38:46] (03CR) 10Btullis: [V:03+1 C:03+2] Standardize the group ownership of the keytab files on stat servers [puppet] - 10https://gerrit.wikimedia.org/r/1203789 (https://phabricator.wikimedia.org/T409770) (owner: 10Btullis) [11:41:08] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: define catch-all rate limit [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1202998 (https://phabricator.wikimedia.org/T409543) (owner: 10Daniel Kinzler) [11:43:09] (03Merged) 10jenkins-bot: rest-gateway: define catch-all rate limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202998 (https://phabricator.wikimedia.org/T409543) (owner: 10Daniel Kinzler) [11:48:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [11:49:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P85161 and previous config saved to /var/cache/conftool/dbconfig/20251111-114921-marostegui.json [11:49:25] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:49:36] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:55:22] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:55:37] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:56:27] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:56:37] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:58:12] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [11:58:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bookworm [11:59:06] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:59:27] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:04:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T407997)', diff saved to https://phabricator.wikimedia.org/P85162 and previous config saved to /var/cache/conftool/dbconfig/20251111-120429-marostegui.json [12:04:34] T407997: Drop the afl_ip 
column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:04:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2236.codfw.wmnet with reason: Maintenance [12:04:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2236 (T407997)', diff saved to https://phabricator.wikimedia.org/P85163 and previous config saved to /var/cache/conftool/dbconfig/20251111-120453-marostegui.json [12:05:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [12:08:36] (03PS1) 10Kosta Harlan: hCaptcha: Disable addurl trigger for hCaptcha edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203797 (https://phabricator.wikimedia.org/T409822) [12:11:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T407997)', diff saved to https://phabricator.wikimedia.org/P85164 and previous config saved to /var/cache/conftool/dbconfig/20251111-121133-marostegui.json [12:11:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:16:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [12:16:36] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [12:19:43] (03CR) 10Jakob: Report integrity metric from wikidata dump scripts (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [12:19:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [12:21:31] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11362755 (10Nahid) 05Open→03Resolved Thank you 
for the pointer, @Dzahn. And, thank you, @Raine. Sarah and I had a call, and this works now. Apologies, it was the iss... [12:22:44] !log drop database if exists alswikibooks; drop database if exists alswikiquote; drop database if exists alswiktionary; drop database if exists boardvote2005; drop database if exists boardvote2006; drop database if exists boardvote; (T297297) [12:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:48] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [12:26:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P85165 and previous config saved to /var/cache/conftool/dbconfig/20251111-122640-marostegui.json [12:29:06] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:29:39] !log ladsgroup@deploy2002:~$ mwscript-k8s --dblist=all --follow -- userOptions.php --delete mfMode [12:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:17] (03PS1) 10Dreamy Jazz: CodexTablePager: Only show visible table caption if configured [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203800 (https://phabricator.wikimedia.org/T409807) [12:30:27] jouncebot: nowandnext [12:30:27] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [12:30:27] In 0 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1300) [12:30:37] 06SRE, 06Traffic: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735#11362769 (10cmooney) Thanks @ssingh. 
I'm just reading about this RFC for the first time, I wonder longer term might it be a goal to automate the ingestion of data from such feeds to update our maps a... [12:30:45] Anyone mind if I deploy a backport now? [12:31:20] Dreamy_Jazz: go go go [12:31:25] Thanks! [12:32:12] FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:32:17] (03PS1) 10Kosta Harlan: OutputPage: Export the error message key as a client-side config var [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203801 (https://phabricator.wikimedia.org/T409431) [12:32:25] (03PS1) 10Kosta Harlan: ext.wikimediaEvents.createAccount: Instrument error page erros [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203802 (https://phabricator.wikimedia.org/T409431) [12:32:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203800 (https://phabricator.wikimedia.org/T409807) (owner: 10Dreamy Jazz) [12:33:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203802 (https://phabricator.wikimedia.org/T409431) (owner: 10Kosta Harlan) [12:33:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203547 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [12:33:46] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203797 (https://phabricator.wikimedia.org/T409822) (owner: 10Kosta Harlan) [12:41:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P85166 and previous config saved to /var/cache/conftool/dbconfig/20251111-124148-marostegui.json [12:43:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS bookworm [12:47:15] (03CR) 10Zabe: "I think this is ready now." [puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) (owner: 10Zabe) [12:48:42] (03PS3) 10Zabe: maintain-views: Hide rev_sha1 and ar_sha1 from wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) [12:48:57] (03CR) 10Ladsgroup: [V:03+2 C:03+2] maintain-views: Hide rev_sha1 and ar_sha1 from wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) (owner: 10Zabe) [12:49:32] (03Merged) 10jenkins-bot: CodexTablePager: Only show visible table caption if configured [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203800 (https://phabricator.wikimedia.org/T409807) (owner: 10Dreamy Jazz) [12:50:27] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1203800|CodexTablePager: Only show visible table caption if configured (T409807)]] [12:50:31] T409807: SuggestedInvestigations shows unexpected caption for table - https://phabricator.wikimedia.org/T409807 [12:51:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [12:52:47] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1203800|CodexTablePager: Only show visible table 
caption if configured (T409807)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:56:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T407997)', diff saved to https://phabricator.wikimedia.org/P85167 and previous config saved to /var/cache/conftool/dbconfig/20251111-125656-marostegui.json [12:57:00] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [12:57:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2237.codfw.wmnet with reason: Maintenance [12:57:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T407997)', diff saved to https://phabricator.wikimedia.org/P85168 and previous config saved to /var/cache/conftool/dbconfig/20251111-125720-marostegui.json [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1300) [13:02:36] (03PS1) 10Muehlenhoff: reuse-db-trixie.cfg: Try different option [puppet] - 10https://gerrit.wikimedia.org/r/1203806 (https://phabricator.wikimedia.org/T408777) [13:03:03] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [13:04:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T407997)', diff saved to https://phabricator.wikimedia.org/P85169 and previous config saved to /var/cache/conftool/dbconfig/20251111-130405-marostegui.json [13:04:09] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:04:15] (03CR) 10Muehlenhoff: [C:03+2] reuse-db-trixie.cfg: Try different option [puppet] - 10https://gerrit.wikimedia.org/r/1203806 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [13:06:12] if the current window is not used, I'll start on my backports 
(scheduled for the window an hour from now) soon [13:09:03] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [13:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:41] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203800|CodexTablePager: Only show visible table caption if configured (T409807)]] (duration: 19m 14s) [13:09:45] T409807: SuggestedInvestigations shows unexpected caption for table - https://phabricator.wikimedia.org/T409807 [13:09:51] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1202087 (owner: 10Elukey) [13:18:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203801 (https://phabricator.wikimedia.org/T409431) (owner: 10Kosta Harlan) [13:18:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203802 (https://phabricator.wikimedia.org/T409431) (owner: 10Kosta Harlan) [13:19:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P85170 and previous config saved to /var/cache/conftool/dbconfig/20251111-131912-marostegui.json [13:19:16] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [13:19:33] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1220 gradually with 4 steps - Pool db1220.eqiad.wmnet in after cloning [13:22:53] (03PS5) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [13:24:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [13:27:43] (03CR) 10Dreamy Jazz: hCaptcha: Disable addurl trigger for hCaptcha edits (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203797 (https://phabricator.wikimedia.org/T409822) (owner: 10Kosta Harlan) [13:28:54] (03CR) 10Dreamy Jazz: hCaptcha: Use FancyCaptcha for API edits and page creations (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203547 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [13:29:04] (03PS6) 10Btullis: Enable an oauth2-proxy for growthbook frontend and api pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202726 (https://phabricator.wikimedia.org/T409183) [13:29:09] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [13:29:27] (03CR) 10Kosta Harlan: hCaptcha: Disable addurl trigger for hCaptcha edits (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203797 (https://phabricator.wikimedia.org/T409822) (owner: 10Kosta Harlan) [13:29:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203416 (https://phabricator.wikimedia.org/T409691) (owner: 10STran) [13:30:07] (03CR) 10Dreamy Jazz: hCaptcha: Disable addurl trigger for hCaptcha edits (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203797 (https://phabricator.wikimedia.org/T409822) (owner: 10Kosta Harlan) [13:30:29] (03CR) 10Dreamy Jazz: [C:03+1] hCaptcha: Disable addurl trigger for hCaptcha edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203797 (https://phabricator.wikimedia.org/T409822) (owner: 10Kosta Harlan) [13:30:40] (03CR) 10Kosta Harlan: hCaptcha: 
Use FancyCaptcha for API edits and page creations (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203547 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [13:31:14] (03PS5) 10Esanders: Enable DiscussionTools visual enhancements everywhere except enwiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264) [13:31:38] (03CR) 10Dreamy Jazz: hCaptcha: Use FancyCaptcha for API edits and page creations (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203547 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [13:33:20] (03PS1) 10Muehlenhoff: Switch es2028 back to the trixie variant of the Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1203809 [13:33:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11362892 (10Jclark-ctr) @BTullis Unfortunately, that is a 4TB drive, and we would need to order a replacement. Please let me know if yo... 
[13:34:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P85172 and previous config saved to /var/cache/conftool/dbconfig/20251111-133419-marostegui.json [13:34:35] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [13:35:11] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [13:35:37] (03Merged) 10jenkins-bot: OutputPage: Export the error message key as a client-side config var [core] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203801 (https://phabricator.wikimedia.org/T409431) (owner: 10Kosta Harlan) [13:35:40] (03Merged) 10jenkins-bot: ext.wikimediaEvents.createAccount: Instrument error page erros [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203802 (https://phabricator.wikimedia.org/T409431) (owner: 10Kosta Harlan) [13:36:12] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1203801|OutputPage: Export the error message key as a client-side config var (T409431)]], [[gerrit:1203802|ext.wikimediaEvents.createAccount: Instrument error page erros (T409431)]] [13:36:15] T409431: SpecialCreateAccount instrumentation: Record event on error page - https://phabricator.wikimedia.org/T409431 [13:36:32] (03PS1) 10Kosta Harlan: hcaptcha: Don't prevent form submissions unless making an edit [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203810 (https://phabricator.wikimedia.org/T408693) [13:36:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203810 (https://phabricator.wikimedia.org/T408693) (owner: 10Kosta Harlan) [13:37:43] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1203557 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [13:38:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [13:38:26] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1203801|OutputPage: Export the error message key as a client-side config var (T409431)]], [[gerrit:1203802|ext.wikimediaEvents.createAccount: Instrument error page erros (T409431)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:42:32] !log kharlan@deploy2002 kharlan: Continuing with sync [13:43:00] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [13:43:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [13:43:28] (03CR) 10Muehlenhoff: [C:03+2] Switch es2028 back to the trixie variant of the Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1203809 (owner: 10Muehlenhoff) [13:43:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie [13:44:20] (03PS1) 10Marostegui: control-mariadb-10.11-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1203811 (https://phabricator.wikimedia.org/T409533) [13:45:04] FIRING: OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - 
https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [13:46:02] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1203811 (https://phabricator.wikimedia.org/T409533) (owner: 10Marostegui) [13:46:31] (03Merged) 10jenkins-bot: control-mariadb-10.11-bookworm: Update version [software] - 10https://gerrit.wikimedia.org/r/1203811 (https://phabricator.wikimedia.org/T409533) (owner: 10Marostegui) [13:47:16] (03PS1) 10Sergio Gimeno: EventStramConfig: add stream for Growth's revise tone experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) [13:47:35] (03PS1) 10Filippo Giunchedi: replace legacy facts in etcd v3 [puppet] - 10https://gerrit.wikimedia.org/r/1203813 [13:48:34] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203801|OutputPage: Export the error message key as a client-side config var (T409431)]], [[gerrit:1203802|ext.wikimediaEvents.createAccount: Instrument error page erros (T409431)]] (duration: 12m 22s) [13:48:38] T409431: SpecialCreateAccount instrumentation: Record event on error page - https://phabricator.wikimedia.org/T409431 [13:49:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T407997)', diff saved to https://phabricator.wikimedia.org/P85174 and previous config saved to /var/cache/conftool/dbconfig/20251111-134926-marostegui.json [13:49:28] !log ladsgroup@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [13:49:31] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:49:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance [13:50:04] 
RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [13:53:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [13:53:30] (03PS2) 10Stevemunene: stat: Remove the airflow package from stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1202050 (https://phabricator.wikimedia.org/T409262) [13:53:50] (03CR) 10Gehel: [C:03+2] stat: Remove the airflow package from stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1202050 (https://phabricator.wikimedia.org/T409262) (owner: 10Stevemunene) [13:54:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2240.codfw.wmnet with reason: Maintenance [13:54:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T407997)', diff saved to https://phabricator.wikimedia.org/P85176 and previous config saved to /var/cache/conftool/dbconfig/20251111-135444-marostegui.json [13:54:48] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:55:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203810 (https://phabricator.wikimedia.org/T408693) (owner: 10Kosta Harlan) [13:56:00] !log Install new MariaDB 10.11.15 on db1169 T409533 [13:56:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 1:00:00 on db1169.eqiad.wmnet with reason: Upgrading mariadb [13:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:03] T409533: Compile and package MariaDB 10.11.15 - https://phabricator.wikimedia.org/T409533 [13:57:36] (03Merged) 10jenkins-bot: hcaptcha: Don't prevent form submissions unless making an edit [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203810 (https://phabricator.wikimedia.org/T408693) (owner: 10Kosta Harlan) [13:58:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [13:58:06] (03PS3) 10Federico Ceratto: sre.mysql.clone: Pool in source host ASAP [cookbooks] - 10https://gerrit.wikimedia.org/r/1202679 [13:58:09] (03CR) 10Lucas Werkmeister (WMDE): "approved (T407737#11350215)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202164 (https://phabricator.wikimedia.org/T407737) (owner: 10Lucas Werkmeister (WMDE)) [13:58:10] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1203810|hcaptcha: Don't prevent form submissions unless making an edit (T408693)]] [13:58:14] T408693: hCaptcha: Clicking "Show preview" and "Show changes" triggers hCaptcha, and then publishes edit - https://phabricator.wikimedia.org/T408693 [13:58:53] (03PS2) 10Lucas Werkmeister (WMDE): Enable the MEX / wbui2025 beta feature on testwikidata (v2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202164 (https://phabricator.wikimedia.org/T407737) [13:59:39] jouncebot: refresh [13:59:40] I refreshed my knowledge about deployments. 
[14:00:04] FIRING: OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:00:05] Urbanecm and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1400). [14:00:05] abijeet and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:20] o/ [14:00:23] (03PS4) 10Federico Ceratto: sre.mysql.clone: Pool in source host ASAP [cookbooks] - 10https://gerrit.wikimedia.org/r/1202679 [14:00:33] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11362997 (10Chandra-WMDE) Hi @CDanis - Fine, No worries. I can create a new one. Can we continue with the same request or do I need to create a new one ? [14:01:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1203813 (owner: 10Filippo Giunchedi) [14:01:19] hi abijeet [14:01:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T407997)', diff saved to https://phabricator.wikimedia.org/P85177 and previous config saved to /var/cache/conftool/dbconfig/20251111-140125-marostegui.json [14:01:29] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:01:47] hi kostajh [14:01:56] kostajh, can you deploy my patch too? 
[14:02:02] abijeet: yes, was about to offer doing that [14:02:09] kostajh, thank you :-) [14:02:20] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1203810|hcaptcha: Don't prevent form submissions unless making an edit (T408693)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:02:29] abijeet: do you need to verify it? [14:02:38] Hi I've a patch too (can be also deployed with other patches) [14:03:12] kostajh, yea, will take a minute. Just need to ensure i didn't bring the wiki down. [14:03:54] (03Abandoned) 10Federico Ceratto: sre.mysql.clone: Pool in source host ASAP [cookbooks] - 10https://gerrit.wikimedia.org/r/1202679 (owner: 10Federico Ceratto) [14:05:02] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1220 gradually with 4 steps - Pool db1220.eqiad.wmnet in after cloning [14:05:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1220.eqiad.wmnet onto db1264.eqiad.wmnet [14:05:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:05:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [14:05:27] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7603/console" [puppet] - 10https://gerrit.wikimedia.org/r/1203813 (owner: 10Filippo Giunchedi) [14:05:39] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11363033 (10AndrewTavis_WMDE) I'm sure we'll be fine with continuing on this request @Chandra-WMDE, and 
you can edit the task to reflect the new public key :) [14:05:47] (03CR) 10JMeybohm: [C:03+1] "Nice find!" [puppet] - 10https://gerrit.wikimedia.org/r/1203782 (owner: 10Filippo Giunchedi) [14:05:48] (03PS2) 10Muehlenhoff: Fix cumin alias for maps [puppet] - 10https://gerrit.wikimedia.org/r/1203390 (https://phabricator.wikimedia.org/T381565) [14:05:52] !log kharlan@deploy2002 kharlan: Continuing with sync [14:06:43] (03PS1) 10Elukey: admin_ng: add lsw1-d8-eqiad to BGPPeers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203820 [14:07:37] Superpes12: ok, I can sync your patch too. Do you need to verify it? [14:07:37] (03CR) 10Alexandros Kosiaris: [C:04-1] "I am fine with the pod bump, but as I comment inline, 64GB RAM is occupying 60% of a worker in many case and 50% of a worker in the best c" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: 10KartikMistry) [14:08:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:08:19] (03PS1) 10Marostegui: es1033: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1203821 (https://phabricator.wikimedia.org/T409257) [14:08:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1033 T409257', diff saved to https://phabricator.wikimedia.org/P85179 and previous config saved to /var/cache/conftool/dbconfig/20251111-140849-marostegui.json [14:08:53] (03CR) 10Marostegui: [C:03+2] es1033: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1203821 (https://phabricator.wikimedia.org/T409257) (owner: 10Marostegui) [14:08:53] T409257: Move es1033 (es2 Debian Trixie) to es7 - https://phabricator.wikimedia.org/T409257 [14:09:22] (03CR) 10Elukey: [C:03+1] 
Fix cumin alias for maps [puppet] - 10https://gerrit.wikimedia.org/r/1203390 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:09:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1033.eqiad.wmnet with reason: Moving es1033 to es7 [14:09:34] kostajh I need to test it but it's very simple [14:09:46] Superpes12: ok, I'll let you know when it's on mwdebug [14:12:00] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203810|hcaptcha: Don't prevent form submissions unless making an edit (T408693)]] (duration: 13m 50s) [14:12:04] T408693: hCaptcha: Clicking "Show preview" and "Show changes" triggers hCaptcha, and then publishes edit - https://phabricator.wikimedia.org/T408693 [14:12:15] (03CR) 10Muehlenhoff: [C:03+2] Fix cumin alias for maps [puppet] - 10https://gerrit.wikimedia.org/r/1203390 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:13:04] abijeet: ok I'll sync your config patch next [14:13:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:13:10] kostajh, ok [14:13:13] (03CR) 10Ssingh: [C:03+1] hiera: temporarily point codfw LVS at conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1203556 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:13:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [14:13:29] (03PS1) 10Marostegui: mariadb: Move es1033 to es7 [puppet] - 10https://gerrit.wikimedia.org/r/1203823 (https://phabricator.wikimedia.org/T409257) [14:14:02] (03CR) 
10Filippo Giunchedi: [V:03+1 C:03+2] replace legacy facts in etcd v3 [puppet] - 10https://gerrit.wikimedia.org/r/1203813 (owner: 10Filippo Giunchedi) [14:14:07] (03PS2) 10Filippo Giunchedi: replace legacy facts in etcd v3 [puppet] - 10https://gerrit.wikimedia.org/r/1203813 [14:14:18] (03Merged) 10jenkins-bot: Remove SpecialContributeSkinsEnabled for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202144 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [14:14:18] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] replace legacy facts in etcd v3 [puppet] - 10https://gerrit.wikimedia.org/r/1203813 (owner: 10Filippo Giunchedi) [14:14:32] (03CR) 10Filippo Giunchedi: [C:03+2] profile: fix etcd::v3 firewall [puppet] - 10https://gerrit.wikimedia.org/r/1203782 (owner: 10Filippo Giunchedi) [14:14:42] (03CR) 10Marostegui: [C:03+2] mariadb: Move es1033 to es7 [puppet] - 10https://gerrit.wikimedia.org/r/1203823 (https://phabricator.wikimedia.org/T409257) (owner: 10Marostegui) [14:14:47] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1202144|Remove SpecialContributeSkinsEnabled for special wikis (T400067)]] [14:14:51] T400067: Clean up LPL-owned settings on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400067 [14:15:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:16:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P85180 and previous config saved to /var/cache/conftool/dbconfig/20251111-141632-marostegui.json [14:17:16] !log kharlan@deploy2002 kharlan, abi: Backport for [[gerrit:1202144|Remove SpecialContributeSkinsEnabled for 
special wikis (T400067)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:18:06] abijeet: it's on mwdebug [14:18:19] thanks, testing [14:19:35] !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of es1039.eqiad.wmnet onto es1033.eqiad.wmnet [14:19:39] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool es1039 - Depool es1039.eqiad.wmnet to then clone it to es1033.eqiad.wmnet - marostegui@cumin1003 [14:19:43] (03PS3) 10Muehlenhoff: osm_replica: Remove support for pre Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1202664 (https://phabricator.wikimedia.org/T381565) [14:19:44] kostajh, looks ok, nothing broken [14:20:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:20:26] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test2001.codfw.wmnet [14:21:03] abijeet: ok! [14:21:05] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [14:21:06] 06SRE, 06Traffic: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735#11363159 (10ssingh) >>! In T409735#11362769, @cmooney wrote: > Thanks @ssingh. I'm just reading about this RFC for the first time, I wonder longer term might it be a goal to automate the ingestion of... 
[14:21:06] !log kharlan@deploy2002 kharlan, abi: Continuing with sync [14:21:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es1039 - Depool es1039.eqiad.wmnet to then clone it to es1033.eqiad.wmnet - marostegui@cumin1003 [14:22:04] FIRING: OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:22:29] (03CR) 10Elukey: [C:03+1] osm_replica: Remove support for pre Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1202664 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:23:36] (03PS1) 10Kosta Harlan: Instrument hCaptcha risk signal in edits [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203824 (https://phabricator.wikimedia.org/T405597) [14:24:06] (03CR) 10Muehlenhoff: [C:03+2] osm_replica: Remove support for pre Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1202664 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:24:07] fceratto@cumin1003 clone (PID 2017767) is awaiting input [14:24:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [14:25:31] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202144|Remove SpecialContributeSkinsEnabled for special wikis (T400067)]] (duration: 10m 43s) [14:25:35] T400067: Clean up LPL-owned settings on ex-wikipedia special wikis - https://phabricator.wikimedia.org/T400067 [14:26:19] (03CR) 10Atieno: [V:03+1 C:03+1] Route /page/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1199035 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [14:26:55] Superpes12: syncing your patch next [14:26:58] 
kostajh, thank you [14:27:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:27:05]  Yep I'm here :) [14:27:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203566 (https://phabricator.wikimedia.org/T409789) (owner: 10Superpes15) [14:27:29] abijeet: you're welcome! [14:28:16] (03Merged) 10jenkins-bot: [arwikibooks] Add an alias for project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203566 (https://phabricator.wikimedia.org/T409789) (owner: 10Superpes15) [14:28:48] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1203566|[arwikibooks] Add an alias for project namespace (T409789)]] [14:28:52] T409789: Namespace alias on ar.wikibooks - https://phabricator.wikimedia.org/T409789 [14:29:42] (03CR) 10Cathal Mooney: [C:04-1] "I'll take a look at the wider issue later today. 
The device should remain on the private1-d-eqiad vlan and peer with the CRs, as it was l" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203820 (owner: 10Elukey) [14:30:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:31:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203824 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [14:31:11] !log kharlan@deploy2002 superpes, kharlan: Backport for [[gerrit:1203566|[arwikibooks] Add an alias for project namespace (T409789)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:31:14] Testing! [14:31:35] Looks fine thanks kostajh :) [14:31:39] Superpes12: ok! 
[14:31:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P85182 and previous config saved to /var/cache/conftool/dbconfig/20251111-143140-marostegui.json [14:31:42] !log kharlan@deploy2002 superpes, kharlan: Continuing with sync [14:32:10] (03CR) 10Silvan Heintze: Report integrity metric from wikidata dump scripts (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [14:33:49] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2230.codfw.wmnet onto db-test2001.codfw.wmnet [14:35:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:35:58] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203566|[arwikibooks] Add an alias for project namespace (T409789)]] (duration: 07m 10s) [14:36:02] T409789: Namespace alias on ar.wikibooks - https://phabricator.wikimedia.org/T409789 [14:36:21] Many thanks for your assistance kostajh :3 [14:36:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203824 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [14:37:42] (03Merged) 10jenkins-bot: Instrument hCaptcha risk signal in edits [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1203824 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [14:38:13] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1203824|Instrument hCaptcha risk signal in edits (T405597)]] [14:38:17] T405597: 
hCaptcha: Update instrumentation for risk score - https://phabricator.wikimedia.org/T405597 [14:40:13] !log cmooney@cumin1003 START - Cookbook sre.hosts.dhcp for host sretest1006.eqiad.wmnet [14:40:34] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1203824|Instrument hCaptcha risk signal in edits (T405597)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:41:34] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:43:16] !log kharlan@deploy2002 kharlan: Continuing with sync [14:43:16] cmooney@cumin1003 dhcp (PID 2038099) is awaiting input [14:43:28] (03Abandoned) 10Kamila Součková: benthos-cache-invalidator: clean up releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199340 (owner: 10Kamila Součková) [14:46:34] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:46:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T407997)', diff saved to https://phabricator.wikimedia.org/P85183 and previous config saved to /var/cache/conftool/dbconfig/20251111-144648-marostegui.json [14:46:52] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:47:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2245.codfw.wmnet with reason: Maintenance [14:47:12] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2245 (T407997)', diff saved to https://phabricator.wikimedia.org/P85184 and previous config saved to /var/cache/conftool/dbconfig/20251111-144711-marostegui.json [14:47:28] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203824|Instrument hCaptcha risk signal in edits (T405597)]] (duration: 09m 15s) [14:47:32] T405597: hCaptcha: Update instrumentation for risk score - https://phabricator.wikimedia.org/T405597 [14:49:24] (03CR) 10Kamila Součková: "Yeah, not for now, I meant eventually." [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [14:49:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS trixie [14:50:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203547 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [14:50:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203797 (https://phabricator.wikimedia.org/T409822) (owner: 10Kosta Harlan) [14:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:51:30] (03Merged) 10jenkins-bot: hCaptcha: Use FancyCaptcha for API edits and page creations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203547 (https://phabricator.wikimedia.org/T405595) (owner: 10Kosta Harlan) [14:51:32] (03Merged) 10jenkins-bot: hCaptcha: Disable addurl trigger for hCaptcha edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203797 (https://phabricator.wikimedia.org/T409822) (owner: 10Kosta Harlan) [14:51:34] 
RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:52:02] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1203547|hCaptcha: Use FancyCaptcha for API edits and page creations (T405595)]], [[gerrit:1203797|hCaptcha: Disable addurl trigger for hCaptcha edits (T409822)]] [14:52:09] T405595: hCaptcha: Create mechanism to allow the showcaptcha consequence in AbuseFilter to always challenge the user - https://phabricator.wikimedia.org/T405595 [14:52:09] T409822: hCaptcha: Disable addurl rule for the editing trial - https://phabricator.wikimedia.org/T409822 [14:53:25] (03PS1) 10Kosta Harlan: (WIP) EventLogging: Register mediawiki.hcaptcha.risk_score stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203829 (https://phabricator.wikimedia.org/T405597) [14:53:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T407997)', diff saved to https://phabricator.wikimedia.org/P85185 and previous config saved to /var/cache/conftool/dbconfig/20251111-145355-marostegui.json [14:53:59] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:54:11] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198936 (https://phabricator.wikimedia.org/T408223) [14:54:24] (03PS4) 10Muehlenhoff: Remove otto from ops group [puppet] - 10https://gerrit.wikimedia.org/r/1202114 [14:54:47] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1203547|hCaptcha: Use FancyCaptcha for API edits and page creations (T405595)]], [[gerrit:1203797|hCaptcha: Disable addurl trigger for hCaptcha edits (T409822)]] 
synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:55:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [14:56:05] (03CR) 10Kamila Součková: [C:03+1] sre.k8s: Handle errors in kubectl_version() [cookbooks] - 10https://gerrit.wikimedia.org/r/1193111 (https://phabricator.wikimedia.org/T406200) (owner: 10JMeybohm) [14:56:23] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198936 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [14:58:18] (03CR) 10JMeybohm: [C:03+2] sre.k8s: Handle errors in kubectl_version() [cookbooks] - 10https://gerrit.wikimedia.org/r/1193111 (https://phabricator.wikimedia.org/T406200) (owner: 10JMeybohm) [14:59:28] !log kharlan@deploy2002 kharlan: Continuing with sync [15:00:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1500) [15:01:08] (03CR) 10Muehlenhoff: [C:03+2] Remove otto from ops group [puppet] - 10https://gerrit.wikimedia.org/r/1202114 (owner: 10Muehlenhoff) [15:01:37] (03CR) 10KartikMistry: machinetranslation: Increase replica and memory (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202642 (https://phabricator.wikimedia.org/T386371) (owner: 
10KartikMistry) [15:02:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:04:40] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203547|hCaptcha: Use FancyCaptcha for API edits and page creations (T405595)]], [[gerrit:1203797|hCaptcha: Disable addurl trigger for hCaptcha edits (T409822)]] (duration: 12m 38s) [15:04:45] T405595: hCaptcha: Create mechanism to allow the showcaptcha consequence in AbuseFilter to always challenge the user - https://phabricator.wikimedia.org/T405595 [15:04:46] T409822: hCaptcha: Disable addurl rule for the editing trial - https://phabricator.wikimedia.org/T409822 [15:05:02] (03CR) 10CI reject: [V:04-1] sre.k8s: Handle errors in kubectl_version() [cookbooks] - 10https://gerrit.wikimedia.org/r/1193111 (https://phabricator.wikimedia.org/T406200) (owner: 10JMeybohm) [15:07:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:07:07] (03CR) 10JMeybohm: [V:03+2 C:03+2] "Unrelated CI error" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193111 (https://phabricator.wikimedia.org/T406200) (owner: 10JMeybohm) [15:08:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::5e5e:ab00:103d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:08:22] FIRING: [2x] 
JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P85186 and previous config saved to /var/cache/conftool/dbconfig/20251111-150902-marostegui.json [15:09:45] !log UTC afternoon deploys done [15:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:11:24] (03PS1) 10Elukey: Turn paging on for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/1203835 (https://phabricator.wikimedia.org/T381565) [15:13:10] RESOLVED: BFDdown: BFD session down between cr2-drmrs and fe80::5e5e:ab00:103d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:14:52] (03CR) 10Volans: [C:03+1] "Makes sense, one optional alternative inline" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1202087 (owner: 10Elukey) [15:14:57] (03PS2) 10Kosta Harlan: EventLogging: Register mediawiki.hcaptcha.risk_score stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203829 (https://phabricator.wikimedia.org/T405597) [15:15:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:17:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:20:15] ^ yay the monitoring works..... not yay wtf is going on [15:20:21] * topranks looking [15:21:06] (03CR) 10Muehlenhoff: "I'm not convinced; this was an explicit decision (see the original mail "Proposal to switch from paging to only IRC/email alerting for map" [puppet] - 10https://gerrit.wikimedia.org/r/1203835 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:22:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:24:04] FIRING: OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:24:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P85187 and previous config saved to /var/cache/conftool/dbconfig/20251111-152410-marostegui.json [15:27:13] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - 
https://phabricator.wikimedia.org/T409707#11363464 (10WMDECyn) request approved from WMDE side [15:29:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1530) [15:30:08] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11363473 (10cmooney) 05Open→03Resolved a:03cmooney >>! In T408378#11351612, @colewhite wrote: > In today's case, the alert criteria wasn't met because... [15:32:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:32:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [15:33:22] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:44] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11363505 (10ssingh) Hi @Jdrewniak: Daniel has already commented on the questions from Traffic's end (and as it related to the CDN and DNS) and what he has mentione... 
[15:37:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:37:45] (03CR) 10Elukey: "Thanks for the refresh, I didn't remember that email, but a lot of things changed since then. From the service reliability point of view, " [puppet] - 10https://gerrit.wikimedia.org/r/1203835 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [15:38:22] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [15:39:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T407997)', diff saved to https://phabricator.wikimedia.org/P85188 and previous config saved to /var/cache/conftool/dbconfig/20251111-153918-marostegui.json [15:39:22] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:39:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2246.codfw.wmnet with reason: Maintenance [15:39:39] (03PS2) 10Elukey: Add name_filters support to the k8s browser [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1202087 [15:39:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2246 (T407997)', diff saved to https://phabricator.wikimedia.org/P85189 and previous config saved to /var/cache/conftool/dbconfig/20251111-153942-marostegui.json [15:39:51] (03PS1) 
10Kosta Harlan: hCaptcha: Set fallback for ConfirmEditTriggersCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203841 (https://phabricator.wikimedia.org/T409736) [15:40:45] (03CR) 10Elukey: Add name_filters support to the k8s browser (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1202087 (owner: 10Elukey) [15:41:39] (03CR) 10CI reject: [V:04-1] Add name_filters support to the k8s browser [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1202087 (owner: 10Elukey) [15:43:35] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:46:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T407997)', diff saved to https://phabricator.wikimedia.org/P85191 and previous config saved to /var/cache/conftool/dbconfig/20251111-154624-marostegui.json [15:46:29] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:46:52] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker1004.eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [15:47:18] (03PS3) 10Elukey: Add name_filters support to the k8s browser [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1202087 [15:47:56] (03CR) 10Elukey: Add name_filters support to the k8s browser (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1202087 (owner: 10Elukey) [15:48:34] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - 
https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:49:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet [15:53:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet [15:53:34] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:57:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [15:57:14] (03PS1) 10Daniel Kinzler: rest-gateway: enable rate limit in shadow mode on some routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203843 [15:57:42] (03CR) 10Elukey: [C:03+2] Add name_filters support to the k8s browser [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1202087 (owner: 10Elukey) [15:59:04] FIRING: OspfAdjError: OSPF Adjacency not formed on ssw1-d8-eqiad interface ethernet-1/11.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjError [15:59:40] (03Merged) 10jenkins-bot: Add name_filters support to the k8s browser [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1202087 (owner: 10Elukey) [16:00:05] jelto, arnoldokoth, and mutante: OwO what's this, a 
deployment window?? SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1600). nyaa~ [16:01:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P85192 and previous config saved to /var/cache/conftool/dbconfig/20251111-160132-marostegui.json [16:02:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:03:49] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:04:04] RESOLVED: OspfAdjError: OSPF Adjacency not formed on ssw1-d8-eqiad interface ethernet-1/11.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjError [16:07:54] (03CR) 10Elukey: "I just realized that https://netbox.wikimedia.org/dcim/devices/6344/ shows no BGP flag, maybe that is the only issue? 
I didn't get 100% yo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203820 (owner: 10Elukey) [16:08:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P85193 and previous config saved to /var/cache/conftool/dbconfig/20251111-160759-root.json [16:08:49] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:09:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1221.eqiad.wmnet with reason: Maintenance [16:09:27] (03PS1) 10Muehlenhoff: Cleanup unused setting [puppet] - 10https://gerrit.wikimedia.org/r/1203847 (https://phabricator.wikimedia.org/T408777) [16:09:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:09:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1221 (T407997)', diff saved to https://phabricator.wikimedia.org/P85194 and previous config saved to /var/cache/conftool/dbconfig/20251111-160950-marostegui.json [16:09:55] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:10:04] (03PS1) 10Daniel Kinzler: rest-gateway: enable shadow mode limits on nearly all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203848 (https://phabricator.wikimedia.org/T406498) [16:11:04] FIRING: OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - 
https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:12:27] (03PS2) 10Daniel Kinzler: rest-gateway: enable rate limit in shadow mode on some routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203843 [16:12:45] (03PS2) 10Daniel Kinzler: rest-gateway: enable shadow mode limits on nearly all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203848 (https://phabricator.wikimedia.org/T406498) [16:13:49] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:13:51] (03PS3) 10Daniel Kinzler: rest-gateway: enable rate limit in shadow mode on some routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203843 (https://phabricator.wikimedia.org/T406498) [16:16:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:16:16] (03PS1) 10Daniel Kinzler: rest-gateway: clean up test config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203849 (https://phabricator.wikimedia.org/T406498) [16:16:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T407997)', diff saved to https://phabricator.wikimedia.org/P85195 and previous config saved to /var/cache/conftool/dbconfig/20251111-161636-marostegui.json [16:16:41] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - 
https://phabricator.wikimedia.org/T407997 [16:17:42] (03CR) 10Muehlenhoff: [C:03+2] Cleanup unused setting [puppet] - 10https://gerrit.wikimedia.org/r/1203847 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [16:18:49] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:20:24] (03PS3) 10Kosta Harlan: EventLogging: Register mediawiki.hcaptcha.risk_score stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203829 (https://phabricator.wikimedia.org/T405597) [16:21:17] !log Install MariaDB 10.11.15 on Debian Trixie es1033 T409533 [16:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:21] T409533: Compile and package MariaDB 10.11.15 - https://phabricator.wikimedia.org/T409533 [16:22:38] !log Drop afl_ip related triggers from s4 T408780 [16:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:41] T408780: Drop abuse_filter_log trigger for afl_ip column - https://phabricator.wikimedia.org/T408780 [16:23:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P85196 and previous config saved to /var/cache/conftool/dbconfig/20251111-162305-root.json [16:23:49] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:24:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on 
P{dse-k8s-worker1004.eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [16:25:34] (03PS1) 10Marostegui: filtered_tables.txt: Remove afl_ip trigger [puppet] - 10https://gerrit.wikimedia.org/r/1203850 (https://phabricator.wikimedia.org/T408780) [16:26:04] FIRING: OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:26:43] (03CR) 10Ladsgroup: [C:03+1] filtered_tables.txt: Remove afl_ip trigger [puppet] - 10https://gerrit.wikimedia.org/r/1203850 (https://phabricator.wikimedia.org/T408780) (owner: 10Marostegui) [16:26:59] (03CR) 10Marostegui: [C:03+2] filtered_tables.txt: Remove afl_ip trigger [puppet] - 10https://gerrit.wikimedia.org/r/1203850 (https://phabricator.wikimedia.org/T408780) (owner: 10Marostegui) [16:27:30] (03CR) 10Phuedx: [C:03+1] EventLogging: Register mediawiki.hcaptcha.risk_score stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203829 (https://phabricator.wikimedia.org/T405597) (owner: 10Kosta Harlan) [16:28:49] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:29:06] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:31:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to 
https://phabricator.wikimedia.org/P85197 and previous config saved to /var/cache/conftool/dbconfig/20251111-163144-marostegui.json [16:32:12] FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:34:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:38:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P85198 and previous config saved to /var/cache/conftool/dbconfig/20251111-163811-root.json [16:39:04] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:44:19] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:45:45] (03PS1) 10Eevans: cassandra: create ml_inference_service Cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/1203857 (https://phabricator.wikimedia.org/T409850) [16:46:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P85199 and previous config saved to /var/cache/conftool/dbconfig/20251111-164651-marostegui.json [16:49:19] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:51:19] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:55:34] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [16:58:19] FIRING: OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [17:00:05] jhathaway and moritzm: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1700). Please do the needful. [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:34] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [17:02:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T407997)', diff saved to https://phabricator.wikimedia.org/P85200 and previous config saved to /var/cache/conftool/dbconfig/20251111-170159-marostegui.json [17:02:04] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:02:06] jhathaway / moritzm: would it be okay for me to deploy a security patch in the window? [17:02:09] (if there are no requests) [17:03:19] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [17:04:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1242.eqiad.wmnet with reason: Maintenance [17:05:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T407997)', diff saved to https://phabricator.wikimedia.org/P85201 and previous config saved to /var/cache/conftool/dbconfig/20251111-170504-marostegui.json [17:06:19] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [17:06:44] Lucas_WMDE: it's not really a replacement, 
unless there are Puppet patches scheduled, it's not really used as a deployment window for mediawiki things [17:07:10] yeah, I was viewing it more as an empty time in the deployment calendar in that case [17:07:49] (normally I wouldn’t wait for a backport+config window to deploy security fixes, I’d just look for a time with no windows. but it’s evening here and I don’t want to wait another hour) [17:08:10] makes sense [17:08:41] (correction, the next slot with no deployment windows seems to be… 23:00 UTC?) [17:09:07] s/in the window/during the window/ if that makes more sense :) [17:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:10:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 17.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:10:34] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [17:11:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T407997)', diff saved to https://phabricator.wikimedia.org/P85202 and previous config saved to /var/cache/conftool/dbconfig/20251111-171145-marostegui.json [17:11:49] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - 
https://phabricator.wikimedia.org/T407997 [17:14:30] I’ll try using deploy_securitry.py, though it’ll probably break again like it did a few days ago (Thursday? I think?) [17:16:59] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11363845 (10elukey) @herron Hi! Could you please backfill `slo:period_error_budget_remaining:ratio` too? I see that the time series start from Oct 27th, this is the rolling window metric and I'd like to see how... [17:18:01] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d6-eqiad [17:18:30] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device lsw1-d6-eqiad [17:19:24] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d6-eqiad [17:19:39] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-d6-eqiad [17:21:46] (03PS1) 10Elukey: Release version 0.0.17-1 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1203871 [17:21:46] RECOVERY - Host lsw1-d6-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [17:22:01] !log lucaswerkmeister-wmde Deployed security patch for T409737 [17:22:10] yup, failed [17:25:09] quit] [17:25:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 11.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:26:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P85203 and previous config saved to /var/cache/conftool/dbconfig/20251111-172653-marostegui.json [17:28:51] (03CR) 10Elukey: "tagged and built on build2002, all 
good." [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1203871 (owner: 10Elukey) [17:28:59] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d6-eqiad [17:29:17] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d6-eqiad [17:30:11] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d6-eqiad [17:30:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d6-eqiad [17:33:16] !log lucaswerkmeister-wmde Deployed security patch for T409737 [17:33:55] 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854 (10ngkountas) 03NEW [17:34:52] 10SRE-swift-storage, 10Ceph, 06collaboration-services, 10Data-Persistence-Backup: Evaluate generic backup tooling for object storage buckets - https://phabricator.wikimedia.org/T406824#11363901 (10jcrespo) > A more generic and proper tool for creating bucket-level backups We do have a backup system for me... 
[17:36:32] * Lucas_WMDE done deploying [17:36:40] and filed T409855 about the annoyingly broken security deployment process [17:36:40] T409855: Document correct way to deploy security patches - https://phabricator.wikimedia.org/T409855 [17:38:33] (03PS1) 10David Caro: maintaint_dbusers: count skipped account as such [puppet] - 10https://gerrit.wikimedia.org/r/1203873 (https://phabricator.wikimedia.org/T409847) [17:42:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P85204 and previous config saved to /var/cache/conftool/dbconfig/20251111-174200-marostegui.json [17:42:34] RECOVERY - Host lsw1-d6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [17:42:54] (03PS2) 10David Caro: maintaint_dbusers: count skipped account as such [puppet] - 10https://gerrit.wikimedia.org/r/1203873 (https://phabricator.wikimedia.org/T409847) [17:43:02] RECOVERY - Host lsw1-d6-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [17:45:26] (03CR) 10Cathal Mooney: [C:03+1] admin_ng: add lsw1-d8-eqiad to BGPPeers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203820 (owner: 10Elukey) [17:46:54] (03CR) 10David Caro: [C:03+2] maintaint_dbusers: count skipped account as such [puppet] - 10https://gerrit.wikimedia.org/r/1203873 (https://phabricator.wikimedia.org/T409847) (owner: 10David Caro) [17:47:48] (03CR) 10CI reject: [V:04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [17:49:56] (03CR) 10Cathal Mooney: [C:03+1] "Sorry I'm an idiot, I've been swimming in all these new hostnames for the Nokia boxes I didn't realise this was rack E8. 
Which is a Junip" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203820 (owner: 10Elukey) [17:49:59] (03CR) 10Cathal Mooney: [C:03+2] admin_ng: add lsw1-d8-eqiad to BGPPeers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203820 (owner: 10Elukey) [17:51:43] (03PS6) 10David Caro: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [17:52:42] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [17:53:42] (03CR) 10CI reject: [V:04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [17:54:44] (03PS7) 10David Caro: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [17:55:16] (03PS8) 10David Caro: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [17:56:17] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding hcaptcha-proxy.anycast.wmnet - sukhe@cumin1003" [17:56:19] !log add hcaptcha-proxy.anycast.wmnet 10.3.0.10/32: T409780 [17:56:21] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding hcaptcha-proxy.anycast.wmnet - sukhe@cumin1003" [17:56:21] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T407997)', diff saved to https://phabricator.wikimedia.org/P85205 and previous config saved to 
/var/cache/conftool/dbconfig/20251111-175708-marostegui.json [17:57:12] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:57:17] (03CR) 10CI reject: [V:04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [17:57:24] (03Merged) 10jenkins-bot: admin_ng: add lsw1-d8-eqiad to BGPPeers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203820 (owner: 10Elukey) [17:57:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1243.eqiad.wmnet with reason: Maintenance [17:57:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1243 (T407997)', diff saved to https://phabricator.wikimedia.org/P85206 and previous config saved to /var/cache/conftool/dbconfig/20251111-175732-marostegui.json [17:59:23] (03PS1) 10Ssingh: site.pp: add new VMs for hcaptcha proxy (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1203917 (https://phabricator.wikimedia.org/T409780) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1800) [18:04:02] (03PS2) 10Ssingh: site.pp and preseed.yaml: add new VMs for hcaptcha proxy (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1203917 (https://phabricator.wikimedia.org/T409780) [18:04:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T407997)', diff saved to https://phabricator.wikimedia.org/P85207 and previous config saved to /var/cache/conftool/dbconfig/20251111-180414-marostegui.json [18:04:18] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:04:27] (03PS3) 10Ssingh: site.pp and preseed.yaml: add new VMs for hcaptcha proxy (bird/anycast) [puppet] - 
10https://gerrit.wikimedia.org/r/1203917 (https://phabricator.wikimedia.org/T409780) [18:06:02] (03PS9) 10David Caro: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:06:06] (03PS4) 10Ssingh: site.pp and preseed.yaml: add new VMs for hcaptcha proxy (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1203917 (https://phabricator.wikimedia.org/T409780) [18:07:17] (03PS5) 10Ssingh: site.pp and preseed.yaml: add new VMs for hcaptcha proxy (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1203917 (https://phabricator.wikimedia.org/T409780) [18:09:06] (03CR) 10CI reject: [V:04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:11:37] (03PS3) 10Tiziano Fogli: metamonitoring: add icinga module [puppet] - 10https://gerrit.wikimedia.org/r/1203845 (https://phabricator.wikimedia.org/T397003) [18:11:41] (03CR) 10Tiziano Fogli: "The current Icinga external monitoring script has been added as a module to the script used to monitor the Prometheus/Thanos stack." 
[puppet] - 10https://gerrit.wikimedia.org/r/1203845 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [18:12:13] (03PS10) 10David Caro: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:12:27] (03CR) 10CI reject: [V:04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:12:36] (03PS11) 10David Caro: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:12:40] (03PS1) 10Cathal Mooney: Machine Learning beast servers: allow BGP to alternate rack [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1204072 [18:13:37] (03CR) 10CI reject: [V:04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:14:06] (03PS1) 10Ssingh: O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) [18:18:59] (03PS12) 10David Caro: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:19:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P85208 and previous config saved to /var/cache/conftool/dbconfig/20251111-181921-marostegui.json [18:20:35] 06SRE, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast) - https://phabricator.wikimedia.org/T409860 (10ssingh) 03NEW [18:20:53] (03CR) 10CI reject: [V:04-1] replica_cnf_api: Do not check for file 
existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:20:57] 06SRE, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11364269 (10ssingh) [18:22:21] (03CR) 10Cathal Mooney: [C:03+2] "and sorry in my defence the confusion is because the commit msg in this patch metnions D8, which is a Nokia, but ml-serve1012 is in E8 and" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1203820 (owner: 10Elukey) [18:23:27] 06SRE, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11364275 (10ssingh) Initial role can be `insetup::traffic_nftables`. We will reimage to `hcaptcha::proxy` role later, with Debian... [18:24:54] 06SRE, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11364278 (10ssingh) Once the VMs are up, we will need to enable BGP for all of them in Netbox and then run `homer`. 
[18:25:21] (03PS13) 10David Caro: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:27:20] (03CR) 10CI reject: [V:04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:27:34] (03PS1) 10Ssingh: P:bird::anycast_monitoring: add hcaptcha-proxy.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1204074 (https://phabricator.wikimedia.org/T409780) [18:30:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 11 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) (owner: 10Jdlrobson) [18:34:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P85209 and previous config saved to /var/cache/conftool/dbconfig/20251111-183429-marostegui.json [18:35:41] (03PS2) 10Ssingh: O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) [18:38:06] (03PS14) 10David Caro: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [18:49:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T407997)', diff saved to https://phabricator.wikimedia.org/P85210 and previous config saved to /var/cache/conftool/dbconfig/20251111-184936-marostegui.json [18:49:41] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:49:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on db1244.eqiad.wmnet with reason: Maintenance [18:50:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1244 (T407997)', diff saved to https://phabricator.wikimedia.org/P85211 and previous config saved to /var/cache/conftool/dbconfig/20251111-185000-marostegui.json [18:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:56:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T407997)', diff saved to https://phabricator.wikimedia.org/P85212 and previous config saved to /var/cache/conftool/dbconfig/20251111-185637-marostegui.json [18:56:43] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:00:05] andre and jeena: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1900). 
[19:11:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P85213 and previous config saved to /var/cache/conftool/dbconfig/20251111-191146-marostegui.json [19:19:16] (03CR) 10Dreamy Jazz: hCaptcha: Set fallback for ConfirmEditTriggersCaptcha (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203841 (https://phabricator.wikimedia.org/T409736) (owner: 10Kosta Harlan) [19:26:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P85214 and previous config saved to /var/cache/conftool/dbconfig/20251111-192654-marostegui.json [19:41:34] (03PS2) 10Kosta Harlan: hCaptcha: Set fallback for ConfirmEditTriggersCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203841 (https://phabricator.wikimedia.org/T409736) [19:41:38] (03CR) 10Kosta Harlan: hCaptcha: Set fallback for ConfirmEditTriggersCaptcha (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203841 (https://phabricator.wikimedia.org/T409736) (owner: 10Kosta Harlan) [19:42:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T407997)', diff saved to https://phabricator.wikimedia.org/P85215 and previous config saved to /var/cache/conftool/dbconfig/20251111-194201-marostegui.json [19:42:06] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:42:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [19:43:11] marostegui@cumin1003 clone (PID 2016562) is awaiting input [19:47:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1247.eqiad.wmnet with reason: Maintenance [19:47:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1247 
(T407997)', diff saved to https://phabricator.wikimedia.org/P85216 and previous config saved to /var/cache/conftool/dbconfig/20251111-194714-marostegui.json [19:47:19] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:53:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T407997)', diff saved to https://phabricator.wikimedia.org/P85217 and previous config saved to /var/cache/conftool/dbconfig/20251111-195354-marostegui.json [19:53:59] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:09:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P85218 and previous config saved to /var/cache/conftool/dbconfig/20251111-200901-marostegui.json [20:10:08] (03CR) 10Kamila Součková: [C:03+2] mediawiki: Update sendVerifyEmailReminderNotification script location [puppet] - 10https://gerrit.wikimedia.org/r/1202873 (owner: 10Zabe) [20:23:56] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [20:24:08] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [20:24:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P85219 and previous config saved to /var/cache/conftool/dbconfig/20251111-202409-marostegui.json [20:24:53] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [20:24:58] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [20:29:06] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:13] FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:34:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1003.eqiad.wmnet [20:39:10] jouncebot: nowandnext [20:39:10] For the next 0 hour(s) and 20 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T1900) [20:39:10] In 0 hour(s) and 20 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T2100) [20:39:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T407997)', diff saved to https://phabricator.wikimedia.org/P85220 and previous config saved to /var/cache/conftool/dbconfig/20251111-203917-marostegui.json [20:39:21] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:39:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1248.eqiad.wmnet with reason: Maintenance [20:39:35] (03PS1) 10Reedy: fix: don't run listTaskCounts if Newcomer Task are not available [extensions/GrowthExperiments] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1204084 (https://phabricator.wikimedia.org/T408052) [20:39:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1248 (T407997)', diff saved to https://phabricator.wikimedia.org/P85221 and previous config saved to /var/cache/conftool/dbconfig/20251111-203942-marostegui.json [20:39:47] (03CR) 10Reedy: [C:03+2] fix: don't run listTaskCounts if Newcomer Task are 
not available [extensions/GrowthExperiments] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1204084 (https://phabricator.wikimedia.org/T408052) (owner: 10Reedy) [20:40:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet [20:42:49] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1019.eqiad.wmnet [20:46:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T407997)', diff saved to https://phabricator.wikimedia.org/P85222 and previous config saved to /var/cache/conftool/dbconfig/20251111-204618-marostegui.json [20:46:23] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:49:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1019.eqiad.wmnet [20:49:35] (03Merged) 10jenkins-bot: fix: don't run listTaskCounts if Newcomer Task are not available [extensions/GrowthExperiments] (wmf/1.46.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1204084 (https://phabricator.wikimedia.org/T408052) (owner: 10Reedy) [20:51:17] (03PS4) 10Reedy: CommonSettings.php: Reduce usage of wmgUseCentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132720 [20:51:37] (03PS2) 10Reedy: InitialiseSettings: Update comment about wgPopupsConflictingRefTooltipsGadgetName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170736 (https://phabricator.wikimedia.org/T362771) [20:51:40] (03CR) 10Reedy: [C:03+2] InitialiseSettings: Update comment about wgPopupsConflictingRefTooltipsGadgetName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170736 (https://phabricator.wikimedia.org/T362771) (owner: 10Reedy) [20:51:51] (03PS5) 10Reedy: CommonSettings.php: Remove old $wgCentralDBname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348) 
[20:51:56] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Remove old $wgCentralDBname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [20:52:10] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{dse-k8s-worker1011.eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [20:52:28] (03Merged) 10jenkins-bot: InitialiseSettings: Update comment about wgPopupsConflictingRefTooltipsGadgetName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170736 (https://phabricator.wikimedia.org/T362771) (owner: 10Reedy) [20:52:30] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1204084|fix: don't run listTaskCounts if Newcomer Task are not available (T408052 T408531)]] [20:52:36] T408052: PHP Warning: Trying to access array offset on null (via GrowthExperiments listTaskCounts) - https://phabricator.wikimedia.org/T408052 [20:52:36] T408531: GrowthExperiments: Unexpected call to ConfigurationLoader::getTaskTypes when feature is disabled) - https://phabricator.wikimedia.org/T408531 [20:52:39] (03Merged) 10jenkins-bot: CommonSettings.php: Remove old $wgCentralDBname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [20:54:42] !log reedy@deploy2002 reedy: Backport for [[gerrit:1204084|fix: don't run listTaskCounts if Newcomer Task are not available (T408052 T408531)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:55:13] !log reedy@deploy2002 reedy: Continuing with sync [20:56:21] (03PS2) 10Reedy: Drop TemplateData EventStreams/EventLogging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126134 (https://phabricator.wikimedia.org/T258917) [20:59:27] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204084|fix: don't run listTaskCounts if Newcomer Task are not available (T408052 T408531)]] (duration: 06m 57s) [20:59:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{dse-k8s-worker1011.eqiad.wmnet} and (A:dse-k8s-master-eqiad or A:dse-k8s-worker-eqiad) [20:59:33] T408052: PHP Warning: Trying to access array offset on null (via GrowthExperiments listTaskCounts) - https://phabricator.wikimedia.org/T408052 [20:59:33] T408531: GrowthExperiments: Unexpected call to ConfigurationLoader::getTaskTypes when feature is disabled) - https://phabricator.wikimedia.org/T408531 [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T2100). [21:00:05] RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:57] !log reedy@deploy2002 Started scap sync-world: Backport for [[gerrit:1129230|CommonSettings.php: Remove old $wgCentralDBname (T389348)]], [[gerrit:1170736|InitialiseSettings: Update comment about wgPopupsConflictingRefTooltipsGadgetName (T362771)]] [21:01:02] T389348: Migrate CentralNotice to virtual domains - https://phabricator.wikimedia.org/T389348 [21:01:03] T362771: Move ReferencePreviews related config flags to Cite's codebase - https://phabricator.wikimedia.org/T362771 [21:01:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P85223 and previous config saved to /var/cache/conftool/dbconfig/20251111-210126-marostegui.json [21:03:18] !log reedy@deploy2002 reedy: Backport for [[gerrit:1129230|CommonSettings.php: Remove old $wgCentralDBname (T389348)]], [[gerrit:1170736|InitialiseSettings: Update comment about wgPopupsConflictingRefTooltipsGadgetName (T362771)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:03:41] !log reedy@deploy2002 reedy: Continuing with sync [21:07:57] !log reedy@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129230|CommonSettings.php: Remove old $wgCentralDBname (T389348)]], [[gerrit:1170736|InitialiseSettings: Update comment about wgPopupsConflictingRefTooltipsGadgetName (T362771)]] (duration: 07m 00s) [21:08:02] T389348: Migrate CentralNotice to virtual domains - https://phabricator.wikimedia.org/T389348 [21:08:03] T362771: Move ReferencePreviews related config flags to Cite's codebase - https://phabricator.wikimedia.org/T362771 [21:09:07] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:16:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P85224 and previous config saved to /var/cache/conftool/dbconfig/20251111-211634-marostegui.json [21:31:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T407997)', diff saved to https://phabricator.wikimedia.org/P85225 and previous config saved to /var/cache/conftool/dbconfig/20251111-213141-marostegui.json [21:31:46] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:31:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1249.eqiad.wmnet with reason: Maintenance [21:32:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1249 (T407997)', diff saved to https://phabricator.wikimedia.org/P85226 and previous config saved to /var/cache/conftool/dbconfig/20251111-213205-marostegui.json [21:36:55] (03PS1) 10Kosta Harlan: ext.confirmEdit.hCaptcha: Consider action=submit an edit interface 
[extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204088 (https://phabricator.wikimedia.org/T409701) [21:37:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 12 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1204088 (https://phabricator.wikimedia.org/T409701) (owner: 10Kosta Harlan) [21:38:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T407997)', diff saved to https://phabricator.wikimedia.org/P85227 and previous config saved to /var/cache/conftool/dbconfig/20251111-213842-marostegui.json [21:38:48] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:53:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P85228 and previous config saved to /var/cache/conftool/dbconfig/20251111-215350-marostegui.json [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251111T2200) [22:08:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P85229 and previous config saved to /var/cache/conftool/dbconfig/20251111-220858-marostegui.json [22:24:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T407997)', diff saved to https://phabricator.wikimedia.org/P85230 and previous config saved to /var/cache/conftool/dbconfig/20251111-222405-marostegui.json [22:24:10] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [22:24:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 
on db1252.eqiad.wmnet with reason: Maintenance [22:24:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1252 (T407997)', diff saved to https://phabricator.wikimedia.org/P85231 and previous config saved to /var/cache/conftool/dbconfig/20251111-222430-marostegui.json [22:31:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T407997)', diff saved to https://phabricator.wikimedia.org/P85232 and previous config saved to /var/cache/conftool/dbconfig/20251111-223109-marostegui.json [22:31:14] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [22:37:37] !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-eqiad [22:40:18] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:46:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P85233 and previous config saved to /var/cache/conftool/dbconfig/20251111-224616-marostegui.json [22:46:18] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:51:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:55:10] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:01:10] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:01:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db1252', diff saved to https://phabricator.wikimedia.org/P85234 and previous config saved to /var/cache/conftool/dbconfig/20251111-230124-marostegui.json [23:10:18] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:16:18] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:16:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T407997)', diff saved to https://phabricator.wikimedia.org/P85235 and previous config saved to /var/cache/conftool/dbconfig/20251111-231632-marostegui.json [23:16:36] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [23:16:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1260.eqiad.wmnet with reason: Maintenance [23:16:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T407997)', diff saved to https://phabricator.wikimedia.org/P85236 and previous config saved to /var/cache/conftool/dbconfig/20251111-231655-marostegui.json [23:23:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T407997)', diff saved to https://phabricator.wikimedia.org/P85237 and previous config saved to /var/cache/conftool/dbconfig/20251111-232334-marostegui.json [23:23:38] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [23:24:18] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:31:18] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:38:42] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P85238 and previous config saved to /var/cache/conftool/dbconfig/20251111-233842-marostegui.json [23:41:18] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:47:18] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:49:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd-eqiad [23:53:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P85239 and previous config saved to /var/cache/conftool/dbconfig/20251111-235349-marostegui.json