[00:08:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1154527 [00:08:24] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1154527 (owner: 10TrainBranchBot) [00:10:50] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:21:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:22:38] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [00:28:59] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1154527 (owner: 10TrainBranchBot) [00:44:14] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/a2fa85ed735289f379ffd6917bced58b823aefa2c9345e3c6dc97463710b74f4/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [00:57:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [01:04:14] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:55:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:00:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:03:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:05:50] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 56.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:08:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:21:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:22:38] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [04:44:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10894263 (10Marostegui) Thank you! [04:50:03] 10ops-codfw, 06DBA, 06DC-Ops: Broken disk on db2226 - https://phabricator.wikimedia.org/T396319 (10Marostegui) 03NEW [04:50:13] 10ops-codfw, 06DBA, 06DC-Ops: Broken disk on db2226 - https://phabricator.wikimedia.org/T396319#10894276 (10Marostegui) p:05Triage→03Medium [04:53:29] (03PS1) 10Marostegui: mariadb: Productionize db2244 [puppet] - 10https://gerrit.wikimedia.org/r/1154540 (https://phabricator.wikimedia.org/T393989) [04:54:50] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2244 [puppet] - 10https://gerrit.wikimedia.org/r/1154540 (https://phabricator.wikimedia.org/T393989) (owner: 10Marostegui) [04:57:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [05:00:01] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet [05:00:05] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 [05:00:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 [05:01:51] (03PS1) 10Marostegui: instances.yaml: Add db2244 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1154541 (https://phabricator.wikimedia.org/T393989) [05:02:21] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2244 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1154541 (https://phabricator.wikimedia.org/T393989) (owner: 10Marostegui) [05:03:17] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:15:29] ACKNOWLEDGEMENT - Dell PowerEdge RAID / Supermicro Broadcom Controller on db2226 is CRITICAL: communication: 0 OK Marostegui https://phabricator.wikimedia.org/T396319 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [05:22:51] FIRING: CoreRouterInterfaceDown: Core router interface down - 
cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:24:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:24:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2244 to dbctl depooled T393989', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json [05:24:57] T393989: Productionize new x3 hosts - https://phabricator.wikimedia.org/T393989 [05:27:03] (03CR) 10Marostegui: [C:04-1] "Let's add pc* too" [alerts] - 10https://gerrit.wikimedia.org/r/1154314 (owner: 10Federico Ceratto) [05:27:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:29:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:37:27] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning [05:42:47] 06SRE, 10Legalpad, 10Phabricator: Allow aklapper to view/edit L3 - https://phabricator.wikimedia.org/T394966#10894312 (10LSobanski) 
a:03LSobanski [05:42:53] !log Add MariaDB 10.11.13 to the repo T395663 [05:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:56] T395663: MariaDB 10.11.13 released - https://phabricator.wikimedia.org/T395663 [05:44:59] 07sre-alert-triage, 06Traffic: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T396320 (10LSobanski) 03NEW [05:45:26] 07sre-alert-triage, 06Traffic: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T396321 (10LSobanski) 03NEW [05:47:51] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:49:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:53:43] (03PS1) 10Marostegui: production-m1.sql.erb: Add zuul user [puppet] - 10https://gerrit.wikimedia.org/r/1154543 (https://phabricator.wikimedia.org/T394844) [05:55:02] (03CR) 10Marostegui: "This is a noop until the grants are crated on the DB" [puppet] - 10https://gerrit.wikimedia.org/r/1154543 (https://phabricator.wikimedia.org/T394844) (owner: 10Marostegui) [05:58:39] (03PS2) 10Marostegui: production-m1.sql.erb: Add zuul user [puppet] - 10https://gerrit.wikimedia.org/r/1154543 (https://phabricator.wikimedia.org/T394844) [06:00:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:00:24] FIRING: 
[4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:03:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:04:15] (03CR) 10Marostegui: [C:03+2] production-m1.sql.erb: Add zuul user [puppet] - 10https://gerrit.wikimedia.org/r/1154543 (https://phabricator.wikimedia.org/T394844) (owner: 10Marostegui) [06:05:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:18:21] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:20:09] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:22:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 
gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning [06:23:53] (03PS1) 10Marostegui: db2244: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1154544 (https://phabricator.wikimedia.org/T393989) [06:26:57] (03CR) 10Marostegui: [C:03+2] db2244: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1154544 (https://phabricator.wikimedia.org/T393989) (owner: 10Marostegui) [06:43:32] ACKNOWLEDGEMENT - Dell PowerEdge RAID / Supermicro Broadcom Controller on db2226 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T396323 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [06:43:38] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396323 (10ops-monitoring-bot) 03NEW [06:44:27] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Broken disk on db2226 - https://phabricator.wikimedia.org/T396319#10894391 (10Marostegui) 05Open→03Resolved a:03Marostegui It was created finally: T396323 so I will close this one. [06:45:30] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396323#10894399 (10Marostegui) p:05Triage→03Medium Can we get a new disk? [06:47:02] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Raid handler for broadcom disk did't automatically open task on db2226 - https://phabricator.wikimedia.org/T396319#10894402 (10Volans) 05Resolved→03Open [06:47:25] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Raid handler for broadcom disk did't automatically open task on db2226 - https://phabricator.wikimedia.org/T396319#10894408 (10Volans) From IRC logs the handler was triggered fine: ` Jun 07 04:22:14 alert1002 icinga[1502926]: SERVICE EVENT HANDLER: db2226;Del... 
[06:54:24] (03PS3) 10Volans: Automatic reformat: noop change [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152007 [06:58:30] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Raid handler for broadcom disk did't automatically open task on db2226 - https://phabricator.wikimedia.org/T396319#10894411 (10Volans) As a side note `sudo /usr/lib/nagios/plugins/check_nrpe -4 -H db2226 -c get_raid_status_broadcom` returns an exit status of... [06:59:50] (03CR) 10Volans: [C:03+2] Automatic reformat: noop change [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152007 (owner: 10Volans) [07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T0700) [07:00:05] sergi0 and -: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:24] (03CR) 10Volans: [C:03+2] Automatic reformat: move to double quote strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152008 (owner: 10Volans) [07:02:17] RECOVERY - Dell PowerEdge RAID / Supermicro Broadcom Controller on db2226 is OK: REPRO TEST volans https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:03:09] 10SRE-tools, 06Data-Persistence, 06Infrastructure-Foundations: Raid handler for broadcom disk did't automatically open task on db2226 - https://phabricator.wikimedia.org/T396319#10894413 (10Marostegui) [07:04:29] (03Merged) 10jenkins-bot: Automatic reformat: noop change [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152007 (owner: 10Volans) [07:08:01] PROBLEM - Dell PowerEdge RAID / Supermicro Broadcom Controller on db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:08:10] here's the new critical [07:08:18] raid handler triggered [07:18:23] (03PS3) 10Volans: Automatic reformat: move to double 
quote strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152008 [07:18:35] (03PS3) 10Volans: doc: small improvements in the config file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152009 [07:18:44] (03PS3) 10Volans: tests: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152010 [07:18:52] (03PS3) 10Volans: wmflib: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152011 [07:23:02] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning [07:23:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance [07:24:35] (03PS2) 10Fabfur: haproxy: normalize host header [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) [07:24:49] (03CR) 10Fabfur: haproxy: normalize host header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [07:26:17] PROBLEM - Dell PowerEdge RAID - Supermicro Broadcom Controller on db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:26:27] this is me [07:27:14] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [07:28:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [07:30:00] (03PS3) 10Fabfur: haproxy: normalize host header [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) [07:31:12] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [07:31:47] PROBLEM - Dell PowerEdge RAID / Supermicro Broadcom Controller on 
db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:32:45] ACKNOWLEDGEMENT - Dell PowerEdge RAID / Supermicro Broadcom Controller on db2226 is CRITICAL: communication: 0 OK Volans Broken disk - T396323 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:33:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [07:34:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T396130)', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json [07:34:07] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:41:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T396130)', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json [07:41:16] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:46:58] (03CR) 10Volans: Automatic reformat: move to double quote strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152008 (owner: 10Volans) [07:47:08] (03CR) 10Volans: [C:03+2] Automatic reformat: move to double quote strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152008 (owner: 10Volans) [07:47:22] (03CR) 10Volans: [C:03+2] doc: small improvements in the config file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152009 (owner: 10Volans) [07:47:30] (03CR) 10Volans: [C:03+2] tests: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152010 (owner: 10Volans) [07:47:42] (03CR) 10Volans: [C:03+2] wmflib: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152011 (owner: 
10Volans) [07:49:53] (03PS2) 10Federico Ceratto: team-data-persistence: Add predictive disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1154314 [07:50:24] (03CR) 10Federico Ceratto: "Added pc*" [alerts] - 10https://gerrit.wikimedia.org/r/1154314 (owner: 10Federico Ceratto) [07:51:43] (03Merged) 10jenkins-bot: Automatic reformat: move to double quote strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152008 (owner: 10Volans) [07:51:50] (03Merged) 10jenkins-bot: doc: small improvements in the config file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152009 (owner: 10Volans) [07:51:59] (03Merged) 10jenkins-bot: tests: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152010 (owner: 10Volans) [07:52:27] (03Merged) 10jenkins-bot: wmflib: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152011 (owner: 10Volans) [07:52:29] (03PS3) 10Volans: dns: alias DnsNotFound to DnsNotFoundError [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152012 [07:52:34] (03PS3) 10Volans: config: make the raises argument keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152013 [07:52:45] (03PS3) 10Volans: phabricator: make secondary arguments keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152014 [07:52:51] (03PS3) 10Volans: Automatic reformat: reorder imports [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152015 [07:52:58] (03PS3) 10Volans: tox: completely refactor static checkers/linters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152016 [07:56:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json [08:08:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning 
[08:08:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet [08:11:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json [08:15:27] (03PS1) 10Tiziano Fogli: prometheus/magru: remove 7001, add 7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154762 (https://phabricator.wikimedia.org/T395130) [08:16:29] jouncebot: nowandnext [08:16:29] No deployments scheduled for the next 1 hour(s) and 43 minute(s) [08:16:29] In 1 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1000) [08:16:54] (03CR) 10Urbanecm: [C:04-1] [beta] GrowthExperiments: enable limiting add a link task via config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) (owner: 10Sergio Gimeno) [08:19:14] (03CR) 10Volans: [C:03+2] dns: alias DnsNotFound to DnsNotFoundError [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152012 (owner: 10Volans) [08:19:29] (03CR) 10Volans: [C:03+2] config: make the raises argument keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152013 (owner: 10Volans) [08:19:46] (03CR) 10Volans: [C:03+2] phabricator: make secondary arguments keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152014 (owner: 10Volans) [08:20:08] (03CR) 10Volans: [C:03+2] Automatic reformat: reorder imports [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152015 (owner: 10Volans) [08:20:34] (03CR) 10Volans: [C:03+2] tox: completely refactor static checkers/linters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152016 (owner: 10Volans) [08:21:32] (03PS1) 10Tiziano Fogli: prometheus/magru: update svc record [dns] - 
10https://gerrit.wikimedia.org/r/1154764 (https://phabricator.wikimedia.org/T359130) [08:21:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:22:38] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [08:24:18] (03CR) 10Vgutierrez: [C:03+1] haproxy: normalize host header [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [08:24:19] (03Merged) 10jenkins-bot: dns: alias DnsNotFound to DnsNotFoundError [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152012 (owner: 10Volans) [08:24:19] (03Merged) 10jenkins-bot: config: make the raises argument keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152013 (owner: 10Volans) [08:24:19] (03Merged) 10jenkins-bot: phabricator: make secondary arguments keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152014 (owner: 10Volans) [08:24:59] (03Merged) 10jenkins-bot: Automatic reformat: reorder imports [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152015 (owner: 10Volans) [08:26:28] (03Merged) 10jenkins-bot: tox: completely refactor static checkers/linters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152016 (owner: 10Volans) [08:26:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T396130)', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json [08:26:44] T396130: Add afl_ip_hex column and 
afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:26:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [08:26:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T396130)', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json [08:27:10] (03PS1) 10Volans: tox: add style checker and formatter environments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154766 [08:29:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T396130)', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json [08:38:29] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus/magru: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1154764 (https://phabricator.wikimedia.org/T359130) (owner: 10Tiziano Fogli) [08:39:45] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus/magru: remove 7001, add 7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154762 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [08:40:06] (03PS2) 10Tiziano Fogli: prometheus/magru: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1154764 (https://phabricator.wikimedia.org/T395130) [08:44:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json [08:45:16] (03PS1) 10Vgutierrez: Revert^6 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154768 [08:47:22] (03PS2) 10Vgutierrez: Revert^6 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154768 
(https://phabricator.wikimedia.org/T395228) [08:57:38] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [08:59:23] (03CR) 10Fabfur: [C:03+2] haproxy: normalize host header [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [09:00:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json [09:03:17] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:13:10] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/magru: remove 7001, add 7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154762 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [09:15:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T396130)', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json [09:15:12] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:15:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance [09:15:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T396130)', diff saved to 
https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json [09:16:00] (03PS1) 10Tiziano Fogli: Revert "prometheus::pop: temporarily exclude ops instance from prometheus7002" [puppet] - 10https://gerrit.wikimedia.org/r/1154771 [09:16:53] (03CR) 10Tiziano Fogli: [C:03+2] Revert "prometheus::pop: temporarily exclude ops instance from prometheus7002" [puppet] - 10https://gerrit.wikimedia.org/r/1154771 (owner: 10Tiziano Fogli) [09:18:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T396130)', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json [09:20:36] !log upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - T395228 [09:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:39] T395228: Test katran forwarding plane on lvs1013 - https://phabricator.wikimedia.org/T395228 [09:22:23] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154768 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [09:22:45] /16 [09:22:48] err :) [09:22:59] elukey: E_TOO_MANY_WINDOWS [09:23:52] FIRING: SLOMetricAbsent: wdqs-availability magru - https://slo.wikimedia.org/?search=wdqs-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:27:09] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/magru: update svc record [dns] - 10https://gerrit.wikimedia.org/r/1154764 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [09:27:11] PROBLEM - Broadcom Controller on db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [09:27:38] RESOLVED: SLOMetricAbsent: wdqs-availability magru - https://slo.wikimedia.org/?search=wdqs-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:27:51] !log tappof@dns1004 START - 
running authdns-update [09:28:36] !log tappof@dns1004 END - running authdns-update [09:30:15] (CR) Vgutierrez: [C:+2] Revert^6 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154768 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [09:31:17] !log depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - T395228 [09:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:20] T395228: Test katran forwarding plane on lvs1013 - https://phabricator.wikimedia.org/T395228 [09:32:04] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:32:38] FIRING: SLOMetricAbsent: wdqs-availability magru - https://slo.wikimedia.org/?search=wdqs-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:33:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json [09:34:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:34:57] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [09:35:14] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [09:35:30] FIRING: 
LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:35:55] really? I just triggered the config reload :D [09:36:56] SRE-tools, Data-Persistence, Infrastructure-Foundations, Observability-Alerting: Raid handler for broadcom disk didn't automatically open task on db2226 - https://phabricator.wikimedia.org/T396319#10894787 (elukey) I've tried to disable puppet on alert1002, and manually change the service_descrip... [09:37:04] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:37:38] RESOLVED: SLOMetricAbsent: wdqs-availability magru - https://slo.wikimedia.org/?search=wdqs-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [09:38:08] (PS1) Vgutierrez: Revert^6 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154774 (https://phabricator.wikimedia.org/T395228) [09:38:37] (CR) CI reject: [V:-1] Revert^6 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154774 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [09:38:40] (PS2) Vgutierrez: Revert^6 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154774 (https://phabricator.wikimedia.org/T395228) [09:39:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:39:28] (CR) 
Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1154774 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [09:40:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:42:47] (PS1) Marostegui: mariadb: Migrate s2 eqiad to SBR [puppet] - https://gerrit.wikimedia.org/r/1154775 (https://phabricator.wikimedia.org/T383795) [09:42:55] !log Migrate s2 eqiad dbmaint to SBR T383795 [09:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:58] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795 [09:43:05] (CR) Volans: "I don't think you will need this. There is a patch in the pipeline for wmflib that should cover your use case." 
[cookbooks] - https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) (owner: Federico Ceratto) [09:44:13] (CR) Marostegui: "This is a NOOP until it is executed lively in the DBs" [puppet] - https://gerrit.wikimedia.org/r/1154775 (https://phabricator.wikimedia.org/T383795) (owner: Marostegui) [09:44:14] (CR) Marostegui: [C:+2] mariadb: Migrate s2 eqiad to SBR [puppet] - https://gerrit.wikimedia.org/r/1154775 (https://phabricator.wikimedia.org/T383795) (owner: Marostegui) [09:44:43] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1154774 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [09:46:37] (CR) Marostegui: [C:+1] team-data-persistence: Add predictive disk space alerts [alerts] - https://gerrit.wikimedia.org/r/1154314 (owner: Federico Ceratto) [09:47:58] (CR) Vgutierrez: [C:+2] Revert^6 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154774 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [09:48:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json [09:51:23] (PS1) Volans: git: add .git-blame-ignore-revs [software/pywmflib] - https://gerrit.wikimedia.org/r/1154776 [09:53:27] (PS1) Tiziano Fogli: prometheus/magru: remove 7001 from prometheus_nodes [puppet] - https://gerrit.wikimedia.org/r/1154777 (https://phabricator.wikimedia.org/T395130) [09:53:34] PROBLEM - Broadcom Controller elukey test 1 on db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [09:58:19] PROBLEM - Broadcom Controller elukey test 2 on db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1000) [10:03:09] (CR) Federico Ceratto: [C:+2] team-data-persistence: Add predictive disk space alerts [alerts] - https://gerrit.wikimedia.org/r/1154314 (owner: Federico Ceratto) [10:03:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T396130)', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json [10:03:42] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:03:54] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [10:04:23] (Merged) jenkins-bot: team-data-persistence: Add predictive disk space alerts [alerts] - https://gerrit.wikimedia.org/r/1154314 (owner: Federico Ceratto) [10:05:43] (PS1) Marostegui: db2214: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1154778 (https://phabricator.wikimedia.org/T395989) [10:06:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2214 T395989', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json [10:06:09] T395989: Migrate s6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T395989 [10:06:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance [10:07:34] (CR) Marostegui: [C:+2] db2214: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1154778 (https://phabricator.wikimedia.org/T395989) (owner: Marostegui) [10:08:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance [10:12:45] !log fceratto@cumin1002 END 
(FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet [10:12:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json [10:14:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json [10:15:46] (PS2) Hnowlan: alertmanager: adjust phab project to security-team rather than security tag [puppet] - https://gerrit.wikimedia.org/r/1150624 (https://phabricator.wikimedia.org/T388531) [10:15:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T396130)', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json [10:16:02] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:16:48] !log repooling lvs1013 handling ncredir@eqiad using katran based load balancing - T395228 [10:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:51] T395228: Test katran forwarding plane on lvs1013 - https://phabricator.wikimedia.org/T395228 [10:17:12] (PS1) Vgutierrez: Revert^7 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154779 [10:17:54] (PS2) Vgutierrez: Revert^7 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154779 (https://phabricator.wikimedia.org/T395228) [10:18:06] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1154779 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [10:18:23] (CR) Hnowlan: [C:+2] alertmanager: adjust phab project to security-team 
rather than security tag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1150624 (https://phabricator.wikimedia.org/T388531) (owner: Hnowlan) [10:18:37] (CR) Hnowlan: [C:+2] alertmanager: adjust phab project to security-team rather than security tag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1150624 (https://phabricator.wikimedia.org/T388531) (owner: Hnowlan) [10:19:30] (CR) Vgutierrez: [C:+2] Revert^7 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154779 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [10:19:50] (CR) Vgutierrez: [C:-2] Revert^7 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154779 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [10:19:58] (Abandoned) Vgutierrez: Revert^7 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154779 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [10:20:21] (PS1) Vgutierrez: Revert^7 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154781 [10:20:33] (PS2) Vgutierrez: Revert^7 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154781 (https://phabricator.wikimedia.org/T395228) [10:20:35] (CR) CI reject: [V:-1] Revert^7 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154781 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [10:20:47] (CR) CI reject: [V:-1] Revert^7 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154781 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [10:21:53] (PS3) Vgutierrez: Revert^7 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154781 (https://phabricator.wikimedia.org/T395228) [10:22:07] !log fceratto@cumin1002 DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:22:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1160 (T395241)', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json [10:25:23] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1154781 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [10:26:57] (PS1) Filippo Giunchedi: thanos: add tracing define [puppet] - https://gerrit.wikimedia.org/r/1154782 (https://phabricator.wikimedia.org/T394318) [10:27:04] (CR) Vgutierrez: [C:+2] Revert^7 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154781 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez) [10:29:52] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [10:30:09] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [10:31:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json [10:31:14] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet [10:31:15] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet [10:31:40] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet [10:31:40] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet [10:33:14] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in [10:33:48] !log 
hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:33:57] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:34:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T395241)', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json [10:34:24] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in [10:42:48] PROBLEM - Broadcom Controller elukey test 3 on db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [10:46:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json [10:47:03] PROBLEM - Broadcom Controller elukey test 4 on db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [10:47:04] ACKNOWLEDGEMENT - Broadcom Controller elukey test 4 on db2226 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T396340 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [10:47:09] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396340 (ops-monitoring-bot) NEW [10:47:41] (CR) Hnowlan: [C:+2] (api|rest)-gateway: remove envoyproxy annotation, scrape all ports [deployment-charts] - https://gerrit.wikimedia.org/r/1154298 (owner: Hnowlan) [10:48:17] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396323#10894973 (Marostegui) [10:48:18] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396340#10894975 (Marostegui) →Duplicate dup:T396323 [10:49:12] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json [10:49:33] (Merged) jenkins-bot: (api|rest)-gateway: remove envoyproxy annotation, scrape all ports [deployment-charts] - https://gerrit.wikimedia.org/r/1154298 (owner: Hnowlan) [10:53:14] (PS1) Elukey: raid::broadcom: remove "/" from description [puppet] - https://gerrit.wikimedia.org/r/1154783 (https://phabricator.wikimedia.org/T395688) [10:54:14] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:54:18] SRE-swift-storage, Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10894979 (Aklapper) Open→Resolved Assuming this specific issue got fixed by (non-public) T396185. [10:54:21] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:54:52] (PS1) Filippo Giunchedi: hieradata: set default otel-gateway [puppet] - https://gerrit.wikimedia.org/r/1154784 (https://phabricator.wikimedia.org/T394318) [10:54:53] (PS1) Filippo Giunchedi: thanos-sidecar: enable tracing [puppet] - https://gerrit.wikimedia.org/r/1154785 (https://phabricator.wikimedia.org/T394318) [10:55:13] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1154783 (https://phabricator.wikimedia.org/T395688) (owner: Elukey) [10:55:19] (CR) Elukey: [C:+2] raid::broadcom: remove "/" from description [puppet] - https://gerrit.wikimedia.org/r/1154783 (https://phabricator.wikimedia.org/T395688) (owner: Elukey) [10:57:47] (PS3) Sergio Gimeno: [beta] GrowthExperiments: enable limiting add a link task via config [mediawiki-config] - https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) [11:01:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db2214 (T396130)', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json [11:01:23] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:01:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance [11:01:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T396130)', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json [11:04:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json [11:06:00] PROBLEM - Disk space on an-worker1093 is CRITICAL: DISK CRITICAL - free space: / 2116 MB (3% inode=95%): /tmp 2116 MB (3% inode=95%): /var/tmp 2116 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [11:06:36] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on db2226 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:06:37] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on db2226 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T396341 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [11:06:46] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396341 (ops-monitoring-bot) NEW [11:08:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T396130)', diff saved 
to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json [11:08:10] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:09:51] SRE-tools, Data-Persistence, Infrastructure-Foundations, Observability-Alerting: Raid handler for broadcom disk didn't automatically open task on db2226 - https://phabricator.wikimedia.org/T396319#10895014 (elukey) Open→Resolved To keep archives happy, cross posting from the parent ta... [11:10:52] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396323#10895020 (Marostegui) [11:10:54] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396341#10895022 (Marostegui) →Duplicate dup:T396323 [11:18:40] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in [11:19:02] (PS1) Volans: phabricator: expand support for Phabricator tasks [software/pywmflib] - https://gerrit.wikimedia.org/r/1154786 [11:19:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T395241)', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json [11:19:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [11:19:50] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in [11:19:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T395241)', diff saved to https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json [11:20:02] (CR) Volans: phabricator: expand support for 
Phabricator tasks (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/1154786 (owner: Volans) [11:23:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77253 and previous config saved to /var/cache/conftool/dbconfig/20250609-112314-marostegui.json [11:23:25] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [11:31:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T395241)', diff saved to https://phabricator.wikimedia.org/P77254 and previous config saved to /var/cache/conftool/dbconfig/20250609-113113-fceratto.json [11:38:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77255 and previous config saved to /var/cache/conftool/dbconfig/20250609-113821-marostegui.json [11:39:54] (PS1) Hnowlan: (api|rest)-gateway: define containerPort for telemetry [deployment-charts] - https://gerrit.wikimedia.org/r/1154789 [11:41:24] (CR) CI reject: [V:-1] (api|rest)-gateway: define containerPort for telemetry [deployment-charts] - https://gerrit.wikimedia.org/r/1154789 (owner: Hnowlan) [11:42:58] (PS2) Hnowlan: (api|rest)-gateway: define containerPort for telemetry [deployment-charts] - https://gerrit.wikimedia.org/r/1154789 [11:44:31] (CR) CI reject: [V:-1] (api|rest)-gateway: define containerPort for telemetry [deployment-charts] - https://gerrit.wikimedia.org/r/1154789 (owner: Hnowlan) [11:46:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P77256 and 
previous config saved to /var/cache/conftool/dbconfig/20250609-114622-fceratto.json [11:53:22] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:53:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T396130)', diff saved to https://phabricator.wikimedia.org/P77257 and previous config saved to /var/cache/conftool/dbconfig/20250609-115328-marostegui.json [11:53:31] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:53:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance [11:53:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T396130)', diff saved to https://phabricator.wikimedia.org/P77258 and previous config saved to /var/cache/conftool/dbconfig/20250609-115350-marostegui.json [11:55:27] (CR) Sergio Gimeno: [beta] GrowthExperiments: enable limiting add a link task via config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) (owner: Sergio Gimeno) [11:58:59] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[11:59:43] (PS3) Hnowlan: (api|rest)-gateway: define containerPort for telemetry [deployment-charts] - https://gerrit.wikimedia.org/r/1154789 [12:00:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T396130)', diff saved to https://phabricator.wikimedia.org/P77260 and previous config saved to /var/cache/conftool/dbconfig/20250609-120013-marostegui.json [12:00:17] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:00:21] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:01:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P77261 and previous config saved to /var/cache/conftool/dbconfig/20250609-120129-fceratto.json [12:08:33] PROBLEM - Hadoop NodeManager on an-worker1195 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:10:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:58] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:13:40] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [12:13:55] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[12:14:47] (PS2) Andrew Bogott: wmcs instance backups: adjust scheduling of purge_vm_backup [puppet] - https://gerrit.wikimedia.org/r/1154525 (https://phabricator.wikimedia.org/T394618) [12:14:48] (PS1) Andrew Bogott: Put cloudcontrol2010-dev into service [puppet] - https://gerrit.wikimedia.org/r/1154793 (https://phabricator.wikimedia.org/T396064) [12:15:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:15:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P77262 and previous config saved to /var/cache/conftool/dbconfig/20250609-121520-marostegui.json [12:15:24] mmhh thanos doesn't look super happy atm, I'm taking a look [12:16:14] (CR) Andrew Bogott: [C:+2] Put cloudcontrol2010-dev into service [puppet] - https://gerrit.wikimedia.org/r/1154793 (https://phabricator.wikimedia.org/T396064) (owner: Andrew Bogott) [12:16:31] Yes I'm also seeing errors in Grafana and no data [12:16:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T395241)', diff saved to https://phabricator.wikimedia.org/P77263 and previous config saved to /var/cache/conftool/dbconfig/20250609-121636-fceratto.json [12:16:43] !log bounce thanos-store on titan1* [12:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [12:17:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T395241)', diff saved to https://phabricator.wikimedia.org/P77264 and previous config saved to /var/cache/conftool/dbconfig/20250609-121700-fceratto.json [12:17:29] jelto: yep we're back 
[12:17:39] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new private IP for cloudcontrol2010-dev - andrew@cumin1002" [12:17:45] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new private IP for cloudcontrol2010-dev - andrew@cumin1002" [12:17:45] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:17:56] yes thanks, grafana looks good again ! [12:17:59] (CR) Andrew Bogott: [C:+2] wmcs instance backups: adjust scheduling of purge_vm_backup [puppet] - https://gerrit.wikimedia.org/r/1154525 (https://phabricator.wikimedia.org/T394618) (owner: Andrew Bogott) [12:18:08] np [12:19:55] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [12:20:48] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:21:38] (PS1) Filippo Giunchedi: thanos: set gomemlimit for sidecar [puppet] - https://gerrit.wikimedia.org/r/1154798 (https://phabricator.wikimedia.org/T394318) [12:21:39] (PS1) Filippo Giunchedi: thanos: set memorymax for thanos-sidecar instances [puppet] - https://gerrit.wikimedia.org/r/1154799 (https://phabricator.wikimedia.org/T394318) [12:21:54] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1154800 [12:23:25] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [12:23:40] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new private IP for cloudcontrol2010-dev - andrew@cumin1002" [12:23:46] !log andrew@cumin1002 END (PASS) 
- Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new private IP for cloudcontrol2010-dev - andrew@cumin1002" [12:23:46] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:24:00] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:24:03] (03PS2) 10FNegri: clouddb1015: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148297 (https://phabricator.wikimedia.org/T394372) [12:24:04] (03PS1) 10FNegri: clouddb1019: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154803 (https://phabricator.wikimedia.org/T394372) [12:24:05] (03PS1) 10FNegri: clouddb1013: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154804 (https://phabricator.wikimedia.org/T394372) [12:24:06] (03PS1) 10FNegri: clouddb1017: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154805 (https://phabricator.wikimedia.org/T394372) [12:24:09] (03PS1) 10FNegri: clouddb1014: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154806 (https://phabricator.wikimedia.org/T394372) [12:24:10] (03PS1) 10FNegri: clouddb1018: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154807 (https://phabricator.wikimedia.org/T394372) [12:24:12] (03PS1) 10FNegri: clouddb1016: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154808 (https://phabricator.wikimedia.org/T394372) [12:24:16] (03PS1) 10FNegri: clouddb1020: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154809 (https://phabricator.wikimedia.org/T394372) [12:24:33] RECOVERY - Hadoop NodeManager on an-worker1195 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:24:36] (03CR) 10Filippo 
Giunchedi: [C:03+2] alertmanager: route network devices alerts to fr [puppet] - 10https://gerrit.wikimedia.org/r/1145169 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [12:26:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T395241)', diff saved to https://phabricator.wikimedia.org/P77265 and previous config saved to /var/cache/conftool/dbconfig/20250609-122644-fceratto.json [12:27:37] 06SRE, 06cloud-services-team, 10Pywikibot, 10Toolforge: pywikibot.org landing page is not updated - https://phabricator.wikimedia.org/T396338#10895144 (10Xqt) [12:29:16] 06SRE, 06cloud-services-team, 10Pywikibot, 10Toolforge: pywikibot.org landing page is not updated - https://phabricator.wikimedia.org/T396338#10895148 (10Xqt) [12:29:41] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:30:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P77266 and previous config saved to /var/cache/conftool/dbconfig/20250609-123027-marostegui.json [12:41:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P77267 and previous config saved to /var/cache/conftool/dbconfig/20250609-124152-fceratto.json [12:45:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T396130)', diff saved to https://phabricator.wikimedia.org/P77268 and previous config saved to /var/cache/conftool/dbconfig/20250609-124534-marostegui.json [12:45:38] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:50:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2229.codfw.wmnet with reason: Maintenance [12:55:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' 
command on namespace 'zarcillo' for release 'main' . [12:57:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P77269 and previous config saved to /var/cache/conftool/dbconfig/20250609-125659-fceratto.json [12:58:05] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns1004*} and (A:dnsbox) [12:58:05] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1004.wikimedia.org [12:58:14] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns2005*} and (A:dnsbox) [12:58:14] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2005.wikimedia.org [12:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [12:59:12] XioNoX, topranks ^^ [12:59:14] (03CR) 10Marostegui: [C:03+1] clouddb1015: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148297 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [12:59:33] (03CR) 10Marostegui: [C:03+1] clouddb1019: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154803 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [12:59:43] (03CR) 10Marostegui: [C:03+1] clouddb1013: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154804 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [12:59:52] vgutierrez: thanks for the heads up, will take a look [12:59:54] (03CR) 10Marostegui: [C:03+1] clouddb1017: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154805 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [13:00:03] (03CR) 10Marostegui: [C:03+1] clouddb1014: upgrade to MariaDB 10.11 [puppet] - 
10https://gerrit.wikimedia.org/r/1154806 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1300). [13:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] (03CR) 10Marostegui: [C:03+1] clouddb1018: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154807 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [13:00:22] (03CR) 10Marostegui: [C:03+1] clouddb1016: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154808 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [13:00:31] (03CR) 10Marostegui: [C:03+1] clouddb1020: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154809 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [13:00:51] * TheresNoTime can't deploy this afternoon [13:02:03] I can deploy, but Bartosz doesn't seem to be around [13:02:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:02:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:02:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T396130)', diff saved to https://phabricator.wikimedia.org/P77270 and previous config saved to /var/cache/conftool/dbconfig/20250609-130230-marostegui.json [13:02:33] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 
[13:02:35] hi [13:02:42] sorry i'm late :) [13:02:44] jouncebot: now [13:02:44] For the next 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1300) [13:02:45] * taavi hopes that ping didn't go to the wrong person [13:03:03] heyo. looking at your patches [13:03:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.74 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:03:34] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:03:42] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:05:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:05:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: 10Gergő Tisza) [13:05:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) (owner: 10Gergő Tisza) [13:05:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T396130)', diff saved to https://phabricator.wikimedia.org/P77271 and previous config saved to /var/cache/conftool/dbconfig/20250609-130521-marostegui.json [13:05:56] there's nothing specific to test on mwdebug, the effect should be obvious in logstash in a few minutes after deployment though [13:06:35] (03Merged) 10jenkins-bot: logging: Allow sampling of Logstash logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153363 (https://phabricator.wikimedia.org/T395967) (owner: 10Gergő Tisza) [13:06:37] (03Merged) 10jenkins-bot: logging: Sample some high-volume log streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153364 (https://phabricator.wikimedia.org/T394402) (owner: 10Gergő Tisza) [13:06:43] the 'session-sampled' channel doesn't exist yet, it's being added in a change pending review [13:06:59] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]] [13:07:03] T395967: Allow sampling of Logstash events - https://phabricator.wikimedia.org/T395967 [13:07:04] T394402: Reduce noisy auth logs - 
https://phabricator.wikimedia.org/T394402 [13:07:06] but i wanted this config to be live before merging it [13:08:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.74 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:08:26] ^ expected [13:08:29] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:16] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [13:11:16] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2005.wikimedia.org [13:11:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns2005*} and (A:dnsbox) [13:11:27] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1004.wikimedia.org [13:11:27] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns1004*} and (A:dnsbox) [13:11:36] (03PS1) 10Vgutierrez: Revert^2 "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1154815 (https://phabricator.wikimedia.org/T388809) [13:11:53] (03PS2) 10Vgutierrez: Revert^2 "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1154815 (https://phabricator.wikimedia.org/T388809) [13:12:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T395241)', diff saved to https://phabricator.wikimedia.org/P77272 and previous config saved to /var/cache/conftool/dbconfig/20250609-131206-fceratto.json [13:12:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [13:12:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 
0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:12:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T395241)', diff saved to https://phabricator.wikimedia.org/P77273 and previous config saved to /var/cache/conftool/dbconfig/20250609-131238-fceratto.json [13:12:46] !log sukhe@dns1004 START - running authdns-update [13:13:23] !log sukhe@dns1004 FAIL - running authdns-update [13:13:44] E003|MISSING_OR_WRONG_PTR_FOR_NAME_AND_IP: Missing PTR '1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.4.0.0.1.a.0.8.c.e.2.0.a.2.ip6.arpa.' for name 'openstack.codfw1dev.wikimediacloud.org.' and IP '2a02:ec80:a100:4000::1', PTRs are: [13:14:11] taavi: ^ [13:14:21] https://netbox.wikimedia.org/extras/changelog/227522/ [13:14:22] sukhe: yeah I'm working on it already [13:14:24] thanks [13:14:33] https://gerrit.wikimedia.org/r/c/operations/dns/+/1139033 is what needs to go in as well [13:14:39] looking [13:14:41] (03PS2) 10Cathal Mooney: Add include statement for WMCS service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1139033 (https://phabricator.wikimedia.org/T379282) [13:15:23] i figured that the netbox change needed to go in first to create the file, didn't know that deploying the changes will fail until that's in as well [13:15:33] (03CR) 10Ssingh: Add include statement for WMCS service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1139033 (https://phabricator.wikimedia.org/T379282) (owner: 10Cathal Mooney) [13:16:10] (03CR) 10Majavah: [C:03+2] Add include statement for WMCS service VIP reverse IPv6 [dns] - 10https://gerrit.wikimedia.org/r/1139033 (https://phabricator.wikimedia.org/T379282) (owner: 10Cathal Mooney) [13:16:25] taavi@cumin1002 netbox (PID 3295718) is awaiting input [13:16:28] !log taavi@dns1004 START - running authdns-update [13:17:14] !log taavi@dns1004 END - running authdns-update [13:17:39] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add AAAA record for openstack.codfw1dev.wikimediacloud.org - taavi@cumin1002" [13:17:42] sukhe: fixed, I think [13:17:45] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add AAAA record for openstack.codfw1dev.wikimediacloud.org - taavi@cumin1002" [13:17:45] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:18:33] FIRING: [4x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:52] taavi: thanks, the above successful run indicates that it was fixed. [13:19:42] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns3004*} and (A:dnsbox) [13:19:43] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3004.wikimedia.org [13:19:44] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns4004*} and (A:dnsbox) [13:19:44] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns4004.wikimedia.org [13:19:56] (03PS1) 10Vgutierrez: liberica: Install xdp-tools on liberica LBs [puppet] - 10https://gerrit.wikimedia.org/r/1154816 (https://phabricator.wikimedia.org/T395228) [13:20:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P77274 and previous config saved to /var/cache/conftool/dbconfig/20250609-132028-marostegui.json [13:20:38] (03CR) 10Ssingh: [C:03+1] liberica: Install xdp-tools on liberica LBs [puppet] - 10https://gerrit.wikimedia.org/r/1154816 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [13:21:12] !log taavi@deploy1003 taavi, tgr: Backport for [[gerrit:1153363|logging: 
Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:21:16] T395967: Allow sampling of Logstash events - https://phabricator.wikimedia.org/T395967 [13:21:17] T394402: Reduce noisy auth logs - https://phabricator.wikimedia.org/T394402 [13:21:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T395241)', diff saved to https://phabricator.wikimedia.org/P77275 and previous config saved to /var/cache/conftool/dbconfig/20250609-132136-fceratto.json [13:22:01] (03PS1) 10Marostegui: repl_prepare_schema.sh: Add new trigger [software] - 10https://gerrit.wikimedia.org/r/1154817 (https://phabricator.wikimedia.org/T396130) [13:22:04] !log taavi@deploy1003 taavi, tgr: Continuing with sync [13:23:31] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:23:38] ^ expected, DNS host reboot [13:25:09] (03PS1) 10Tiziano Fogli: Revert "prometheus::pop: enable rsyncd on magru" [puppet] - 10https://gerrit.wikimedia.org/r/1154818 [13:25:10] FIRING: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:26:42] (03CR) 10Ssingh: [C:03+1] Revert^2 "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1154815 (https://phabricator.wikimedia.org/T388809) (owner: 10Vgutierrez) [13:27:19] (03Abandoned) 10Marostegui: repl_prepare_schema.sh: Add new trigger [software] - 10https://gerrit.wikimedia.org/r/1154817 (https://phabricator.wikimedia.org/T396130) (owner: 10Marostegui) [13:28:04] (03PS3) 10Vgutierrez: Revert^2 "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1154815 (https://phabricator.wikimedia.org/T388809) [13:28:31] RECOVERY - BFD 
status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:28:37] (03PS1) 10Marostegui: filtered_tables.txt: Filtered column [puppet] - 10https://gerrit.wikimedia.org/r/1154819 (https://phabricator.wikimedia.org/T396130) [13:29:22] (03CR) 10Vgutierrez: [C:03+2] Revert^2 "Add pywikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1154815 (https://phabricator.wikimedia.org/T388809) (owner: 10Vgutierrez) [13:30:10] RESOLVED: [2x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:30:11] !log vgutierrez@dns1004 START - running authdns-update [13:30:44] is it still syncing? [13:30:58] !log vgutierrez@dns1004 END - running authdns-update [13:31:29] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1153363|logging: Allow sampling of Logstash logs (T395967)]], [[gerrit:1153364|logging: Sample some high-volume log streams (T394402)]] (duration: 24m 30s) [13:31:34] T395967: Allow sampling of Logstash events - https://phabricator.wikimedia.org/T395967 [13:31:34] T394402: Reduce noisy auth logs - https://phabricator.wikimedia.org/T394402 [13:31:41] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154816 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [13:31:50] aha. thanks for deploying! 
[13:31:57] (03CR) 10JHathaway: [C:03+1] ssh: Stop managing /run/sshd with Trixie and later [puppet] - 10https://gerrit.wikimedia.org/r/1154261 (owner: 10Muehlenhoff) [13:32:37] MatmaRex: not anymore :) [13:33:55] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns4004.wikimedia.org [13:33:55] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns4004*} and (A:dnsbox) [13:34:17] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3004.wikimedia.org [13:34:17] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns3004*} and (A:dnsbox) [13:34:33] !log sukhe@dns1004 START - running authdns-update [13:35:14] !log sukhe@dns1004 END - running authdns-update [13:35:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P77276 and previous config saved to /var/cache/conftool/dbconfig/20250609-133535-marostegui.json [13:35:42] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns5004*} and (A:dnsbox) [13:35:42] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5004.wikimedia.org [13:35:43] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns6002*} and (A:dnsbox) [13:35:44] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6002.wikimedia.org [13:36:21] (03CR) 10Tiziano Fogli: [C:03+2] Revert "prometheus::pop: enable rsyncd on magru" [puppet] - 10https://gerrit.wikimedia.org/r/1154818 (owner: 10Tiziano Fogli) [13:36:30] (03CR) 10Hnowlan: [C:03+2] (api|rest)-gateway: define containerPort for telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154789 (owner: 10Hnowlan) [13:36:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P77277 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-133643-fceratto.json [13:38:25] (03Merged) 10jenkins-bot: (api|rest)-gateway: define containerPort for telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154789 (owner: 10Hnowlan) [13:38:29] FIRING: JobUnavailable: Reduced availability for job thanos-sidecar in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:31] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:39:50] ^ expected. [13:42:50] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:42:57] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:43:31] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:45:00] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6002.wikimedia.org [13:45:00] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns6002*} and (A:dnsbox) [13:45:13] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5004.wikimedia.org [13:45:14] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns5004*} and (A:dnsbox) [13:45:40] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:47:03] (03CR) 10Tiziano Fogli: [C:03+1] thanos: set gomemlimit for sidecar [puppet] - 10https://gerrit.wikimedia.org/r/1154798 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [13:47:11] (03CR) 10Tiziano Fogli: [C:03+1] thanos: set memorymax 
for thanos-sidecar instances [puppet] - 10https://gerrit.wikimedia.org/r/1154799 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [13:50:40] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:50:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T396130)', diff saved to https://phabricator.wikimedia.org/P77278 and previous config saved to /var/cache/conftool/dbconfig/20250609-135043-marostegui.json [13:50:47] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:50:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:51:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T396130)', diff saved to https://phabricator.wikimedia.org/P77279 and previous config saved to /var/cache/conftool/dbconfig/20250609-135105-marostegui.json [13:51:32] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet [13:51:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P77280 and previous config saved to /var/cache/conftool/dbconfig/20250609-135150-fceratto.json [13:52:15] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrading clouddbs T394372 [13:52:32] T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372 [13:53:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T396130)', diff saved to https://phabricator.wikimedia.org/P77281 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-135355-marostegui.json [13:55:17] !log sukhe@dns1004 START - running authdns-update [13:55:40] (03CR) 10FNegri: [C:03+2] clouddb1015: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148297 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [13:55:59] !log sukhe@dns1004 END - running authdns-update [13:58:17] (03PS2) 10Tiziano Fogli: prometheus/magru: remove 7001 from prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1154777 (https://phabricator.wikimedia.org/T395130) [13:59:54] (03PS1) 10Hnowlan: (api|rest)-gateway: force prometheus metrics on /metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154822 [14:01:24] jouncebot: now and next [14:01:24] No deployments scheduled for the next 1 hour(s) and 28 minute(s) [14:02:50] (03CR) 10Vgutierrez: [C:03+2] liberica: Install xdp-tools on liberica LBs [puppet] - 10https://gerrit.wikimedia.org/r/1154816 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [14:03:57] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus/magru: remove 7001 from prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1154777 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [14:04:26] (03PS2) 10Filippo Giunchedi: thanos: set memorymax for thanos-sidecar instances [puppet] - 10https://gerrit.wikimedia.org/r/1154799 (https://phabricator.wikimedia.org/T394318) [14:04:51] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] thanos: set memorymax for thanos-sidecar instances [puppet] - 10https://gerrit.wikimedia.org/r/1154799 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [14:05:34] (03CR) 10Tiziano Fogli: [C:03+2] prometheus/magru: remove 7001 from prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1154777 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [14:06:56] (03PS2) 10Filippo Giunchedi: thanos: set gomemlimit for sidecar [puppet] - 
10https://gerrit.wikimedia.org/r/1154798 (https://phabricator.wikimedia.org/T394318) [14:06:56] (03PS2) 10Filippo Giunchedi: thanos: add tracing define [puppet] - 10https://gerrit.wikimedia.org/r/1154782 (https://phabricator.wikimedia.org/T394318) [14:06:56] (03PS2) 10Filippo Giunchedi: hieradata: set default otel-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1154784 (https://phabricator.wikimedia.org/T394318) [14:06:56] (03PS2) 10Filippo Giunchedi: thanos-sidecar: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1154785 (https://phabricator.wikimedia.org/T394318) [14:06:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T395241)', diff saved to https://phabricator.wikimedia.org/P77282 and previous config saved to /var/cache/conftool/dbconfig/20250609-140656-fceratto.json [14:07:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance [14:07:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T395241)', diff saved to https://phabricator.wikimedia.org/P77283 and previous config saved to /var/cache/conftool/dbconfig/20250609-140722-fceratto.json [14:08:13] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: set gomemlimit for sidecar [puppet] - 10https://gerrit.wikimedia.org/r/1154798 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [14:09:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P77284 and previous config saved to /var/cache/conftool/dbconfig/20250609-140903-marostegui.json [14:11:31] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 124701264 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:12:05] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device lsw1-b7-codfw.mgmt.codfw.wmnet - Port with 
no description on access switch - https://phabricator.wikimedia.org/T395588#10895521 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:12:31] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:12:47] (03PS3) 10Stevemunene: zookeeper: onboard an-conf1004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922) [14:15:21] (03CR) 10CDanis: [C:03+1] thanos: add tracing define [puppet] - 10https://gerrit.wikimedia.org/r/1154782 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [14:15:35] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:15:40] (03CR) 10CDanis: [C:03+1] hieradata: set default otel-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1154784 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [14:15:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T395241)', diff saved to https://phabricator.wikimedia.org/P77285 and previous config saved to /var/cache/conftool/dbconfig/20250609-141548-fceratto.json [14:15:58] (03CR) 10CDanis: [C:03+1] thanos-sidecar: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1154785 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [14:18:14] (03PS4) 10Stevemunene: zookeeper: onboard an-conf1004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922) [14:18:14] (03PS3) 10Stevemunene: zookeeper: onboard an-conf1005 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135026 (https://phabricator.wikimedia.org/T374922) [14:18:14] (03PS3) 10Stevemunene: zookeeper: onboard an-conf1006 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922) 
[14:18:14] (03PS3) 10Stevemunene: zookeeper: remove an-conf100[1-3] from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) [14:18:21] !log rollout cgroup memory limit + gomemlimit for thanos-sidecar - T394318 [14:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:24] T394318: Revisit thanos queries concurrency and limits - https://phabricator.wikimedia.org/T394318 [14:20:33] (03CR) 10Hnowlan: [C:03+2] (api|rest)-gateway: force prometheus metrics on /metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154822 (owner: 10Hnowlan) [14:21:44] (03CR) 10Herron: [C:03+1] thanos: add tracing define [puppet] - 10https://gerrit.wikimedia.org/r/1154782 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [14:22:09] (03Merged) 10jenkins-bot: (api|rest)-gateway: force prometheus metrics on /metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154822 (owner: 10Hnowlan) [14:22:21] (03CR) 10Herron: [C:03+1] hieradata: set default otel-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1154784 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [14:23:03] (03CR) 10Herron: [C:03+1] thanos-sidecar: enable tracing [puppet] - 10https://gerrit.wikimedia.org/r/1154785 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [14:23:29] RESOLVED: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [14:24:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P77286 and previous config saved to /var/cache/conftool/dbconfig/20250609-142410-marostegui.json [14:24:19] !log hnowlan@deploy1003 helmfile [staging] START 
helmfile.d/services/rest-gateway: apply [14:24:26] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:25:09] (03CR) 10Fabfur: hiera: x-provenance header on all DCs [puppet] - 10https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur) [14:27:07] (03PS1) 10Ebernhardson: search: Return traffic to all DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154828 [14:30:37] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet [14:30:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P77287 and previous config saved to /var/cache/conftool/dbconfig/20250609-143054-fceratto.json [14:31:14] !log tappof@cumin1002 START - Cookbook sre.hosts.decommission for hosts prometheus7001.magru.wmnet [14:31:55] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for mc-misc2001.mgmt:22 - https://phabricator.wikimedia.org/T395643#10895605 (10Jhancock.wm) 05Open→03Declined [14:33:27] (03PS5) 10Stevemunene: zookeeper: onboard an-conf1004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922) [14:33:43] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10895609 (10Jhancock.wm) [14:33:56] (03CR) 10Marostegui: [C:03+2] filtered_tables.txt: Filtered column [puppet] - 10https://gerrit.wikimedia.org/r/1154819 (https://phabricator.wikimedia.org/T396130) (owner: 10Marostegui) [14:34:36] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10895613 (10jhathaway) p:05Triage→03Medium [14:34:40] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 10Mail: Add DMarcian trial-account address to the 
dmarc-ruf@wikimedia.org mailing list - https://phabricator.wikimedia.org/T396062#10895615 (10jhathaway) p:05Triage→03Medium [14:36:02] !log tappof@cumin1002 START - Cookbook sre.dns.netbox [14:38:29] FIRING: [2x] JobUnavailable: Reduced availability for job envoy in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T396130)', diff saved to https://phabricator.wikimedia.org/P77288 and previous config saved to /var/cache/conftool/dbconfig/20250609-143917-marostegui.json [14:39:21] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:39:29] !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1002" [14:39:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:39:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T396130)', diff saved to https://phabricator.wikimedia.org/P77289 and previous config saved to /var/cache/conftool/dbconfig/20250609-143938-marostegui.json [14:40:13] !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - tappof@cumin1002" [14:40:13] !log tappof@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:40:14] !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus7001.magru.wmnet 
[14:41:16] 10ops-codfw, 06DC-Ops: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363 (10Jhancock.wm) 03NEW [14:42:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T396130)', diff saved to https://phabricator.wikimedia.org/P77290 and previous config saved to /var/cache/conftool/dbconfig/20250609-144230-marostegui.json [14:43:54] (03PS1) 10Scott French: sessionstore-resources: add SessionStoreDiskSpaceRunwayTooLow [alerts] - 10https://gerrit.wikimedia.org/r/1141959 (https://phabricator.wikimedia.org/T390630) [14:46:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P77291 and previous config saved to /var/cache/conftool/dbconfig/20250609-144601-fceratto.json [14:50:26] (03PS1) 10Hnowlan: (api|rest)-gateway: change stat_prefix naming to be prometheus friendly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154833 [14:51:15] 10ops-codfw, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365 (10Jhancock.wm) 03NEW [14:57:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P77292 and previous config saved to /var/cache/conftool/dbconfig/20250609-145735-marostegui.json [15:01:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T395241)', diff saved to https://phabricator.wikimedia.org/P77293 and previous config saved to /var/cache/conftool/dbconfig/20250609-150108-fceratto.json [15:01:18] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [15:01:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [15:01:34] !log fceratto@cumin1002 dbctl commit (dc=all): 
'Depooling db1241 (T395241)', diff saved to https://phabricator.wikimedia.org/P77294 and previous config saved to /var/cache/conftool/dbconfig/20250609-150134-fceratto.json [15:02:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10895822 (10elukey) @Anton.Kokh Hi! To keep archives happy - are you going to follow up with @K... [15:02:06] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:03:11] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10895825 (10elukey) @cmelo Hi! Lemme know if you need help in following up with T395966#10882711, thanks! [15:03:29] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:33] (03CR) 10Hnowlan: [C:03+2] (api|rest)-gateway: change stat_prefix naming to be prometheus friendly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154833 (owner: 10Hnowlan) [15:07:09] (03Merged) 10jenkins-bot: (api|rest)-gateway: change stat_prefix naming to be prometheus friendly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154833 (owner: 10Hnowlan) [15:08:29] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:15] 06SRE, 06Infrastructure-Foundations, 06Traffic: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10895869 (10elukey) 
[15:11:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T395241)', diff saved to https://phabricator.wikimedia.org/P77295 and previous config saved to /var/cache/conftool/dbconfig/20250609-151144-fceratto.json [15:12:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P77296 and previous config saved to /var/cache/conftool/dbconfig/20250609-151242-marostegui.json [15:13:06] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:13:08] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [15:13:29] FIRING: [4x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:13:33] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:14:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:16:16] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:16:42] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:16:51] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply 
[15:17:18] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [15:17:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:18:09] (03CR) 10Btullis: [C:03+2] Deploy the root config folder to all Airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153704 (https://phabricator.wikimedia.org/T383931) (owner: 10Aleksandar Mastilovic) [15:19:06] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:19:08] RECOVERY - Squid on install1004 is OK: TCP OK - 0.001 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [15:19:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:20:33] (03Merged) 10jenkins-bot: Deploy the root config folder to all Airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153704 (https://phabricator.wikimedia.org/T383931) (owner: 10Aleksandar Mastilovic) [15:21:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 10 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: 10Dreamrimmer) [15:21:51] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: 
apply [15:22:26] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:22:47] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:23:27] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:23:34] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [15:23:44] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:23:45] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:59] (03PS1) 10Majavah: P:wmcs::metricsinfra: Log all alerts [puppet] - 10https://gerrit.wikimedia.org/r/1154841 (https://phabricator.wikimedia.org/T396038) [15:24:15] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:24:26] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [15:24:40] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5827/co" [puppet] - 10https://gerrit.wikimedia.org/r/1154841 (https://phabricator.wikimedia.org/T396038) (owner: 10Majavah) [15:25:03] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [15:25:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-platform-eng: apply [15:25:32] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:25:45] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:25:49] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [15:26:36] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [15:26:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P77297 and previous config saved to /var/cache/conftool/dbconfig/20250609-152651-fceratto.json [15:27:14] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [15:27:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [15:27:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:27:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T396130)', diff saved to https://phabricator.wikimedia.org/P77298 and previous config saved to /var/cache/conftool/dbconfig/20250609-152749-marostegui.json [15:27:52] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [15:28:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:28:06] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [15:28:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T396130)', diff saved to 
https://phabricator.wikimedia.org/P77299 and previous config saved to /var/cache/conftool/dbconfig/20250609-152810-marostegui.json [15:28:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [15:28:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [15:30:04] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1530). Please do the needful. [15:30:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T396130)', diff saved to https://phabricator.wikimedia.org/P77300 and previous config saved to /var/cache/conftool/dbconfig/20250609-153057-marostegui.json [15:32:26] (03PS1) 10Scott French: shellbox-video: upgrade image to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154840 (https://phabricator.wikimedia.org/T388260) [15:32:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:32:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:33:34] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:33:41] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:37:39] (03CR) 10Hnowlan: [C:03+1] shellbox-video: upgrade image to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154840 
(https://phabricator.wikimedia.org/T388260) (owner: 10Scott French) [15:41:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P77301 and previous config saved to /var/cache/conftool/dbconfig/20250609-154158-fceratto.json [15:46:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P77302 and previous config saved to /var/cache/conftool/dbconfig/20250609-154604-marostegui.json [15:46:52] !log dancy@deploy1003 Installing scap version "4.172.0" for 182 host(s) [15:49:04] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396323#10896087 (10Jhancock.wm) submitted a dispatch with Dell. SR211116839. should be here tomorrow or Wednesday at the latest. [15:50:41] 10ops-codfw, 06SRE, 06DC-Ops: test servers for new cage - https://phabricator.wikimedia.org/T393105#10896094 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:52:30] !log dancy@deploy1003 Installation of scap version "4.172.0" completed for 182 hosts [15:55:06] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2226 - https://phabricator.wikimedia.org/T396323#10896105 (10Marostegui) Thank you! 
[15:55:54] !log dancy@deploy1003 Started scap sync-world: Testing T395514 [15:55:57] T395514: Scap train: Rolling back from group2 to group1 takes too long - https://phabricator.wikimedia.org/T395514 [15:56:12] (03PS1) 10Filippo Giunchedi: titan: deploy local memcached [puppet] - 10https://gerrit.wikimedia.org/r/1154844 (https://phabricator.wikimedia.org/T394319) [15:56:14] (03PS1) 10Filippo Giunchedi: query-frontend: enable memcached on localhost [puppet] - 10https://gerrit.wikimedia.org/r/1154845 (https://phabricator.wikimedia.org/T394319) [15:57:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T395241)', diff saved to https://phabricator.wikimedia.org/P77303 and previous config saved to /var/cache/conftool/dbconfig/20250609-155705-fceratto.json [15:57:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [15:57:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T395241)', diff saved to https://phabricator.wikimedia.org/P77304 and previous config saved to /var/cache/conftool/dbconfig/20250609-155730-fceratto.json [15:58:52] (03PS2) 10Filippo Giunchedi: query-frontend: enable memcached on titan[21]001 [puppet] - 10https://gerrit.wikimedia.org/r/1154845 (https://phabricator.wikimedia.org/T394319) [16:00:05] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1141959 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [16:01:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P77305 and previous config saved to /var/cache/conftool/dbconfig/20250609-160111-marostegui.json [16:05:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T395241)', diff saved to https://phabricator.wikimedia.org/P77306 and previous config saved to /var/cache/conftool/dbconfig/20250609-160539-fceratto.json [16:12:37] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:12:47] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:13:20] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install build2003.codfw.wmnet - https://phabricator.wikimedia.org/T393015#10896262 (10Jhancock.wm) @akosiaris can you update the site.pp and preseed files for this server? We've received it and I intend on getting it racked today. 
[16:13:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install build2003.codfw.wmnet - https://phabricator.wikimedia.org/T393015#10896263 (10Jhancock.wm) [16:16:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T396130)', diff saved to https://phabricator.wikimedia.org/P77307 and previous config saved to /var/cache/conftool/dbconfig/20250609-161618-marostegui.json [16:16:22] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [16:16:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:16:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T396130)', diff saved to https://phabricator.wikimedia.org/P77308 and previous config saved to /var/cache/conftool/dbconfig/20250609-161640-marostegui.json [16:19:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T396130)', diff saved to https://phabricator.wikimedia.org/P77309 and previous config saved to /var/cache/conftool/dbconfig/20250609-161926-marostegui.json [16:19:37] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#10896279 (10Jhancock.wm) @RobH could you help me with updating the site.pp for this server? 
[16:19:50] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#10896281 (10Jhancock.wm) [16:20:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P77310 and previous config saved to /var/cache/conftool/dbconfig/20250609-162046-fceratto.json [16:28:20] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10896322 (10Jhancock.wm) [16:30:09] !log dancy@deploy1003 Finished scap sync-world: Testing T395514 (duration: 34m 14s) [16:30:12] T395514: Scap train: Rolling back from group2 to group1 takes too long - https://phabricator.wikimedia.org/T395514 [16:32:45] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:34:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P77311 and previous config saved to /var/cache/conftool/dbconfig/20250609-163433-marostegui.json [16:35:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P77312 and previous config saved to /var/cache/conftool/dbconfig/20250609-163553-fceratto.json [16:37:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cp2058 to codfw - jhancock@cumin2002" [16:37:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cp2058 to codfw - jhancock@cumin2002" [16:37:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:38:51] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2058 [16:38:59] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host cp2058 [16:40:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:40:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154128 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [16:41:17] (03PS1) 10HMonroy: Enable Codex and Multiblocks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154853 (https://phabricator.wikimedia.org/T377121) [16:42:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:46:21] (03PS1) 10Hnowlan: rest-gateway: enable per-route statistics for all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154854 [16:46:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:49:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P77313 and previous config saved to /var/cache/conftool/dbconfig/20250609-164940-marostegui.json [16:49:49] jhancock@cumin2002 provision (PID 1520115) is awaiting input [16:51:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T395241)', diff saved to https://phabricator.wikimedia.org/P77314 and previous config saved to /var/cache/conftool/dbconfig/20250609-165100-fceratto.json [16:51:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance 
[16:51:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T395241)', diff saved to https://phabricator.wikimedia.org/P77315 and previous config saved to /var/cache/conftool/dbconfig/20250609-165125-fceratto.json [16:55:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2058.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:56:32] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [16:58:37] (03CR) 10Scott French: [C:03+1] rest-gateway: enable per-route statistics for all routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154854 (owner: 10Hnowlan) [16:59:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cp2043 to codfw - jhancock@cumin2002" [16:59:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T395241)', diff saved to https://phabricator.wikimedia.org/P77316 and previous config saved to /var/cache/conftool/dbconfig/20250609-165936-fceratto.json [16:59:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cp2043 to codfw - jhancock@cumin2002" [16:59:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:54] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2043 [17:00:01] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1154840 (https://phabricator.wikimedia.org/T388260) (owner: 10Scott French) [17:00:04] (03CR) 10Scott French: [C:03+2] shellbox-video: upgrade image to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154840 (https://phabricator.wikimedia.org/T388260) (owner: 10Scott French) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1700) [17:00:05] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1700) [17:00:24] (03PS1) 10BCornwall: Revert^2 "acmechief: Add pywikipedia.org to the cert list" [puppet] - 10https://gerrit.wikimedia.org/r/1154855 [17:00:25] o/ [17:01:15] I'll be merging some changes that are not directly mediawiki-affecting, but are still preferable to have happen independently of any mediawiki deployments [17:01:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2043 [17:01:53] (03Merged) 10jenkins-bot: shellbox-video: upgrade image to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154840 (https://phabricator.wikimedia.org/T388260) (owner: 10Scott French) [17:02:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:02:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:04:15] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:04:43] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [17:04:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db1201 (T396130)', diff saved to https://phabricator.wikimedia.org/P77317 and previous config saved to /var/cache/conftool/dbconfig/20250609-170447-marostegui.json [17:04:51] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:05:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [17:09:30] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:09:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1231.eqiad.wmnet with reason: Maintenance [17:09:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T396130)', diff saved to https://phabricator.wikimedia.org/P77318 and previous config saved to /var/cache/conftool/dbconfig/20250609-170939-marostegui.json [17:10:21] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:12:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:12:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T396130)', diff saved to https://phabricator.wikimedia.org/P77319 and previous config saved to /var/cache/conftool/dbconfig/20250609-171225-marostegui.json [17:12:31] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:13:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:13:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED [17:14:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P77320 and previous config saved to /var/cache/conftool/dbconfig/20250609-171443-fceratto.json [17:14:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:18:04] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:21:08] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:21:13] !log bking@cumin1003 power down cirrussearch1063 to prevent logspam T394350 [17:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:17] T394350: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350 [17:21:34] (03CR) 10Btullis: [C:03+1] zookeeper: onboard an-conf1004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [17:21:51] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:22:01] Is right now a good time to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1154853 ? 
Wondering since last time I deployed there were some conflicts :) [17:23:46] jouncebot: nowandnext [17:23:46] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1700) [17:23:46] For the next 0 hour(s) and 6 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1700) [17:23:47] In 2 hour(s) and 36 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T2000) [17:25:17] hmonroy: ^ this is how to check what is currently being deployed. it is the mediawiki infrastructure window. And I would say check what is in that window and who is deploying. But also that window is empty and maybe your change belongs in it and should be added? [17:25:50] hmonroy: I should now be done with my changes [17:25:52] other than that I don't think there is an incident or anything right now.. so should be fine. [17:26:58] hmonroy: thank you for checking first, as we don't always explicitly schedule changes for the infra window. 
[17:27:01] (03CR) 10BCornwall: [C:03+1] Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff) [17:27:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P77322 and previous config saved to /var/cache/conftool/dbconfig/20250609-172733-marostegui.json [17:27:42] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10896542 (10Jhancock.wm) [17:28:44] (03PS1) 10Jsn.sherman: [WIP] Deploy remaining Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154860 (https://phabricator.wikimedia.org/T396250) [17:29:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P77323 and previous config saved to /var/cache/conftool/dbconfig/20250609-172950-fceratto.json [17:30:40] swfrench-wmf: I was reading the calendar https://wikitech.wikimedia.org/wiki/Deployments wrong. It doesn't specify AM/PM for my timezone so I was assuming evening when it was morning. [17:32:21] ah, got it. yeah, I tend to just look at the UTC timestamps out of habit, or query jouncebot like m.utante did above ^ :) [17:36:33] hmonroy: Is there something we can do better to make it clear that the times are in 24-hour format? 
[17:37:22] (03PS1) 10Dzahn: zuul::main: include ::passwords::mysql::zuul [puppet] - 10https://gerrit.wikimedia.org/r/1154861 (https://phabricator.wikimedia.org/T394844) [17:37:36] (03CR) 10CI reject: [V:04-1] zuul::main: include ::passwords::mysql::zuul [puppet] - 10https://gerrit.wikimedia.org/r/1154861 (https://phabricator.wikimedia.org/T394844) (owner: 10Dzahn) [17:38:02] (03PS2) 10Dzahn: zuul::main: include ::passwords::mysql::zuul [puppet] - 10https://gerrit.wikimedia.org/r/1154861 (https://phabricator.wikimedia.org/T394844) [17:38:18] dancy: now I know they are 24hr format, hmmmmm... I don't think so. I should have been scrolling and looking at the different slots [17:38:26] (03CR) 10CI reject: [V:04-1] zuul::main: include ::passwords::mysql::zuul [puppet] - 10https://gerrit.wikimedia.org/r/1154861 (https://phabricator.wikimedia.org/T394844) (owner: 10Dzahn) [17:38:47] (03CR) 10MusikAnimal: [C:03+1] Enable Codex and Multiblocks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154853 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [17:38:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:38:53] Ok. Good to know. Just one of those things. [17:39:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:39:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:39:46] (03PS3) 10Dzahn: zuul::main: include ::passwords::mysql::zuul [puppet] - 10https://gerrit.wikimedia.org/r/1154861 (https://phabricator.wikimedia.org/T394844) [17:39:47] yeah, thanks! 
[17:39:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:39:52] (03PS4) 10Dzahn: zuul::main: include ::passwords::mysql::zuul [puppet] - 10https://gerrit.wikimedia.org/r/1154861 (https://phabricator.wikimedia.org/T394844) [17:41:28] hmonroy: dancy: I just noticed on the deployment page it says you can also subscribe to a Google calendar version of that. That would show it in relation to local user time. [17:41:53] wasn't really aware of that. always just using jouncebot though [17:42:01] (03PS1) 10Cathal Mooney: Move 'evpn' inside 'protocols' section [homer/public] - 10https://gerrit.wikimedia.org/r/1154862 (https://phabricator.wikimedia.org/T394530) [17:42:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P77324 and previous config saved to /var/cache/conftool/dbconfig/20250609-174240-marostegui.json [17:44:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:44:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T395241)', diff saved to https://phabricator.wikimedia.org/P77325 and previous config saved to /var/cache/conftool/dbconfig/20250609-174457-fceratto.json [17:45:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [17:45:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T395241)', diff saved to https://phabricator.wikimedia.org/P77326 and previous config saved to /var/cache/conftool/dbconfig/20250609-174523-fceratto.json [17:46:51] jouncebot: nowandnext [17:46:51] For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T1700) [17:46:51] In 2 hour(s) and 13 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T2000) [17:46:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:47:06] ah! first time using jouncebot :) [17:48:26] (03CR) 10Cathal Mooney: [C:03+2] Move 'evpn' inside 'protocols' section [homer/public] - 10https://gerrit.wikimedia.org/r/1154862 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [17:49:09] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:49:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:49:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Maintenance [17:50:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:51:49] (03Merged) 10jenkins-bot: Move 'evpn' inside 'protocols' section [homer/public] - 10https://gerrit.wikimedia.org/r/1154862 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [17:52:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cp2044 to codfw - jhancock@cumin2002" [17:52:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cp2044 to codfw - jhancock@cumin2002" [17:52:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:53:04] !log 
jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:53:08] (03CR) 10Dzahn: [C:03+2] zuul::main: include ::passwords::mysql::zuul [puppet] - 10https://gerrit.wikimedia.org/r/1154861 (https://phabricator.wikimedia.org/T394844) (owner: 10Dzahn) [17:53:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T395241)', diff saved to https://phabricator.wikimedia.org/P77327 and previous config saved to /var/cache/conftool/dbconfig/20250609-175330-fceratto.json [17:53:58] jhancock@cumin2002 provision (PID 1566790) is awaiting input [17:55:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:57:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T396130)', diff saved to https://phabricator.wikimedia.org/P77328 and previous config saved to /var/cache/conftool/dbconfig/20250609-175747-marostegui.json [17:57:52] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [17:58:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [17:59:00] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp2044 [17:59:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp2044 [17:59:27] (03PS8) 10Bking: elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1138400 (https://phabricator.wikimedia.org/T387569) [17:59:27] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138400 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [17:59:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:03:14] !log jhancock@cumin2002 
END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:03:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [18:04:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hmonroy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154853 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [18:05:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2157.codfw.wmnet with reason: Maintenance [18:05:24] (03Merged) 10jenkins-bot: Enable Codex and Multiblocks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154853 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [18:05:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T396130)', diff saved to https://phabricator.wikimedia.org/P77329 and previous config saved to /var/cache/conftool/dbconfig/20250609-180530-marostegui.json [18:05:34] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [18:05:38] !log hmonroy@deploy1003 Started scap sync-world: Backport for [[gerrit:1154853|Enable Codex and Multiblocks by default (T377121)]] [18:05:41] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [18:08:16] (03PS9) 10Bking: elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1138400 (https://phabricator.wikimedia.org/T387569) [18:08:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P77330 and previous config saved to /var/cache/conftool/dbconfig/20250609-180836-fceratto.json [18:09:15] (03PS1) 10Andrew Bogott: Remove refs to 
cloudcontrol2004-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1154865 (https://phabricator.wikimedia.org/T396396) [18:09:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:09:24] !log hmonroy@deploy1003 hmonroy: Backport for [[gerrit:1154853|Enable Codex and Multiblocks by default (T377121)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:09:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:09:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T396130)', diff saved to https://phabricator.wikimedia.org/P77331 and previous config saved to /var/cache/conftool/dbconfig/20250609-180941-marostegui.json [18:10:26] (03PS1) 10AOkoth: miscweb: add os-reports update mechanism [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) [18:12:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:13:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:13:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138400 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [18:13:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:14:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) 
for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:14:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:14:21] (03PS4) 10Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) [18:15:39] !log hmonroy@deploy1003 hmonroy: Continuing with sync [18:15:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:15:59] (03CR) 10Eevans: [C:03+1] "Looks good to me; Thanks again for working this up!" [alerts] - 10https://gerrit.wikimedia.org/r/1141959 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [18:16:06] (03CR) 10Cwhite: [C:03+2] logstash: add early-stage filter to populate event.original [puppet] - 10https://gerrit.wikimedia.org/r/1152850 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [18:17:39] (03CR) 10Bking: "For reviewers: Check cluster membership as expressed in this change's YAML files vs. 
the source of truth of cluster membership, which is:" [puppet] - 10https://gerrit.wikimedia.org/r/1138400 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [18:22:35] !log hmonroy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154853|Enable Codex and Multiblocks by default (T377121)]] (duration: 16m 57s) [18:22:39] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [18:23:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P77332 and previous config saved to /var/cache/conftool/dbconfig/20250609-182343-fceratto.json [18:24:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P77333 and previous config saved to /var/cache/conftool/dbconfig/20250609-182448-marostegui.json [18:26:00] PROBLEM - Disk space on an-worker1093 is CRITICAL: DISK CRITICAL - free space: / 1775 MB (3% inode=93%): /tmp 1775 MB (3% inode=93%): /var/tmp 1775 MB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [18:26:03] (03CR) 10Andrew Bogott: [C:03+2] Remove refs to cloudcontrol2004-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1154865 (https://phabricator.wikimedia.org/T396396) (owner: 10Andrew Bogott) [18:30:47] (03PS1) 10Andrew Bogott: Remove cloudcontrol2004-dev.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1154869 (https://phabricator.wikimedia.org/T396396) [18:31:37] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudcontrol2004-dev.codfw.wmnet [18:35:00] (03CR) 10Dzahn: "thank you!:)" [puppet] - 10https://gerrit.wikimedia.org/r/1154543 (https://phabricator.wikimedia.org/T394844) (owner: 10Marostegui) [18:37:35] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [18:38:49] 
(03PS2) 10Scott French: sessionstore-resources: add SessionStoreDiskSpaceRunwayTooLow [alerts] - 10https://gerrit.wikimedia.org/r/1141959 (https://phabricator.wikimedia.org/T390630) [18:38:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T395241)', diff saved to https://phabricator.wikimedia.org/P77334 and previous config saved to /var/cache/conftool/dbconfig/20250609-183850-fceratto.json [18:39:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [18:39:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T395241)', diff saved to https://phabricator.wikimedia.org/P77335 and previous config saved to /var/cache/conftool/dbconfig/20250609-183915-fceratto.json [18:39:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P77336 and previous config saved to /var/cache/conftool/dbconfig/20250609-183955-marostegui.json [18:41:09] (03CR) 10Scott French: "Thank you both for the reviews!" [alerts] - 10https://gerrit.wikimedia.org/r/1141959 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [18:41:59] (03CR) 10Scott French: [C:03+2] sessionstore-resources: add SessionStoreDiskSpaceRunwayTooLow [alerts] - 10https://gerrit.wikimedia.org/r/1141959 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [18:42:08] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10896922 (10Jhancock.wm) @RobH documenting issues found so far with these servers that are the fault of Dell when they build these servers. line item: iDRAC Legacy Pa... 
[18:43:14] (03Merged) 10jenkins-bot: sessionstore-resources: add SessionStoreDiskSpaceRunwayTooLow [alerts] - 10https://gerrit.wikimedia.org/r/1141959 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [18:43:19] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10896924 (10Jhancock.wm) {F62275715} [18:43:19] andrew@cumin1002 decommission (PID 3622692) is awaiting input [18:43:21] (03CR) 10Andrew Bogott: [C:03+2] Remove cloudcontrol2004-dev.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1154869 (https://phabricator.wikimedia.org/T396396) (owner: 10Andrew Bogott) [18:47:39] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2004-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [18:48:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T395241)', diff saved to https://phabricator.wikimedia.org/P77337 and previous config saved to /var/cache/conftool/dbconfig/20250609-184809-fceratto.json [18:50:44] andrew@cumin1002 decommission (PID 3622692) is awaiting input [18:55:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T396130)', diff saved to https://phabricator.wikimedia.org/P77338 and previous config saved to /var/cache/conftool/dbconfig/20250609-185502-marostegui.json [18:55:06] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [18:55:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [18:55:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T396130)', diff saved to https://phabricator.wikimedia.org/P77339 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-185525-marostegui.json [18:59:16] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol2004-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [18:59:16] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:59:16] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol2004-dev.codfw.wmnet [18:59:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T396130)', diff saved to https://phabricator.wikimedia.org/P77340 and previous config saved to /var/cache/conftool/dbconfig/20250609-185935-marostegui.json [18:59:46] 06SRE-OnFire, 10Cassandra, 13Patch-For-Review, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10896988 (10Scott_French) The SessionStoreDiskSpaceRunwayTooLow alert is now live, although in warning se... 
[19:02:45] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [19:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 8.354% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:03:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P77341 and previous config saved to /var/cache/conftool/dbconfig/20250609-190316-fceratto.json [19:03:29] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:03:44] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [19:04:46] RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:58] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:05:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:07:43] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:08:16] !incidents [19:08:16] 6320 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [19:08:16] 6321 (UNACKED) VarnishUnavailable global sre (varnish-text thanos-rule) [19:08:16] 6322 (UNACKED) ProbeDown sre (2001:df2:e500:ed1a::1 ip6 text-https:443 probes/service http_text-https_ip6 eqsin) [19:08:29] FIRING: [3x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:10] !oncall-now [19:09:11] Oncall now for team SRE, rotation business_hours: [19:09:11] C.hrisDobbins901_ [19:09:58] RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:11:31] o/ [19:12:22] o/ [19:12:43] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:13:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:13:29] FIRING: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:13:34] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:13:44] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [19:14:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P77342 and previous config saved to /var/cache/conftool/dbconfig/20250609-191442-marostegui.json 
[19:14:54] !incidents [19:14:55] 6320 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [19:14:55] 6321 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [19:14:55] 6322 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::1 ip6 text-https:443 probes/service http_text-https_ip6 eqsin) [19:15:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web releases routed via main (k8s) 2.065s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:17:45] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [19:18:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P77343 and previous config saved to /var/cache/conftool/dbconfig/20250609-191823-fceratto.json [19:18:29] RESOLVED: [2x] ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:20:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:23:29] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [19:25:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:27:52] 06SRE, 10Legalpad, 10Phabricator: Allow aklapper to view/edit L3 - https://phabricator.wikimedia.org/T394966#10897035 (10LSobanski) 05Open→03Resolved Done. [19:29:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P77344 and previous config saved to /var/cache/conftool/dbconfig/20250609-192949-marostegui.json [19:33:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T395241)', diff saved to https://phabricator.wikimedia.org/P77345 and previous config saved to /var/cache/conftool/dbconfig/20250609-193329-fceratto.json [19:33:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance [19:33:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T395241)', diff saved to https://phabricator.wikimedia.org/P77346 and previous config saved to /var/cache/conftool/dbconfig/20250609-193354-fceratto.json [19:42:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P77347 and previous config saved to /var/cache/conftool/dbconfig/20250609-194203-fceratto.json [19:44:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T396130)', diff saved to https://phabricator.wikimedia.org/P77348 and previous config saved to /var/cache/conftool/dbconfig/20250609-194456-marostegui.json [19:45:00] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [19:45:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [19:45:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T396130)', diff saved to https://phabricator.wikimedia.org/P77349 and previous config saved to /var/cache/conftool/dbconfig/20250609-194520-marostegui.json [19:46:01] PROBLEM - Disk space on an-worker1093 is CRITICAL: DISK CRITICAL - free space: / 1769 MB (3% inode=95%): /tmp 1769 MB (3% inode=95%): /var/tmp 1769 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [19:49:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T396130)', diff saved to https://phabricator.wikimedia.org/P77350 and previous config saved to /var/cache/conftool/dbconfig/20250609-194904-marostegui.json [19:57:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P77351 and previous config saved to /var/cache/conftool/dbconfig/20250609-195709-fceratto.json [19:58:36] (03PS3) 10BryanDavis: shellbox-syntaxhighlight: Bump to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) [19:58:48] (03CR) 10BryanDavis: [C:03+2] 
shellbox-syntaxhighlight: Bump to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) (owner: 10BryanDavis) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T2000). [20:00:05] sd and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:33] here [20:00:37] (03Merged) 10jenkins-bot: shellbox-syntaxhighlight: Bump to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) (owner: 10BryanDavis) [20:01:25] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [20:02:26] Guess I'll get started with my backport [20:02:43] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:02:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154128 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [20:03:43] (03Merged) 10jenkins-bot: Disable VipsScaler in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154128 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [20:03:57] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1154128|Disable VipsScaler in group1 (T290759)]] [20:04:00] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [20:04:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P77352 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-200411-marostegui.json [20:04:26] (03PS3) 10Jsn.sherman: Deploy remaining Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154860 (https://phabricator.wikimedia.org/T396250) [20:05:56] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1154128|Disable VipsScaler in group1 (T290759)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:06:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.073s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:07:25] !log arlolra@deploy1003 arlolra: Continuing with sync [20:09:49] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [20:11:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.094s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:11:33] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:12:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P77353 and previous config saved to /var/cache/conftool/dbconfig/20250609-201216-fceratto.json [20:13:13] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply 
[20:14:18] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [20:14:20] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154128|Disable VipsScaler in group1 (T290759)]] (duration: 10m 23s) [20:14:23] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [20:15:18] Hi there, any chance I could squeak in a config change for a quicksurveys deployment? I was waiting on sampling values and they just came in [20:15:52] JSherman: I'm done with my deploy and I'm not sure sd is around, so it's all yours [20:16:13] arlorla: cool, I can self deploy. Thanks! [20:16:24] enjoy [20:16:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154860 (https://phabricator.wikimedia.org/T396250) (owner: 10Jsn.sherman) [20:17:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154860 (https://phabricator.wikimedia.org/T396250) (owner: 10Jsn.sherman) [20:17:56] (03Merged) 10jenkins-bot: Deploy remaining Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154860 (https://phabricator.wikimedia.org/T396250) (owner: 10Jsn.sherman) [20:18:09] !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1154860|Deploy remaining Patroller Tools surveys (T396250)]] [20:18:13] T396250: Deploy remaining Patroller Tools surveys - https://phabricator.wikimedia.org/T396250 [20:19:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P77354 and previous config saved to /var/cache/conftool/dbconfig/20250609-201918-marostegui.json [20:20:04] !log jsn@deploy1003 jsn: Backport for [[gerrit:1154860|Deploy remaining 
Patroller Tools surveys (T396250)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:21:31] testing... [20:24:33] !log jsn@deploy1003 jsn: Continuing with sync [20:26:01] PROBLEM - Disk space on an-worker1093 is CRITICAL: DISK CRITICAL - free space: / 1762 MB (3% inode=93%): /tmp 1762 MB (3% inode=93%): /var/tmp 1762 MB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops [20:27:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T395241)', diff saved to https://phabricator.wikimedia.org/P77355 and previous config saved to /var/cache/conftool/dbconfig/20250609-202723-fceratto.json [20:27:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1252.eqiad.wmnet with reason: Maintenance [20:31:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.147s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:31:25] !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154860|Deploy remaining Patroller Tools surveys (T396250)]] (duration: 13m 15s) [20:31:28] T396250: Deploy remaining Patroller Tools surveys - https://phabricator.wikimedia.org/T396250 [20:33:29] alrighty, deployment looks good with the desired wikis with the exception of wikidata, which seems to be a not-workable fit for quicksurveys. I'm leaving it in for now, but may disable it later this week if there is no path to success there. 
[20:34:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T396130)', diff saved to https://phabricator.wikimedia.org/P77356 and previous config saved to /var/cache/conftool/dbconfig/20250609-203425-marostegui.json [20:34:29] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [20:34:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance [20:34:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T396130)', diff saved to https://phabricator.wikimedia.org/P77357 and previous config saved to /var/cache/conftool/dbconfig/20250609-203448-marostegui.json [20:36:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.147s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:37:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T396130)', diff saved to https://phabricator.wikimedia.org/P77358 and previous config saved to /var/cache/conftool/dbconfig/20250609-203733-marostegui.json [20:39:47] (03PS3) 10Scott French: shellbox: align image version to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127188 (https://phabricator.wikimedia.org/T388260) [20:52:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P77359 and previous config saved to /var/cache/conftool/dbconfig/20250609-205239-marostegui.json [20:58:29] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more 
than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [21:00:04] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T2100). Please do the needful. [21:01:15] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cirrussearch2115.codfw.wmnet [21:01:41] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10897418 (10ops-monitoring-bot) Host cirrussearch2115.codfw.wmnet rebooted by bking@cumin2002 with reason: rebooting for host updates [21:03:13] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:03:19] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:07:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P77360 and previous config saved to /var/cache/conftool/dbconfig/20250609-210746-marostegui.json [21:09:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2115.codfw.wmnet [21:10:54] (03PS1) 10Ladsgroup: Restrict event page decoration to currently allowed namespaces [extensions/CampaignEvents] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1154886 (https://phabricator.wikimedia.org/T392784) [21:12:54] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cirrussearch2114.codfw.wmnet [21:13:20] 10ops-codfw, 06SRE, 06DC-Ops, 
10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10897441 (10ops-monitoring-bot) Host cirrussearch2114.codfw.wmnet rebooted by bking@cumin2002 with reason: rebooting for storage firmware upd... [21:16:45] (03CR) 10Daimona Eaytoy: [C:03+1] Restrict event page decoration to currently allowed namespaces [extensions/CampaignEvents] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1154886 (https://phabricator.wikimedia.org/T392784) (owner: 10Ladsgroup) [21:19:21] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet [21:19:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cirrussearch2113.codfw.wmnet [21:21:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2114.codfw.wmnet [21:22:04] (03CR) 10Ladsgroup: [C:03+2] Restrict event page decoration to currently allowed namespaces [extensions/CampaignEvents] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1154886 (https://phabricator.wikimedia.org/T392784) (owner: 10Ladsgroup) [21:22:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T396130)', diff saved to https://phabricator.wikimedia.org/P77361 and previous config saved to /var/cache/conftool/dbconfig/20250609-212253-marostegui.json [21:22:57] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [21:23:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2201.codfw.wmnet with reason: Maintenance [21:23:15] (03CR) 10Jforrester: "You (or someone) needs to be present in IRC for this to be deployed, to confirm with the deployer that it's OK and hasn't broken anything." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [21:23:28] (03Merged) 10jenkins-bot: Restrict event page decoration to currently allowed namespaces [extensions/CampaignEvents] (wmf/1.45.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1154886 (https://phabricator.wikimedia.org/T392784) (owner: 10Ladsgroup) [21:24:28] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1154886|Restrict event page decoration to currently allowed namespaces (T392784)]] [21:24:32] T392784: CampaignEvents makes an uncached x1 DB query on pageviews - https://phabricator.wikimedia.org/T392784 [21:24:32] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:25:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2211.codfw.wmnet with reason: Maintenance [21:25:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T396130)', diff saved to https://phabricator.wikimedia.org/P77362 and previous config saved to /var/cache/conftool/dbconfig/20250609-212531-marostegui.json [21:26:26] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1154886|Restrict event page decoration to currently allowed namespaces (T392784)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:27:06] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 [21:27:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:27:13] !log vriley@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1186 [21:28:22] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [21:28:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10897493 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... [21:28:40] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [21:29:08] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:29:14] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:29:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T396130)', diff saved to https://phabricator.wikimedia.org/P77363 and previous config saved to /var/cache/conftool/dbconfig/20250609-212939-marostegui.json [21:29:43] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [21:33:40] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 [21:33:47] !log vriley@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1186 [21:35:36] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154886|Restrict event page decoration to 
currently allowed namespaces (T392784)]] (duration: 11m 07s) [21:35:40] T392784: CampaignEvents makes an uncached x1 DB query on pageviews - https://phabricator.wikimedia.org/T392784 [21:36:22] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2115.codfw.wmnet [21:36:34] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cirrussearch2115.codfw.wmnet [21:40:29] about to do a security deploy [21:41:15] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cirrussearch2115.codfw.wmnet [21:41:42] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10897518 (10ops-monitoring-bot) Host cirrussearch2115.codfw.wmnet rebooted by bking@cumin2002 with reason: rebooting for storage firmware upd... [21:44:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P77364 and previous config saved to /var/cache/conftool/dbconfig/20250609-214446-marostegui.json [21:48:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2115.codfw.wmnet [21:49:06] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host cirrussearch2114.codfw.wmnet [21:49:08] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:49:15] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:49:34] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] -
https://phabricator.wikimedia.org/T394432#10897526 (10ops-monitoring-bot) Host cirrussearch2114.codfw.wmnet rebooted by ryankemper@cumin2002 with reason: rebooting for storage firmwar... [21:49:51] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2115 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:53:29] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cirrussearch2115:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:56:34] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2114.codfw.wmnet [21:56:52] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host cirrussearch2112.codfw.wmnet [21:57:05] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10897532 (10ops-monitoring-bot) Host cirrussearch2112.codfw.wmnet rebooted by ryankemper@cumin2002 with reason: rebooting for storage firmwar... 
[21:58:29] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9600.service on cirrussearch2115:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:59:51] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cirrussearch2115 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:59:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P77365 and previous config saved to /var/cache/conftool/dbconfig/20250609-215953-marostegui.json [22:00:59] scap !log Deployed security fix for T395730 [22:01:13] !log Deployed security fix for T395730 [22:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:34] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2112.codfw.wmnet [22:06:23] 10ops-codfw, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T396396#10897568 (10Andrew) a:05Andrew→03None [22:06:31] (03PS1) 10Dwisehaupt: Add civi.frdev.wm.o cname pointing at frdev host [dns] - 10https://gerrit.wikimedia.org/r/1154895 (https://phabricator.wikimedia.org/T396084) [22:07:02] (03PS2) 10Dwisehaupt: Add civi.frdev.wm.o cname pointing at frdev host [dns] - 10https://gerrit.wikimedia.org/r/1154895 (https://phabricator.wikimedia.org/T396084) [22:08:21] !log Deployed security fix for T396230 [22:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:13] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host cirrussearch2111.codfw.wmnet [22:12:23] !log Deployed security fix for T395063 [22:12:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:36] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10897578 (10ops-monitoring-bot) Host cirrussearch2111.codfw.wmnet rebooted by ryankemper@cumin2002 with reason: rebooting for storage firmwar... [22:15:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T396130)', diff saved to https://phabricator.wikimedia.org/P77366 and previous config saved to /var/cache/conftool/dbconfig/20250609-221501-marostegui.json [22:15:05] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [22:15:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2223.codfw.wmnet with reason: Maintenance [22:15:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T396130)', diff saved to https://phabricator.wikimedia.org/P77367 and previous config saved to /var/cache/conftool/dbconfig/20250609-221524-marostegui.json [22:19:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T396130)', diff saved to https://phabricator.wikimedia.org/P77368 and previous config saved to /var/cache/conftool/dbconfig/20250609-221932-marostegui.json [22:19:59] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2111.codfw.wmnet [22:29:13] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reboot-single for host cirrussearch2110.codfw.wmnet [22:29:23] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10897615 (10ops-monitoring-bot) Host cirrussearch2110.codfw.wmnet rebooted by ryankemper@cumin2002 with reason: rebooting for storage firmwar... 
[22:34:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P77369 and previous config saved to /var/cache/conftool/dbconfig/20250609-223439-marostegui.json [22:36:57] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cirrussearch2110.codfw.wmnet [22:48:37] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [22:48:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10897661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... [22:49:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P77370 and previous config saved to /var/cache/conftool/dbconfig/20250609-224947-marostegui.json [22:54:03] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10897666 (10RKemper) [22:54:42] (03CR) 10Novem Linguae: [C:03+1] "The code in this patch is pretty simple, but involved a lot of different stakeholders. 
See https://phabricator.wikimedia.org/T378287#10897" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1083870 (https://phabricator.wikimedia.org/T378287) (owner: 10Dreamrimmer) [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250609T2300) [23:04:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T396130)', diff saved to https://phabricator.wikimedia.org/P77371 and previous config saved to /var/cache/conftool/dbconfig/20250609-230454-marostegui.json [23:04:58] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [23:05:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2228.codfw.wmnet with reason: Maintenance [23:05:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T396130)', diff saved to https://phabricator.wikimedia.org/P77372 and previous config saved to /var/cache/conftool/dbconfig/20250609-230518-marostegui.json [23:09:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T396130)', diff saved to https://phabricator.wikimedia.org/P77373 and previous config saved to /var/cache/conftool/dbconfig/20250609-230903-marostegui.json [23:23:38] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [23:24:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P77375 and previous config saved to /var/cache/conftool/dbconfig/20250609-232410-marostegui.json [23:32:43] PROBLEM - ganeti-noded running on ganeti1047 is CRITICAL: 
PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [23:33:43] RECOVERY - ganeti-noded running on ganeti1047 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [23:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154903 [23:38:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154903 (owner: 10TrainBranchBot) [23:39:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P77376 and previous config saved to /var/cache/conftool/dbconfig/20250609-233918-marostegui.json [23:47:58] (03PS6) 10Scott French: httpd: introduce -bookworm track and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) [23:50:17] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154903 (owner: 10TrainBranchBot) [23:54:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T396130)', diff saved to https://phabricator.wikimedia.org/P77377 and previous config saved to /var/cache/conftool/dbconfig/20250609-235425-marostegui.json [23:54:29] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [23:57:51] (03CR) 10Scott French: "Now that we can reuse some of tooling we created for the PHP 8.1 migration to pilot this easily, it makes sense to pick this back up and g" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1081989 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French)