[00:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:24] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:27] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:06:41] FIRING: [13x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:08:42] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1146107
[00:08:42] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1146107 (owner: TrainBranchBot)
[00:10:43] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:41] FIRING: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:23:15] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824694 (Jhancock.wm)
[00:24:41] (CR) Cwhite: [C:+1] "I'll roll this out tomorrow." [puppet] - https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: Filippo Giunchedi)
[00:29:43] RECOVERY - Disk space on arclamp1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops
[00:31:25] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1146107 (owner: TrainBranchBot)
[00:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[00:53:15] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:53:15] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:54:15] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/2bb59e518169dc32b3a7791729a47586865fb87b42b3ddd914701d94b9555aef/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:07:33] RECOVERY - Disk space on arclamp2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops
[01:08:08] !log clear up some space on arclamp2001 to allow arclamp_compress_logs to complete
[01:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:15] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:32:38] (PS1) Andrew Bogott: Octavia: change hiera port to 9876 [puppet] - https://gerrit.wikimedia.org/r/1146117 (https://phabricator.wikimedia.org/T393783)
[01:32:39] (PS1) Andrew Bogott: cloudlb: add octavia endpoint in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1146118 (https://phabricator.wikimedia.org/T393783)
[01:32:45] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146118 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[01:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:36:24] (CR) Andrew Bogott: [C:+2] Octavia: change hiera port to 9876 [puppet] - https://gerrit.wikimedia.org/r/1146117 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[01:36:26] (CR) Andrew Bogott: [C:+2] cloudlb: add octavia endpoint in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1146118 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[01:36:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[01:41:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[01:50:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[01:55:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[02:21:53] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 193878288 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:22:53] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 48192 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:46:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:51:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:56:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:00:43] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 1.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:01:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:07:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[03:12:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[03:16:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:21:45] RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:49:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[03:54:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[04:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:11:41] FIRING: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:53:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1256 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76157 and previous config saved to /var/cache/conftool/dbconfig/20250515-045345-ladsgroup.json
[04:53:49] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820
[04:56:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1192 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76158 and previous config saved to /var/cache/conftool/dbconfig/20250515-045631-ladsgroup.json
[04:56:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet,es1043.eqiad.wmnet with reason: Maintenance
[04:56:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1043 es2041 T391921', diff saved to https://phabricator.wikimedia.org/P76159 and previous config saved to /var/cache/conftool/dbconfig/20250515-045658-marostegui.json
[04:57:01] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[04:57:45] (PS1) Marostegui: es1043: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146171 (https://phabricator.wikimedia.org/T391921)
[04:59:50] (CR) Marostegui: [C:+2] es1043: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146171 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:01:52] (PS1) Marostegui: es2041: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146174 (https://phabricator.wikimedia.org/T391921)
[05:03:00] (CR) Marostegui: [C:+2] es2041: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146174 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:06:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76160 and previous config saved to /var/cache/conftool/dbconfig/20250515-050607-root.json
[05:06:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76161 and previous config saved to /var/cache/conftool/dbconfig/20250515-050620-root.json
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc7 T394260', diff saved to https://phabricator.wikimedia.org/P76162 and previous config saved to /var/cache/conftool/dbconfig/20250515-050724-marostegui.json
[05:07:27] T394260: Productionize pc8 - https://phabricator.wikimedia.org/T394260
[05:08:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc1017.eqiad.wmnet with reason: Maintenance
[05:08:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc2017.codfw.wmnet with reason: Maintenance
[05:10:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance
[05:12:24] (PS1) Marostegui: dbconfig.schema: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146175 (https://phabricator.wikimedia.org/T394260)
[05:15:31] (CR) Marostegui: [C:+2] dbconfig.schema: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146175 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:20:57] (PS1) Marostegui: valid_section.pp: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146176 (https://phabricator.wikimedia.org/T394260)
[05:21:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76163 and previous config saved to /var/cache/conftool/dbconfig/20250515-052113-root.json
[05:21:24] (CR) Ladsgroup: [C:+1] valid_section.pp: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146176 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:21:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76164 and previous config saved to /var/cache/conftool/dbconfig/20250515-052126-root.json
[05:25:22] (CR) Marostegui: [C:+2] valid_section.pp: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146176 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:28:36] (PS1) Marostegui: pc1018: Productionize [puppet] - https://gerrit.wikimedia.org/r/1146184 (https://phabricator.wikimedia.org/T394260)
[05:29:07] (CR) Ladsgroup: [C:+1] "<3" [puppet] - https://gerrit.wikimedia.org/r/1146184 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:29:23] (PS2) Marostegui: pc1018: Productionize [puppet] - https://gerrit.wikimedia.org/r/1146184 (https://phabricator.wikimedia.org/T394260)
[05:30:40] (CR) Marostegui: [C:+2] pc1018: Productionize [puppet] - https://gerrit.wikimedia.org/r/1146184 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:36:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76165 and previous config saved to /var/cache/conftool/dbconfig/20250515-053618-root.json
[05:36:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76166 and previous config saved to /var/cache/conftool/dbconfig/20250515-053631-root.json
[05:39:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1042 and es2042 to es4 masters T391921', diff saved to https://phabricator.wikimedia.org/P76167 and previous config saved to /var/cache/conftool/dbconfig/20250515-053958-marostegui.json
[05:40:02] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[05:41:05] (PS1) Marostegui: wmnet: Update es4-master [dns] - https://gerrit.wikimedia.org/r/1146190 (https://phabricator.wikimedia.org/T391921)
[05:41:21] (CR) Marostegui: "This is a noop" [dns] - https://gerrit.wikimedia.org/r/1146190 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:41:44] (CR) Marostegui: [C:+2] wmnet: Update es4-master [dns] - https://gerrit.wikimedia.org/r/1146190 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:41:47] !log marostegui@dns1006 START - running authdns-update
[05:43:02] !log marostegui@dns1006 END - running authdns-update
[05:50:20] SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10824887 (Ladsgroup)
[05:51:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76168 and previous config saved to /var/cache/conftool/dbconfig/20250515-055124-root.json
[05:51:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76169 and previous config saved to /var/cache/conftool/dbconfig/20250515-055137-root.json
[05:53:39] (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: JHathaway)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0600)
[06:00:05] marostegui, Amir1, and federico3: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0600)
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:06:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76170 and previous config saved to /var/cache/conftool/dbconfig/20250515-060629-root.json
[06:06:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76171 and previous config saved to /var/cache/conftool/dbconfig/20250515-060643-root.json
[06:16:56] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:19:02] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:21:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76172 and previous config saved to /var/cache/conftool/dbconfig/20250515-062135-root.json
[06:21:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76173 and previous config saved to /var/cache/conftool/dbconfig/20250515-062149-root.json
[06:23:56] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:24:24] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:25:02] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:34:02] Deploying cxserver..
[06:34:42] (CR) KartikMistry: [C:+2] Update cxserver to 2025-05-14-005542-production [deployment-charts] - https://gerrit.wikimedia.org/r/1145456 (https://phabricator.wikimedia.org/T394008) (owner: KartikMistry)
[06:36:14] (Merged) jenkins-bot: Update cxserver to 2025-05-14-005542-production [deployment-charts] - https://gerrit.wikimedia.org/r/1145456 (https://phabricator.wikimedia.org/T394008) (owner: KartikMistry)
[06:36:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76174 and previous config saved to /var/cache/conftool/dbconfig/20250515-063641-root.json
[06:36:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76175 and previous config saved to /var/cache/conftool/dbconfig/20250515-063655-root.json
[06:38:14] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:38:36] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:40:30] (CR) JMeybohm: "Deploying and testing should be possible without service catalog entry. So usually the entry is created the way the service is supposed to" [puppet] - https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[06:43:21] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:43:53] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:46:06] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:46:38] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:47:10] (PS1) Muehlenhoff: Bitu: When approving a permission request mention the need for re-login [software/bitu] - https://gerrit.wikimedia.org/r/1146446 (https://phabricator.wikimedia.org/T393724)
[06:47:15] (CR) JMeybohm: [C:+2] Reapply "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - https://gerrit.wikimedia.org/r/1145857 (owner: JMeybohm)
[06:49:57] !log Updated cxserver to 2025-05-14-005542-production (T394008, T392499)
[06:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:01] T394008: CXServer doesn't support section suggestions for "be-tarask" language code - https://phabricator.wikimedia.org/T394008
[06:50:01] T392499: Post-creation work for rkiwiki - https://phabricator.wikimedia.org/T392499
[06:50:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1045 es2045 T391921', diff saved to https://phabricator.wikimedia.org/P76176 and previous config saved to /var/cache/conftool/dbconfig/20250515-065039-marostegui.json
[06:50:43] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[06:51:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2045.codfw.wmnet,es1045.eqiad.wmnet with reason: Maintenance
[06:51:22] (PS1) Marostegui: es1045: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146447 (https://phabricator.wikimedia.org/T391921)
[06:51:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76177 and previous config saved to /var/cache/conftool/dbconfig/20250515-065147-root.json
[06:52:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76178 and previous config saved to /var/cache/conftool/dbconfig/20250515-065200-root.json
[06:52:51] (CR) Marostegui: [C:+2] es1045: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146447 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:55:06] (CR) Fabfur: [C:+2] hiera: enable vk monitoring in magru to actually remove it [puppet] - https://gerrit.wikimedia.org/r/1146021 (https://phabricator.wikimedia.org/T393772) (owner: Fabfur)
[06:55:09] (CR) Slyngshede: [C:+1] admin: SSH key rotation for cmassaro [puppet] - https://gerrit.wikimedia.org/r/1146033 (https://phabricator.wikimedia.org/T393140) (owner: BCornwall)
[06:56:12] (PS1) Marostegui: es2045: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146448 (https://phabricator.wikimedia.org/T391921)
[06:56:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76179 and previous config saved to /var/cache/conftool/dbconfig/20250515-065613-root.json
[06:57:26] (CR) Marostegui: [C:+2] es2045: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146448 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:57:37] (CR) Slyngshede: [C:+1] "Looks good." [software/bitu] - https://gerrit.wikimedia.org/r/1146446 (https://phabricator.wikimedia.org/T393724) (owner: Muehlenhoff)
[06:59:24] RESOLVED: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:04] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0700).
[07:00:05] MichaelG_WMF: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:03:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[07:04:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76180 and previous config saved to /var/cache/conftool/dbconfig/20250515-070433-root.json
[07:05:30] (Merged) jenkins-bot: Reapply "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - https://gerrit.wikimedia.org/r/1145857 (owner: JMeybohm)
[07:05:49] (CR) Muehlenhoff: [C:+2] Bitu: When approving a permission request mention the need for re-login [software/bitu] - https://gerrit.wikimedia.org/r/1146446 (https://phabricator.wikimedia.org/T393724) (owner: Muehlenhoff)
[07:06:30] !log add 70G to arclamp /srv
[07:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76181 and previous config saved to /var/cache/conftool/dbconfig/20250515-070653-root.json
[07:07:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76182 and previous config saved to /var/cache/conftool/dbconfig/20250515-070706-root.json
[07:07:48] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10825034 (MoritzMuehlenhoff) >>! In T393724#10823734, @thcipriani wrote: >>>! In T393724#10823444, @Esanders wrote: >> |cn |[Esanders] >> |mail |[esanders@wikimedia...
[07:11:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76183 and previous config saved to /var/cache/conftool/dbconfig/20250515-071119-root.json
[07:13:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[07:16:17] hi
[07:16:33] sorry for being late - network issues...
[07:17:01] jouncebot: nowandnext [07:17:01] For the next 0 hour(s) and 42 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0700) [07:17:01] In 0 hour(s) and 42 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0800) [07:18:32] !log installing nginx security updates [07:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:35] (03CR) 10Ilias Sarantopoulos: [C:03+1] python-webapp: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145226 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:19:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76184 and previous config saved to /var/cache/conftool/dbconfig/20250515-071939-root.json [07:24:35] (03CR) 10Brouberol: [C:03+1] Revert "hdfs: Exclude rack F3 hosts from analytics cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1145943 (https://phabricator.wikimedia.org/T390171) (owner: 10Stevemunene) [07:24:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [07:24:53] (03CR) 10Brouberol: [C:03+1] spark-history: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145229 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:25:07] (03CR) 10Brouberol: [C:03+1] superset: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145230 (https://phabricator.wikimedia.org/T391333) 
(owner: 10Elukey) [07:26:25] (03CR) 10Elukey: [C:03+2] python-webapp: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145226 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:26:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76185 and previous config saved to /var/cache/conftool/dbconfig/20250515-072625-root.json [07:26:30] (03PS1) 10JMeybohm: CI test change - do not merge [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146465 [07:26:33] (03CR) 10Elukey: [C:03+2] recommendation-api: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145227 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:26:39] (03CR) 10Elukey: [C:03+2] shellbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145228 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:26:48] (03CR) 10Elukey: [C:03+2] spark-history: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145229 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:26:56] (03CR) 10Elukey: [C:03+2] superset: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145230 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:27:03] (03CR) 10Elukey: [C:03+2] tegola-vector-tiles: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145231 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:27:10] (03CR) 10Elukey: [C:03+2] termbox: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145232 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:27:16] (03CR) 10Elukey: [C:03+2] thumbor: upgrade to mesh.configuration 1.13 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1145233 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:27:18] (03CR) 10Joal: [C:03+1] "LGTM! Thank you :)" [alerts] - 10https://gerrit.wikimedia.org/r/1136383 (https://phabricator.wikimedia.org/T391810) (owner: 10Fabfur) [07:27:30] (03PS11) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) [07:29:22] (03CR) 10Brouberol: "Almost all good! Just a minor not on `airflow-main/values-production.yaml`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [07:29:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [07:29:46] (03PS1) 10Elukey: toolhub: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146483 (https://phabricator.wikimedia.org/T391333) [07:29:48] (03PS1) 10Elukey: wikifeeds: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146484 (https://phabricator.wikimedia.org/T391333) [07:29:49] (03PS1) 10Elukey: zotero: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146485 (https://phabricator.wikimedia.org/T391333) [07:29:57] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146465 (owner: 10JMeybohm) [07:30:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1041 es2043 T391921', diff saved to https://phabricator.wikimedia.org/P76186 and previous config saved to 
/var/cache/conftool/dbconfig/20250515-073033-marostegui.json [07:30:37] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [07:31:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2043.codfw.wmnet,es1041.eqiad.wmnet with reason: Maintenance [07:31:30] (03PS1) 10Marostegui: es1041: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1146490 (https://phabricator.wikimedia.org/T391921) [07:31:37] (03PS1) 10Majavah: Do not show thumbnails or descriptions on Wikitech search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146491 [07:32:52] (03CR) 10Marostegui: [C:03+2] es1041: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1146490 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [07:33:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [07:34:04] (03PS1) 10Elukey: growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) [07:34:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [07:34:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76187 and previous config saved to /var/cache/conftool/dbconfig/20250515-073445-root.json [07:35:04] (03PS1) 10Marostegui: es2043: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1146494 (https://phabricator.wikimedia.org/T391921) [07:35:30] (03CR) 10CI reject: [V:04-1] growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [07:35:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply 
[07:36:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [07:37:04] (03CR) 10Marostegui: [C:03+2] es2043: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1146494 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [07:37:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76188 and previous config saved to /var/cache/conftool/dbconfig/20250515-073723-root.json [07:38:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [07:40:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [07:41:06] (03PS2) 10Elukey: growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) [07:41:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76189 and previous config saved to /var/cache/conftool/dbconfig/20250515-074131-root.json [07:41:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76190 and previous config saved to /var/cache/conftool/dbconfig/20250515-074142-root.json [07:49:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76191 and previous config saved to /var/cache/conftool/dbconfig/20250515-074950-root.json [07:50:43] (03CR) 10Federico Ceratto: "Thanks for the check. 
The configuration has been updated with more help from @cgoubert@wikimedia.org and should be ok now:" [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [07:52:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76192 and previous config saved to /var/cache/conftool/dbconfig/20250515-075228-root.json [07:53:06] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146465 (owner: 10JMeybohm) [07:56:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76193 and previous config saved to /var/cache/conftool/dbconfig/20250515-075636-root.json [07:56:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76194 and previous config saved to /var/cache/conftool/dbconfig/20250515-075648-root.json [07:58:09] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145981 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [07:58:20] (03CR) 10JMeybohm: [C:03+1] "This LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145981 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [07:59:17] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146465 (owner: 10JMeybohm) [08:00:04] jnuche and jeena: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0800) [08:00:26] hi, I'll be rolling out the train in a few minutes [08:00:28] (03CR) 10Filippo Giunchedi: [C:03+1] toolhub: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146483 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:13] (03PS1) 10Brouberol: Copy app.generic to make the subsequent diff easier to review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146499 (https://phabricator.wikimedia.org/T391333) [08:02:14] (03PS1) 10Brouberol: modules/app/generic: allow the definition of app env vars from a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146500 (https://phabricator.wikimedia.org/T391333) [08:02:15] (03PS1) 10Brouberol: spark-history: re-introduce environment variable injection from a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146501 (https://phabricator.wikimedia.org/T391333) [08:03:06] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146502 (https://phabricator.wikimedia.org/T392171) [08:03:07] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146502 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:03:30] (03CR) 10Filippo Giunchedi: [C:03+1] growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:03:35] (03CR) 10Filippo Giunchedi: [C:03+1] zotero: upgrade to mesh.configuration 1.13 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1146485 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:03:39] (03CR) 10Filippo Giunchedi: [C:03+1] wikifeeds: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146484 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:03:59] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146502 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:04:23] (03CR) 10Brouberol: [C:03+1] growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:04:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76195 and previous config saved to /var/cache/conftool/dbconfig/20250515-080456-root.json [08:05:58] 06SRE-OnFire, 10SRE-swift-storage, 07Sustainability: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913#10825162 (10Jelto) I'm adding #Sustainability (Incident Followup) and #SRE-OnFire tags here because this task was mentioned during one of the last swi... 
[08:06:02] (03CR) 10Brouberol: "This patch cannot be rebased due to conflicts" [puppet] - 10https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: 10Aleksandar Mastilovic) [08:07:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76196 and previous config saved to /var/cache/conftool/dbconfig/20250515-080733-root.json [08:08:31] (03CR) 10Elukey: [C:03+1] Copy app.generic to make the subsequent diff easier to review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146499 (https://phabricator.wikimedia.org/T391333) (owner: 10Brouberol) [08:08:45] (03CR) 10Elukey: [C:03+1] modules/app/generic: allow the definition of app env vars from a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146500 (https://phabricator.wikimedia.org/T391333) (owner: 10Brouberol) [08:09:08] (03CR) 10Fabfur: [C:03+2] data-engineering: duplicating varnishkafka alerts [alerts] - 10https://gerrit.wikimedia.org/r/1136383 (https://phabricator.wikimedia.org/T391810) (owner: 10Fabfur) [08:09:10] (03CR) 10Elukey: [C:03+1] spark-history: re-introduce environment variable injection from a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146501 (https://phabricator.wikimedia.org/T391333) (owner: 10Brouberol) [08:09:20] (03CR) 10Brouberol: [C:03+2] Copy app.generic to make the subsequent diff easier to review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146499 (https://phabricator.wikimedia.org/T391333) (owner: 10Brouberol) [08:09:23] (03CR) 10Brouberol: [C:03+2] modules/app/generic: allow the definition of app env vars from a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146500 (https://phabricator.wikimedia.org/T391333) (owner: 10Brouberol) [08:09:26] (03CR) 10Brouberol: [C:03+2] spark-history: re-introduce environment variable injection from a configmap [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1146501 (https://phabricator.wikimedia.org/T391333) (owner: 10Brouberol) [08:09:52] (03CR) 10AOkoth: "Ack. Okay, I'll merge this later then." [puppet] - 10https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [08:10:37] (03Merged) 10jenkins-bot: Copy app.generic to make the subsequent diff easier to review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146499 (https://phabricator.wikimedia.org/T391333) (owner: 10Brouberol) [08:10:46] (03Merged) 10jenkins-bot: modules/app/generic: allow the definition of app env vars from a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146500 (https://phabricator.wikimedia.org/T391333) (owner: 10Brouberol) [08:11:10] (03Merged) 10jenkins-bot: spark-history: re-introduce environment variable injection from a configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146501 (https://phabricator.wikimedia.org/T391333) (owner: 10Brouberol) [08:11:41] FIRING: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:11:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76197 and previous config saved to /var/cache/conftool/dbconfig/20250515-081141-root.json [08:11:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76198 and previous config saved to /var/cache/conftool/dbconfig/20250515-081153-root.json [08:12:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [08:12:40] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply 
[08:13:59] (03CR) 10Elukey: [C:03+2] toolhub: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146483 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:14:05] (03CR) 10Elukey: [C:03+2] wikifeeds: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146484 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:14:13] (03CR) 10Elukey: [C:03+2] zotero: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146485 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:14:21] (03CR) 10Elukey: [C:03+2] growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) (owner: 10Elukey) [08:14:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [08:15:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [08:17:04] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.1 refs T392171 [08:17:07] T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171 [08:20:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76200 and previous config saved to /var/cache/conftool/dbconfig/20250515-082002-root.json [08:20:55] (03CR) 10Vgutierrez: [C:03+1] cache: lua lookup experiment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [08:21:36] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 2517 bytes in 0.118 second 
response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:21:46] (03PS1) 10Brouberol: airflow: upggrade base image to include krenew [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146504 (https://phabricator.wikimedia.org/T394293) [08:21:56] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146505 (https://phabricator.wikimedia.org/T392171) [08:21:57] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146505 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:22:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76201 and previous config saved to /var/cache/conftool/dbconfig/20250515-082238-root.json [08:22:51] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146505 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [08:22:59] (03CR) 10Filippo Giunchedi: [C:03+2] icinga: remove HOSTOUTPUT from vo-host-notify-by-email [puppet] - 10https://gerrit.wikimedia.org/r/1145902 (https://phabricator.wikimedia.org/T264016) (owner: 10Filippo Giunchedi) [08:23:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1187 for testing T264016', diff saved to https://phabricator.wikimedia.org/P76202 and previous config saved to /var/cache/conftool/dbconfig/20250515-082333-marostegui.json [08:23:37] T264016: Host page did not auto-resolve in VO - https://phabricator.wikimedia.org/T264016 [08:26:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76203 and previous config saved to /var/cache/conftool/dbconfig/20250515-082659-root.json [08:30:34] pages about db1187 are expected [08:31:41] (03CR) 10MVernon: [C:03+1] "Looks good to me - feel free to ignore the typo I spotted, 
but it'll make me happy if you do fix it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [08:31:41] RESOLVED: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:31:56] PROBLEM - Host db1187 #page is DOWN: PING CRITICAL - Packet loss = 100% [08:31:59] !incidents [08:31:59] 6124 (UNACKED) Host db1187 (paged) [08:31:59] 6123 (RESOLVED) ProbeDown sre (10.2.2.30 ip4 probes/service eqiad) [08:32:00] 6122 (RESOLVED) ProbeDown sre (10.2.2.30 ip4 search-psi-https:9643 probes/service http_search-psi-https_ip4 eqiad) [08:32:03] !ack 6124 [08:32:04] 6124 (ACKED) Host db1187 (paged) [08:32:13] (03PS1) 10MVernon: Thanos: add new thanos-fe100[5-7] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1146511 (https://phabricator.wikimedia.org/T389635) [08:33:48] RECOVERY - Host db1187 #page is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [08:34:49] !incidents [08:34:50] 6124 (RESOLVED) Host db1187 (paged) [08:34:50] 6123 (RESOLVED) ProbeDown sre (10.2.2.30 ip4 probes/service eqiad) [08:34:50] 6122 (RESOLVED) ProbeDown sre (10.2.2.30 ip4 search-psi-https:9643 probes/service http_search-psi-https_ip4 eqiad) [08:35:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76204 and previous config saved to /var/cache/conftool/dbconfig/20250515-083540-root.json [08:37:23] No more pages about db1187 are expected [08:37:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76205 and previous config saved to /var/cache/conftool/dbconfig/20250515-083744-root.json [08:38:49] (03CR) 10Marostegui: [C:03+1] Thanos: add new thanos-fe100[5-7] 
nodes [puppet] - 10https://gerrit.wikimedia.org/r/1146511 (https://phabricator.wikimedia.org/T389635) (owner: 10MVernon) [08:39:53] (03CR) 10MVernon: [C:03+2] Thanos: add new thanos-fe100[5-7] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1146511 (https://phabricator.wikimedia.org/T389635) (owner: 10MVernon) [08:40:02] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: codfw: introduce support for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146515 (https://phabricator.wikimedia.org/T394099) [08:40:34] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146465 (owner: 10JMeybohm) [08:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:41:42] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146515 (https://phabricator.wikimedia.org/T394099) (owner: 10Arturo Borrero Gonzalez) [08:41:46] (03PS1) 10Fabfur: Remove unused varnishkafka configuration [alerts] - 10https://gerrit.wikimedia.org/r/1146516 (https://phabricator.wikimedia.org/T391810) [08:42:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76206 and previous config saved to /var/cache/conftool/dbconfig/20250515-084204-root.json [08:42:19] (03CR) 10Volans: "I haven't tested but the code looks ok. I've left some optional nits that might simplify some bits, no blocker." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [08:44:32] (03CR) 10Vgutierrez: [C:04-1] hiera: Add zarcillo k8s service on traffic server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [08:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:15] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1177.eqiad.wmnet [08:49:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825309 (10ops-monitoring-bot) Host rebooted by stevemunene@cumin1002 with reason: Rebooting afte... 
[08:50:39] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29762 bytes in 0.438 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:50:40] !log wikitech-static: rm -rf /srv/mediawiki/images/wikitech/archive/* (T338520) [08:50:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76207 and previous config saved to /var/cache/conftool/dbconfig/20250515-085045-root.json [08:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:50] T338520: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 [08:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:52:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc7 T394260', diff saved to https://phabricator.wikimedia.org/P76208 and previous config saved to /var/cache/conftool/dbconfig/20250515-085256-marostegui.json [08:53:00] T394260: Productionize pc8 - https://phabricator.wikimedia.org/T394260 [08:53:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76209 and previous config saved to /var/cache/conftool/dbconfig/20250515-085303-root.json [08:54:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825319 (10Stevemunene) an-worker1177 seems stuck booting with the error [18134415.076569] system... 
[08:54:48] (03PS1) 10Muehlenhoff: imposm-initial-import: Enable the osmupdater DB permissions earlier [puppet] - 10https://gerrit.wikimedia.org/r/1146524 (https://phabricator.wikimedia.org/T381565) [08:57:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76210 and previous config saved to /var/cache/conftool/dbconfig/20250515-085710-root.json [09:04:35] (03Abandoned) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [09:05:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76211 and previous config saved to /var/cache/conftool/dbconfig/20250515-090551-root.json [09:07:55] !log reboot thanos-fe100[5-7] prior to bringing into service T391352 [09:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:58] T391352: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352 [09:08:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76212 and previous config saved to /var/cache/conftool/dbconfig/20250515-090808-root.json [09:10:15] PROBLEM - Host thanos-fe1006 is DOWN: PING CRITICAL - Packet loss = 100% [09:10:27] PROBLEM - Host thanos-fe1005 is DOWN: PING CRITICAL - Packet loss = 100% [09:10:29] PROBLEM - Host thanos-fe1007 is DOWN: PING CRITICAL - Packet loss = 100% [09:11:29] RECOVERY - Host thanos-fe1006 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [09:11:47] (03CR) 10Filippo Giunchedi: [C:03+1] "IIRC alert files should be removed post-deploy, please verify after deploy and a puppet run in /srv/alerts/ops/team-data-engineering_* on " [alerts] - 10https://gerrit.wikimedia.org/r/1146516 
(https://phabricator.wikimedia.org/T391810) (owner: 10Fabfur) [09:11:57] RECOVERY - Host thanos-fe1005 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:11:57] RECOVERY - Host thanos-fe1007 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [09:12:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76213 and previous config saved to /var/cache/conftool/dbconfig/20250515-091216-root.json [09:14:04] (03PS6) 10Vgutierrez: trafficserver: Send /evt-103e/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) [09:14:20] (03PS1) 10Zabe: FlaggablePageView: don't call getId() on null [extensions/FlaggedRevs] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146528 (https://phabricator.wikimedia.org/T394381) [09:15:54] (03CR) 10Alexandros Kosiaris: partman: Add a kubernetes-node-containerd-efi recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris) [09:17:15] (03PS3) 10Alexandros Kosiaris: partman: Add a kubernetes-node-containerd-efi recipe [puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053) [09:17:15] (03PS2) 10Alexandros Kosiaris: preseed: Use EFI recipes for aux-k8s-worker[12]00[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1144627 (https://phabricator.wikimedia.org/T393053) [09:17:54] !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [09:19:01] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:19:07] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:19:07] PROBLEM - Swift https backend on ms-fe2015 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:19:09] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:19:17] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:19:17] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:19:17] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [09:19:52] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [09:19:57] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [09:19:57] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [09:19:59] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [09:20:07] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [09:20:07] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [09:20:07] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [09:20:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 60%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P76214 and previous config saved to /var/cache/conftool/dbconfig/20250515-092056-root.json [09:22:00] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.1 refs T392171 [09:22:03] T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171 [09:22:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good! (To the extent that Partman recipes can look good)" [puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris) [09:23:07] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [09:23:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76215 and previous config saved to /var/cache/conftool/dbconfig/20250515-092314-root.json [09:25:08] jouncebot: nowandnext [09:25:08] For the next 0 hour(s) and 34 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0800) [09:25:09] In 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1000) [09:26:06] (03CR) 10Hnowlan: [C:03+1] imposm-initial-import: Enable the osmupdater DB permissions earlier [puppet] - 10https://gerrit.wikimedia.org/r/1146524 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:26:56] Dreamy_Jazz: feel free to backport [09:26:56] zabe: Do you want to backport the UBN fix? If not, I'm happy to do that. [09:27:01] Thanks. Will do. 
[09:27:05] !log mvernon@cumin1002 conftool action : set/weight=100; selector: name=thanos-fe1005.eqiad.wmnet [09:27:10] !log mvernon@cumin1002 conftool action : set/weight=100; selector: name=thanos-fe1006.eqiad.wmnet [09:27:10] (03CR) 10Dreamy Jazz: [C:03+2] FlaggablePageView: don't call getId() on null [extensions/FlaggedRevs] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146528 (https://phabricator.wikimedia.org/T394381) (owner: 10Zabe) [09:27:15] !log mvernon@cumin1002 conftool action : set/weight=100; selector: name=thanos-fe1007.eqiad.wmnet [09:27:15] is "feel free" a thing in english? [09:27:20] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: name=thanos-fe1005.eqiad.wmnet [09:27:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76216 and previous config saved to /var/cache/conftool/dbconfig/20250515-092721-root.json [09:27:24] or am I just doing a bad translation? [09:27:25] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: name=thanos-fe1006.eqiad.wmnet [09:27:29] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: name=thanos-fe1007.eqiad.wmnet [09:27:35] yeap. it's a thing. sounds fluent [09:27:38] "Feel free" is a thing in english [09:28:38] (03Merged) 10jenkins-bot: FlaggablePageView: don't call getId() on null [extensions/FlaggedRevs] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146528 (https://phabricator.wikimedia.org/T394381) (owner: 10Zabe) [09:28:44] hi train people. We have a CI blocker on Wikibase - is there anything that speaks against me +2'ing a patch to the zuul config and redeploying it right now? (https://gerrit.wikimedia.org/r/c/integration/config/+/1146520) [09:28:46] nice [09:29:19] codders: zuul config seems like a #wikimedia-releng question [09:29:26] yeah. 
bit quiet over there [09:29:36] just wanted to make sure it wouldn't interfere with operations [09:29:53] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1146528|FlaggablePageView: don't call getId() on null (T394381)]] [09:29:56] T394381: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T394381 [09:30:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [09:30:45] (03Abandoned) 10Hnowlan: mw::maintenance: migrate refreshLinkRecommendations s1 shard to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143528 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [09:30:51] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1177.eqiad.wmnet [09:30:55] codders: that shouldn't affect the train [09:31:03] (y) thanks! [09:31:43] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10825378 (10MatthewVernon) [09:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:34:18] (03CR) 10Alexandros Kosiaris: [C:03+2] "lol, agreed! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris) [09:34:33] (03CR) 10Alexandros Kosiaris: [C:03+2] preseed: Use EFI recipes for aux-k8s-worker[12]00[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1144627 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris) [09:36:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76217 and previous config saved to /var/cache/conftool/dbconfig/20250515-093602-root.json [09:36:36] !log dreamyjazz@deploy1003 dreamyjazz, zabe: Backport for [[gerrit:1146528|FlaggablePageView: don't call getId() on null (T394381)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:36:39] T394381: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T394381 [09:37:11] !log dreamyjazz@deploy1003 dreamyjazz, zabe: Continuing with sync [09:37:47] https://test2.wikipedia.org/wiki/Testpage1 no longer has a fatal error. Couldn't reproduce with the `action=veedit` so maybe you have to press save for that case. [09:38:06] *`veaction=edit` [09:39:02] !log isaranto@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:44:02] !log isaranto@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:44:24] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[09:45:54] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146528|FlaggablePageView: don't call getId() on null (T394381)]] (duration: 16m 00s) [09:45:57] T394381: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T394381 [09:50:22] Dreamy_Jazz: I'm seeing the bug on a test server one minute after the backport synchronized there [09:50:37] is it possible the bug is still present? [09:50:49] Hmm. I was testing using https://test2.wikipedia.org/wiki/Testpage1 [09:51:04] https://usercontent.irccloud-cdn.com/file/S6Euohjg/image.png [09:51:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76218 and previous config saved to /var/cache/conftool/dbconfig/20250515-095108-root.json [09:51:13] I can't seem to reproduce the error now. [09:51:17] Using that URL [09:51:58] Dreamy_Jazz: sounds good, maybe it was some lag when generating the logstash timestamp [09:52:07] thank you! [09:53:07] ok, rolling forward the train [09:53:29] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146535 (https://phabricator.wikimedia.org/T392171) [09:53:30] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146535 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [09:54:20] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146535 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot) [09:57:14] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10825452 (10taavi) The site at http://ec2-54-81-201-239.compute-1.amazonaws.com/ seems to embed images from `upload.wikimedia.org`, for pages like n... 
[09:58:53] (03PS6) 10Hnowlan: mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) [09:59:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1000) [10:00:15] please be aware the train is still running [10:04:46] (03CR) 10Cathal Mooney: [C:03+1] cloudgw: codfw: introduce support for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146515 (https://phabricator.wikimedia.org/T394099) (owner: 10Arturo Borrero Gonzalez) [10:05:04] (03CR) 10Cathal Mooney: [C:03+2] Enable link-protection on OSPF links on EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [10:05:45] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1074 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), nnwiki_content_1727897783[0](2025-05-12T14:48:52.049Z), enwikiquote_content_1727930976[0](2025-05-12T14:50:54.219Z), ruwiki_content_1727993503[6](2025-05-12T15:10:07.603Z) https://wikitech.wikimedia.org/wiki/Search%23Administrati [10:05:49] PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1074 is CRITICAL: CRITICAL - azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z), skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z), mlwikiquote_content_1728089481[0](2025-05-12T17:12:12.133Z), id_internalwikimedia_content_1717526458[0](2025-05-12T14:44:10.676Z), urwiktionary_content_1728117663[0](2025-05-12T17:12:25.526Z), 
sdwiki_content_1728047554[0](20 [10:05:49] T17:12:52.192Z), fiwikibooks_content_1728060458[0](2025-05-12T14:44:16.723Z), newiktionary_content_1728013854[0](2025-05-12T14:44:11.709Z), ukwiktionary_content_1728125590[0](2025-05-12T14:44:51.053Z), kabwiki_content_1727944513[0](2025-05-12T17:12:24.299Z), ocwiktionary_content_1728036052[0](2025-05-12T17:12:45.626Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [10:05:52] (03Merged) 10jenkins-bot: Enable link-protection on OSPF links on EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [10:07:14] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.1 refs T392171 [10:07:18] T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171 [10:08:26] !log depool thanos-fe100[1-3] prior to decom T391352 [10:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:30] T391352: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352 [10:10:11] jouncebot: now [10:10:12] For the next 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1000) [10:10:22] jnuche: train is still running? [10:10:49] ah yes, please ping me when you are done :) [10:11:54] (03CR) 10Fabfur: [C:04-1] "You mean after disabling varnishkafka everywhere? 
I'm ok with that, I'll flag this with a -1 as reminder" [alerts] - 10https://gerrit.wikimedia.org/r/1146516 (https://phabricator.wikimedia.org/T391810) (owner: 10Fabfur) [10:12:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.538s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:12:37] effie: just finished and things look healthy enough [10:12:41] please go ahead :) [10:13:13] cheers! [10:14:35] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli) [10:14:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:15:10] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ad [10:15:24] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ad [10:16:04] (03Merged) 10jenkins-bot: mcrouter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli) [10:17:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:19:34] (03CR) 10Stevemunene: [C:03+1] airflow: cleanup deployment charts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [10:19:43] (03CR) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [10:19:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:19:49] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [10:20:54] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ad [10:20:55] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local-public.ad [10:21:08] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1145093 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [10:21:31] !log mw-mcrouter minor update, memcached errors are expected [10:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:48] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1145094 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [10:21:55] (03PS1) 10Hnowlan: trafficserver: route mobileapps apis for zhwiki via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1146544 (https://phabricator.wikimedia.org/T393591) [10:23:46] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [10:23:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825606 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cu... [10:25:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:26:45] ^ expected [10:27:54] PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1075 is CRITICAL: CRITICAL - enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), nnwiki_content_1727897783[0](2025-05-12T14:48:52.049Z), ruwiki_content_1727993503[6](2025-05-12T15:10:07.603Z), enwikiquote_content_1727930976[0](2025-05-12T14:50:54.219Z) https://wikitech.wikimedia.org/wiki/Search%23Administrati [10:29:20] (03CR) 10Jgiannelos: [C:03+1] trafficserver: route mobileapps apis for zhwiki via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1146544 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [10:29:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:29:53] jouncebot: nowandnext [10:29:54] For the next 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1000) [10:29:54] In 1 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1200) [10:29:56] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ad [10:29:57] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local-public.ad [10:30:04] (03CR) 10Elukey: [C:03+1] imposm-initial-import: Enable the osmupdater DB permissions earlier [puppet] - 10https://gerrit.wikimedia.org/r/1146524 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [10:30:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:30:50] hnowlan: I am deploying mcrouter [10:31:02] (03CR) 10Hnowlan: [C:03+2] trafficserver: route mobileapps apis for zhwiki via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1146544 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan) [10:32:13] effie: my change shouldn't interfere [10:32:15] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ad [10:32:23] (03PS7) 10Vgutierrez: trafficserver: Send /evt-103e/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) [10:32:37] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of 
wikipedia-commons-local-public.ad [10:33:36] hnowlan: excellent! [10:34:01] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:34:03] (03CR) 10Kamila Součková: [C:03+1] mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:34:24] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f2 [10:34:58] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f2 [10:35:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:35:47] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: keystone: Update ACLs for cloud-private v6 [puppet] - 10https://gerrit.wikimedia.org/r/1145093 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [10:35:54] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: rabbitmq: Add cloud-private v6 nets to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1145094 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [10:36:07] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cd [10:36:09] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of 
wikipedia-commons-local-public.cd [10:36:35] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [10:37:26] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cd [10:38:01] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cd [10:38:40] (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-05-15-103617-production [puppet] - 10https://gerrit.wikimedia.org/r/1146546 [10:39:01] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:40:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:41:04] (03PS1) 10Stevemunene: hdfs: add an-worker1177 to in retup role [puppet] - 10https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171) [10:43:30] RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (32816 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [10:44:34] (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-05-15-103617-production [puppet] - 10https://gerrit.wikimedia.org/r/1146546 (owner: 10Majavah) [10:44:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:45:10] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:47:33] (03PS1) 10Gkyziridis: ml-services: edit-check cpu/gpu deployment experimental staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) [10:47:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825707 (10Stevemunene) Host is still stuck, checking the partman recipe and trying the reimage.... [10:48:08] (03CR) 10Btullis: hdfs: add an-worker1177 to in retup role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171) (owner: 10Stevemunene) [10:48:45] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cd [10:49:02] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cd [10:49:25] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cde [10:49:26] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local-public.cde [10:49:30] (03CR) 10Stevemunene: hdfs: add an-worker1177 to in retup role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171) (owner: 10Stevemunene) [10:49:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:50:42] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395 (10Neslihan_Turan_WMDE) 03NEW [10:53:05] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [10:53:09] (03PS1) 10Btullis: dumps: Add the addschanges.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146549 (https://phabricator.wikimedia.org/T394389) [10:53:17] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [10:53:29] (03CR) 10Btullis: [C:03+1] hdfs: add an-worker1177 to in retup role [puppet] - 10https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171) (owner: 10Stevemunene) [10:53:51] (03CR) 10Stevemunene: [C:03+2] Revert "hdfs: Exclude rack F3 hosts from analytics cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1145943 (https://phabricator.wikimedia.org/T390171) (owner: 10Stevemunene) [10:56:31] 06SRE, 06Traffic: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10825755 (10Vgutierrez) 05Open→03In progress p:05Triage→03Unbreak! Let's encrypt already stopped including OCSP urls in new certificates and it's already caus... 
[10:56:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152319 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:57:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:58:05] (03CR) 10Stevemunene: [C:03+2] hdfs: add an-worker1177 to in retup role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171) (owner: 10Stevemunene) [10:59:21] memcached errors are expected [11:01:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:02:12] (03PS1) 10Vgutierrez: profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs [puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) [11:02:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:02:32] (03PS1) 10Fabfur: submodule update for deploy [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1146551 [11:02:54] (03PS2) 10Vgutierrez: profile,cache: Stop monitoring OCSP freshness for acme-chief 
managed certs [puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) [11:02:57] (03CR) 10AikoChou: ml-services: edit-check cpu/gpu deployment experimental staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [11:02:59] (03PS9) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [11:03:44] RECOVERY - Hadoop DataNode on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [11:04:35] !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1177.eqiad.wmnet with OS bullseye [11:04:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825773 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1... [11:05:11] (03CR) 10Lucas Werkmeister (WMDE): "It looks like the setting only became unused in wmf.1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [11:05:13] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1156.eqiad.wmnet [11:05:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825774 (10ops-monitoring-bot) Host rebooted by stevemunene@cumin1002 with reason: Rebooting afte... 
[11:05:59] (03Abandoned) 10Fabfur: submodule update for deploy [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1146551 (owner: 10Fabfur) [11:06:03] (03PS1) 10Vgutierrez: ncredir: Stop using OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821) [11:06:40] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:06:51] (03PS2) 10Vgutierrez: ncredir: Stop using OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821) [11:07:02] (03PS1) 10Fabfur: New deploy for last modification [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1146553 [11:07:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:08:01] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:08:02] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:08:40] (03CR) 10Fabfur: [V:03+2 C:03+2] New deploy for last modification [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1146553 (owner: 10Fabfur) [11:09:04] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:09:58] !log fabfur@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Minor template modification - fabfur@cumin1002" 
[11:10:00] !log fabfur@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Minor template modification - fabfur@cumin1002 [11:10:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:10:33] !log fabfur@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Minor template modification - fabfur@cumin1002 [11:10:34] !log fabfur@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Minor template modification - fabfur@cumin1002" [11:11:01] (03PS1) 10Muehlenhoff: apt: Remove OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821) [11:11:10] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:11:20] (03PS1) 10Vgutierrez: wikidough: Stop using OCSP [puppet] - 10https://gerrit.wikimedia.org/r/1146556 (https://phabricator.wikimedia.org/T370821) [11:11:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821) (owner: 10Muehlenhoff) [11:12:02] (03CR) 10Vgutierrez: [C:03+2] ncredir: Stop using OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:12:28] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1156.eqiad.wmnet [11:13:39] FIRING: TransitBGPDown: Transit BGP session down between 
cr2-eqsin and Hurricane Electric (2001:df5:b800:bb00::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr2-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBG [11:14:04] (03PS2) 10Gkyziridis: ml-services: edit-check cpu/gpu deployment experimental staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) [11:14:17] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146556 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:15:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:15:19] (03CR) 10Vgutierrez: [C:03+1] apt: Remove OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821) (owner: 10Muehlenhoff) [11:16:01] (03CR) 10Ssingh: [C:03+1] wikidough: Stop using OCSP [puppet] - 10https://gerrit.wikimedia.org/r/1146556 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:16:08] (03CR) 10Fabfur: [C:03+1] apt: Remove OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821) (owner: 10Muehlenhoff) [11:16:36] (03CR) 10Muehlenhoff: [C:03+2] apt: Remove OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821) (owner: 10Muehlenhoff) [11:17:10] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:12] PROBLEM - 
HTTPS non-canonical-redirect-9 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:12] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:12] (03CR) 10Vgutierrez: [C:03+2] wikidough: Stop using OCSP [puppet] - 10https://gerrit.wikimedia.org/r/1146556 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:17:14] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:14] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:34] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:40] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:40] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:42] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:42] PROBLEM - HTTPS 
non-canonical-redirect-1 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:42] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:42] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:42] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:50] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:54] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:17:54] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:18:00] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:18:00] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir 
[11:18:10] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:12] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:18:12] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:18:12] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:18:16] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:18:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825823 (10Stevemunene) Host an-worker1156 is getting onboarded to the cluster {F60011424} [11:18:39] RESOLVED: TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (2001:df5:b800:bb00::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr2-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransit [11:18:47] (03CR) 10Gkyziridis: ml-services: edit-check cpu/gpu deployment experimental staging (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [11:19:16] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:16] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:16] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:16] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:23] (03CR) 10Stevemunene: [C:03+1] dumps: Add the addschanges.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146549 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [11:19:52] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:52] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:52] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:52] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to 
connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:52] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:53] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:53] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:54] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:19:54] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:10] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:10] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:12] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:14] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir7002 is 
CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:14] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:16] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:18] sigh... sorry about the flood [11:20:38] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:38] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:38] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:40] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:42] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:42] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir 
[11:20:42] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:42] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:42] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:42] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:20:43] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:21:02] !log sudo cumin -b1 -s10 "A:wikidough" "run-puppet-agent": T370821 [11:21:04] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:21:04] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:21:04] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:08] T370821: Research and 
respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821 [11:21:23] (03PS3) 10Gkyziridis: ml-services: edit-check cpu/gpu deployment experimental staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) [11:21:42] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:21:54] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:00] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:04] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:04] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:04] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:05] (03CR) 10Ssingh: profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:22:10] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir2002 is CRITICAL: SSL CRITICAL - 
failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:12] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:38] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:38] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:40] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:22:53] sukhe: is your patch fixing the above alerts? [11:23:07] volans: yes [11:23:09] volans: vg's patch is going to fix that but we are still missing one thing [11:23:13] (on it) [11:23:13] found the answer in the backlog :D [11:23:15] thx [11:23:25] was hidden in the flood :D [11:23:29] sorry for the nosie. [11:23:32] *noise. 
[11:23:36] no worries [11:23:37] caught us by surprise :) [11:23:41] (03CR) 10Vgutierrez: profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:23:42] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:23:42] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:23:42] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:23:42] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:23:42] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:23:42] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:23:46] I am going to silence [11:23:49] k [11:24:10] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:24:12] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir6001 is CRITICAL: 
SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:24:14] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:24:14] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:24:19] (03CR) 10Ssingh: [C:03+1] profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs [puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:24:26] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:25:39] (03PS1) 10Vgutierrez: ncredir: Stop requiring OCSP on ssl monitor [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821) [11:26:08] (03PS2) 10Vgutierrez: ncredir: Stop requiring OCSP on ssl monitor [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821) [11:26:20] (03CR) 10AikoChou: [C:03+1] "Thanks for taking care of this issue!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [11:26:41] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:27:00] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:02] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:02] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:06] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:06] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:12] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:14] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:26] (03CR) 10Vgutierrez: [C:03+2] profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs 
[puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:27:38] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:46] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:54] (03CR) 10Gkyziridis: [C:03+2] ml-services: edit-check cpu/gpu deployment experimental staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis) [11:27:54] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:54] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:27:54] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:28:00] PROBLEM - HTTPS on apt1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/APT_repository [11:28:12] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:28:12] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir3004 is CRITICAL: SSL CRITICAL 
- failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:28:12] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:28:14] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:28:14] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:28:32] !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 14 hosts with reason: monitoring alerts [11:28:36] it's a blanket downtime but controlling the flood [11:28:43] will monitor and remove individually [11:30:33] (03CR) 10Ssingh: [C:03+1] ncredir: Stop requiring OCSP on ssl monitor [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:30:39] (03CR) 10Vgutierrez: [C:03+2] ncredir: Stop requiring OCSP on ssl monitor [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez) [11:30:58] (03PS1) 10Muehlenhoff: Fix apt.wikimedia.org health check now that OCSP is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1146562 [11:31:15] (03CR) 10Ssingh: [C:03+1] Fix apt.wikimedia.org health check now that OCSP is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1146562 (owner: 10Muehlenhoff) [11:31:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:31:55] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:33:08] 06SRE, 06Traffic, 13Patch-For-Review: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10825867 (10Vgutierrez) p:05Unbreak!→03High [11:35:22] (03CR) 10Muehlenhoff: [C:03+2] Fix apt.wikimedia.org health check now that OCSP is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1146562 (owner: 10Muehlenhoff) [11:35:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10825871 (10Jclark-ctr) @MatthewVernon Nvme for os drives require uefi booting [11:39:29] (03CR) 10Btullis: [C:03+2] dumps: Add the addschanges.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146549 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [11:40:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host apus-be1004.eqiad.wmnet with OS bookworm [11:40:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10825899 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm [11:40:58] (03Merged) 10jenkins-bot: dumps: Add the addschanges.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146549 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [11:41:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152319 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:41:49] (03PS1) 10Muehlenhoff: Remove now unused and obsolete LE OCSP health check [puppet] - 10https://gerrit.wikimedia.org/r/1146563 (https://phabricator.wikimedia.org/T370821) [11:41:55] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir3004 is OK: SSL OK - Certificate wikipedia.fi valid until 2025-06-27 04:31:45 +0000 (expires in 42 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:42:13] RECOVERY - HTTPS non-canonical-redirect-11 on ncredir3004 is OK: SSL OK - Certificate weekipedia.com valid until 2025-08-03 15:53:03 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:42:13] RECOVERY - HTTPS non-canonical-redirect-8 on ncredir3004 is OK: SSL OK - Certificate wikimediacommons.uk valid until 2025-07-15 15:17:13 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:42:13] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir3004 is OK: SSL OK - Certificate *.wikispecies.net valid until 2025-07-19 04:44:00 +0000 (expires in 64 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:42:15] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir3004 is OK: SSL OK - Certificate *.wikimania.com valid until 2025-07-19 06:44:30 +0000 (expires in 64 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:42:15] RECOVERY - HTTPS non-canonical-redirect-9 on ncredir3004 is OK: SSL OK - Certificate wikipediashop.com valid until 2025-07-22 18:14:58 +0000 (expires in 68 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:42:37] (03PS1) 10Kamila Součková: mw::maintenance: migrate growthexperiments-updateIsActiveFlagForMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146566 (https://phabricator.wikimedia.org/T385782) [11:42:51] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146566 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [11:43:17] (03CR) 10Ssingh: [C:03+1] "Makes sense, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1146563 (https://phabricator.wikimedia.org/T370821) (owner: 10Muehlenhoff) [11:43:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [11:43:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1... [11:44:08] (03PS1) 10Dreamy Jazz: CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist [extensions/AbuseFilter] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146568 (https://phabricator.wikimedia.org/T394267) [11:44:27] jouncebot: nowandnext [11:44:27] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [11:44:27] In 0 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1200) [11:44:34] ncredir alerts should be clearing up [11:44:44] I am removing the downtime so that we get alerted about other stuff. please ignore the noise for a bit. 
[11:44:47] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for 14 hosts [11:44:53] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts [11:44:57] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:44:57] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:44:57] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:44:57] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:44:57] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:44:57] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:44:57] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:44:58] PROBLEM - HTTPS non-canonical-redirect-9 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir 
[11:44:58] PROBLEM - HTTPS non-canonical-redirect-8 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:44:59] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:44:59] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:00] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:00] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:01] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:01] PROBLEM - HTTPS non-canonical-redirect-10 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:02] PROBLEM - HTTPS non-canonical-redirect-7 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:03] (03CR) 10Jforrester: [C:03+1] CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist [extensions/AbuseFilter] (wmf/1.45.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/1146568 (https://phabricator.wikimedia.org/T394267) (owner: 10Dreamy Jazz) [11:45:04] !log removing downtime on A:ncredir [11:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:11] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:45:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/AbuseFilter] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146568 (https://phabricator.wikimedia.org/T394267) (owner: 10Dreamy Jazz) [11:45:41] (03PS1) 10Kamila Součková: mw::maintenance: migrate growthexperiments-refreshPraiseworthyMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146569 (https://phabricator.wikimedia.org/T385782) [11:45:55] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146569 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [11:45:57] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:57] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:57] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:57] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:57] PROBLEM - HTTPS non-canonical-redirect-9 
on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:45:57] PROBLEM - HTTPS non-canonical-redirect-11 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir [11:46:12] ^ agent is running so these should clear up [11:46:35] (03PS1) 10Muehlenhoff: Remove krb1001 from the list of KDCs presented to clients [puppet] - 10https://gerrit.wikimedia.org/r/1146570 (https://phabricator.wikimedia.org/T390863) [11:48:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 18.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:48:58] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146566 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [11:50:17] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:51:39] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [11:52:17] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 422, down: 5, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 16.77% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:54:57] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir2001 is OK: SSL OK - Certificate wikipedia.com valid until 2025-07-28 21:32:43 +0000 (expires in 74 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:54:57] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir2001 is OK: SSL OK - Certificate wikipedia.fi valid until 2025-06-27 04:31:45 +0000 (expires in 42 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:54:57] RECOVERY - HTTPS non-canonical-redirect-8 on ncredir2001 is OK: SSL OK - Certificate wikimediacommons.uk valid until 2025-07-15 15:17:13 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:54:57] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir2002 is OK: SSL OK - Certificate wikipedia.fi valid until 2025-06-27 04:31:45 +0000 (expires in 42 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:54:57] RECOVERY - HTTPS non-canonical-redirect-10 on ncredir2001 is OK: SSL OK - Certificate wikipediya.org valid until 2025-08-04 16:51:59 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:54:57] RECOVERY - HTTPS non-canonical-redirect-7 on ncredir2001 is OK: SSL OK - Certificate wikipedia.ro valid until 2025-07-01 19:44:46 +0000 (expires in 47 days) https://wikitech.wikimedia.org/wiki/Ncredir [11:56:10] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10825944 (10MatthewVernon) @Jclark-ctr EFI booting is fine (I thought I'd said as much on a 
previous ticket, but may have missed it); I don't want the OS on the NVME drive, the O... [11:56:39] RESOLVED: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [11:56:49] (03Merged) 10jenkins-bot: CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist [extensions/AbuseFilter] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146568 (https://phabricator.wikimedia.org/T394267) (owner: 10Dreamy Jazz) [11:56:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10825945 (10Jclark-ctr) The boss card is 2x m2 nvme drives [11:57:05] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1146568|CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist (T394267)]] [11:57:08] T394267: PHP Deprecated: Use of MediaWiki\Extension\AbuseFilter\BlockedDomains\CustomBlockedDomainStorage::validateDomain was deprecated in MediaWiki 1.44. 
[Called from MediaWiki\Extension\VisualEditor\EditCheck\ApiEditCheckReferenceUrl - https://phabricator.wikimedia.org/T394267 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1200) [12:00:09] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:27] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [12:02:37] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [12:03:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [12:03:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [12:03:41] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1146568|CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist (T394267)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:03:44] T394267: PHP Deprecated: Use of MediaWiki\Extension\AbuseFilter\BlockedDomains\CustomBlockedDomainStorage::validateDomain was deprecated in MediaWiki 1.44. 
[Called from MediaWiki\Extension\VisualEditor\EditCheck\ApiEditCheckReferenceUrl - https://phabricator.wikimedia.org/T394267 [12:03:47] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [12:05:09] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152319 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:09:11] (03CR) 10Muehlenhoff: [C:03+2] imposm-initial-import: Enable the osmupdater DB permissions earlier [puppet] - 10https://gerrit.wikimedia.org/r/1146524 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:09:16] jclark@cumin1002 reimage (PID 2241989) is awaiting input [12:10:13] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [12:10:35] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146568|CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist (T394267)]] (duration: 13m 30s) [12:10:39] T394267: PHP Deprecated: Use of MediaWiki\Extension\AbuseFilter\BlockedDomains\CustomBlockedDomainStorage::validateDomain was deprecated in MediaWiki 1.44. 
[Called from MediaWiki\Extension\VisualEditor\EditCheck\ApiEditCheckReferenceUrl - https://phabricator.wikimedia.org/T394267 [12:11:02] (03CR) 10Brouberol: [C:03+1] airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene) [12:31:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2070:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2070 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:32:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10826027 (10MatthewVernon) Oh, right, yes, we want the OS on that (which I thought was going to be presented to the OS as a single device, doing RAID-1 in hardware), sorry. [12:35:11] (03CR) 10Sbisson: [C:04-2] "Yes, I was going to re-evaluate this morning and indeed it's too early. I'll consider it again early next week." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [12:38:57] (03CR) 10Muehlenhoff: [C:03+2] Remove now unused and obsolete LE OCSP health check [puppet] - 10https://gerrit.wikimedia.org/r/1146563 (https://phabricator.wikimedia.org/T370821) (owner: 10Muehlenhoff) [12:40:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2070:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2070 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:41:35] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1146579 (owner: 10Slyngshede) [12:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:45:22] (03PS1) 10Klausman: preseed: Switch soon-to-arrive ML GPU hosts to using EFI [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) [12:45:34] (03CR) 10Klausman: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [12:46:39] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [12:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:14] (03PS1) 10Andrew Bogott: network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 
10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099) [12:50:35] (03PS1) 10DDesouza: Design Research participant recruitment survey on eswiki: Deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146599 (https://phabricator.wikimedia.org/T394315) [12:51:03] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Jonathan Tweed - https://phabricator.wikimedia.org/T394308#10826075 (10Bmueller) Approved, thanks! [12:51:19] (03CR) 10Andrew Bogott: "equivalent netbox change is done: https://netbox.wikimedia.org/search/?q=octavia-lb-mgmt" [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [12:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:52:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146599 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [12:52:16] jhancock@cumin2002 netbox (PID 3706992) is awaiting input [12:53:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-be1004.eqiad.wmnet with OS bookworm [12:54:04] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10826078 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm executed with errors: - apus-be... 
[12:55:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generic update - jhancock@cumin2002" [12:55:09] (03PS1) 10Clément Goubert: python-webapp: Include base.networkpolicy.egress.mariadb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146601 [12:55:09] (03PS1) 10Clément Goubert: zarcillo: Fix ingress and egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146602 [12:55:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2070:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2070 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:55:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generic update - jhancock@cumin2002" [12:55:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:55:59] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [12:57:33] (03CR) 10Elukey: [C:03+1] Remove krb1001 from the list of KDCs presented to clients [puppet] - 10https://gerrit.wikimedia.org/r/1146570 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [12:58:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826090 (10VRiley-WMF) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1300). [13:00:05] MichaelG_WMF: A patch you scheduled for UTC afternoon backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:47] Hey hey :) [13:05:24] (03CR) 10Volans: "I left some suggestions inline that should simplify a bit the approach, but there is no blocker beside the limited check in case of passin" [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [13:06:08] (03PS2) 10Andrew Bogott: network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099) [13:10:46] (03PS3) 10Andrew Bogott: network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099) [13:14:13] o/ [13:14:17] I can deploy! [13:14:51] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [13:15:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10826146 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-wo... 
[13:15:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145184 (https://phabricator.wikimedia.org/T392869) (owner: 10Urbanecm) [13:15:31] (03CR) 10Clément Goubert: "Without `startingDeadlineSeconds`, as I understand it, it'll "miss" a scheduling every 10s, so if the job overruns its next scheduling tim" [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [13:15:41] (03CR) 10Muehlenhoff: [C:03+2] Remove krb1001 from the list of KDCs presented to clients [puppet] - 10https://gerrit.wikimedia.org/r/1146570 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [13:16:12] 06SRE, 06Traffic, 13Patch-For-Review: Lower geodns TTLs for dyna.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10826148 (10Vgutierrez) p:05Triage→03Medium [13:16:33] (03Merged) 10jenkins-bot: [Growth] eswiki: Bump mentorship to 70% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145184 (https://phabricator.wikimedia.org/T392869) (owner: 10Urbanecm) [13:16:46] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1145184|[Growth] eswiki: Bump mentorship to 70% of users (T392869)]] [13:16:50] T392869: Incrementally increase mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T392869 [13:18:25] (03CR) 10Hnowlan: "Fair point! Given the amount of nuance here I might just remove this setting for this job as part of this change for the time being. 
There" [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [13:18:33] 06SRE, 06Traffic, 13Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10826152 (10ssingh) [13:18:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr2-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:19:52] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1072.eqiad.wmnet with OS bookworm [13:19:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826158 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1072.eqiad.wmnet with OS bookworm [13:20:47] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1074.eqiad.wmnet with OS bookworm [13:20:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1074.eqiad.wmnet with OS bookworm [13:22:41] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1071.eqiad.wmnet with OS bookworm [13:22:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826169 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm [13:22:54] !log lucaswerkmeister-wmde@deploy1003 urbanecm, lucaswerkmeister-wmde: Backport for [[gerrit:1145184|[Growth] eswiki: Bump mentorship to 70% of users (T392869)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:22:57] T392869: Incrementally increase mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T392869 [13:23:19] MichaelG_WMF: please test :) [13:23:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:23:46] Lucas_WMDE: sorry, I missed your earlier message [13:23:53] * MichaelG_WMF is looking [13:24:26] (03CR) 10Cathal Mooney: [C:03+1] network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [13:24:35] (03PS2) 10Klausman: preseed: Switch soon-to-arrive ML GPU hosts to using EFI [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) [13:24:43] (03CR) 10Klausman: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [13:25:20] (03CR) 10Andrew Bogott: [C:03+2] network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [13:26:01] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1073.eqiad.wmnet with OS bookworm [13:26:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826186 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm [13:26:39] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5557/co" [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [13:27:40] (03PS3) 10AOkoth: wmnet: create os-reports record [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) [13:29:19] (03PS1) 10Andrew Bogott: Octavia: upgrade amphora boot/mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/1146622 (https://phabricator.wikimedia.org/T394099) [13:29:38] (03CR) 10Klausman: [V:03+1 C:03+2] preseed: Switch soon-to-arrive ML GPU hosts to using EFI [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman) [13:30:30] (03CR) 10Andrew Bogott: [C:03+2] cloudgw: codfw: introduce support for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146515 (https://phabricator.wikimedia.org/T394099) (owner: 10Arturo Borrero Gonzalez) [13:30:35] @Lucas_WMDE not seeing any errors, though got myself blocked trying to create an account on spanish wikipedia [13:30:41] :( [13:30:48] !log lucaswerkmeister-wmde@deploy1003 urbanecm, lucaswerkmeister-wmde: Continuing with sync [13:31:13] (03CR) 10Andrew Bogott: [C:03+2] Octavia: upgrade amphora boot/mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/1146622 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [13:31:32] this is not really something that I expect to fail, and it is just changing the percentage of new users that might get a mentor [13:31:57] yeah [13:32:38] (03CR) 10AOkoth: [C:03+2] wmnet: create os-reports record [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:33:46] (03CR) 10AOkoth: [C:03+2] add os-reports to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:34:09] (03CR) 10AOkoth: [C:03+2] trafficserver: update os-reports replacment url [puppet] - 10https://gerrit.wikimedia.org/r/1145192 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:34:27] (03CR) 10AOkoth: [C:03+2] trafficserver: update os-reports replacment url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145192 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [13:34:56] !log aokoth@dns1004 START - running authdns-update [13:35:27] (03PS1) 10Clément Goubert: mediawiki: Add mwcron.suspended_jobs list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) [13:35:28] (03PS1) 10Clément Goubert: mw-cron: Suspend growthexperiments-listtaskcounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019) [13:36:16] !log aokoth@dns1004 END - running authdns-update [13:36:34] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1072.eqiad.wmnet with reason: host reimage [13:36:46] (03CR) 10CI reject: [V:04-1] mediawiki: Add mwcron.suspended_jobs list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert) [13:36:53] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1074.eqiad.wmnet with reason: host reimage [13:37:08] (03CR) 10CI reject: [V:04-1] mw-cron: Suspend growthexperiments-listtaskcounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019) (owner: 10Clément Goubert) [13:37:26] !log 
lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145184|[Growth] eswiki: Bump mentorship to 70% of users (T392869)]] (duration: 20m 39s) [13:37:29] T392869: Incrementally increase mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T392869 [13:37:38] (03PS1) 10Mhorsey: release CampaignEvents to cbk-zam wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146628 (https://phabricator.wikimedia.org/T393604) [13:38:31] !log UTC afternoon backport+config window done [13:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:45] also found a SpiderPig bug (T394411), yay [13:38:45] T394411: “Show sensitive information” checkbox broken, suspends terminal - https://phabricator.wikimedia.org/T394411 [13:38:54] (03PS2) 10Clément Goubert: mediawiki: Add mwcron.suspended_jobs list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) [13:38:54] (03PS2) 10Clément Goubert: mw-cron: Suspend growthexperiments-listtaskcounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019) [13:39:11] (03PS1) 10Majavah: P:wmcs: cloudgw: Remove internet access for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) [13:39:57] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5559/co" [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah) [13:40:08] @Lucas_WMDE Thank you for running the window! 
🙏 [13:40:22] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1072.eqiad.wmnet with reason: host reimage [13:41:16] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5560/co" [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah) [13:41:29] np :) [13:41:34] (03PS2) 10Majavah: P:wmcs: cloudgw: Remove internet access for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) [13:42:45] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah) [13:43:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:43:41] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1074.eqiad.wmnet with reason: host reimage [13:44:43] (03CR) 10Kamila Součková: "From k8s docs:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert) [13:45:40] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1073.eqiad.wmnet with OS bookworm [13:45:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826301 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm 
executed... [13:46:18] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1073.eqiad.wmnet with OS bookworm [13:46:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm [13:46:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146628 (https://phabricator.wikimedia.org/T393604) (owner: 10Mhorsey) [13:49:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826313 (10VRiley-WMF) [13:49:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [13:50:18] (03CR) 10Andrew Bogott: [C:03+1] P:wmcs: cloudgw: Remove internet access for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah) [13:51:22] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs: cloudgw: Remove internet access for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah) [13:52:14] (03CR) 10Clément Goubert: "It's even worse than that, without `startingDeadlineSeconds`, if the CronJob has been suspended for more than 100 scheduled executions, th" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert) [13:54:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [13:56:34] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10826364 (10MoritzMuehlenhoff) [13:57:03] !log installing openjdk-8 security updates [13:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:16] (03PS1) 10Jforrester: Merge remote-tracking branch 'origin/master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146631 (https://phabricator.wikimedia.org/T341775) [13:58:04] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [13:58:34] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [13:58:35] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1072.eqiad.wmnet with OS bookworm [13:58:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826396 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1072.eqiad.wmnet with OS bookworm completed... 
[13:59:15] (03PS1) 10Jforrester: Stabilization: convert deprecated Xml methods to Html [extensions/FlaggedRevs] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146634 (https://phabricator.wikimedia.org/T394403) [14:00:22] 06SRE, 10Observability-Metrics: Rework the Pyrra list dashboard - https://phabricator.wikimedia.org/T394415 (10elukey) 03NEW [14:01:21] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [14:01:35] !log pfischer@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:02:05] 06SRE, 10Observability-Metrics: Rework the Pyrra list dashboard - https://phabricator.wikimedia.org/T394415#10826456 (10elukey) [14:03:26] !log pfischer@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:03:33] vriley@cumin1002 reimage (PID 2257605) is awaiting input [14:04:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [14:04:10] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1074.eqiad.wmnet with OS bookworm [14:04:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1074.eqiad.wmnet with OS bookworm completed... 
[14:04:36] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1073.eqiad.wmnet with OS bookworm [14:04:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826491 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm executed... [14:05:32] (03CR) 10BBlack: [C:03+1] "Seems low-risk and beneficial at this point!" [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [14:05:44] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1073.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:06:03] (03PS1) 10Andrew Bogott: Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099) [14:06:25] (03CR) 10Ssingh: [C:03+2] templates: lower TTLs for dyna.wm.org and upload.wm.org to 240s [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [14:06:32] (03PS2) 10Andrew Bogott: Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099) [14:06:35] (03PS2) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 240s [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) [14:07:36] (03CR) 10CI reject: [V:04-1] Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:08:18] (03PS1) 10Andrew Bogott: octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783) [14:08:27] 
(03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [14:08:43] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1071.eqiad.wmnet with OS bookworm [14:08:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826511 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm executed... [14:09:05] (03PS3) 10Andrew Bogott: Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099) [14:09:05] (03PS2) 10Andrew Bogott: octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783) [14:09:26] (03CR) 10CI reject: [V:04-1] octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [14:09:41] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:10:10] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:10:20] (03CR) 10CI reject: [V:04-1] octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [14:10:21] (03PS3) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 240s [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) [14:10:57] (03PS3) 10Andrew Bogott: octavia: refresh services if config changes 
[puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783) [14:11:28] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh) [14:12:40] (03CR) 10Andrew Bogott: [C:03+2] Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:12:42] (03CR) 10Andrew Bogott: [C:03+2] octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [14:12:48] !log sukhe@dns1004 START - running authdns-update [14:13:25] !log sukhe@dns1004 END - running authdns-update [14:13:31] !log finished running lowering of dyna/upload TTL to 240s: T394312 [14:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:36] T394312: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312 [14:17:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1073.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:17:59] 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10826585 (10elukey) 05Open→03Resolved a:03elukey I think that the purpose of this task is completed. We should follow up on the subtasks. 
[14:18:18] (03PS2) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) [14:18:47] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1073.eqiad.wmnet with OS bookworm [14:18:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826598 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm [14:21:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:21:59] (03PS1) 10Muehlenhoff: Bump the version numbers for Java images based on latest security releases [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146638 [14:22:00] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1071.eqiad.wmnet with OS bookworm [14:22:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm [14:24:32] 06SRE, 06Traffic, 13Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10826667 (10ssingh) [14:24:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - 
https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [14:24:44] (03PS3) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) [14:24:51] (03PS1) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) [14:25:47] (03PS5) 10Eevans: cassandra: configurable local_system_data_file_directory [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) [14:26:22] (03CR) 10CI reject: [V:04-1] Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman) [14:26:53] (03PS2) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) [14:26:59] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:30:30] (03PS3) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) [14:30:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:33:17] (03CR) 10Brouberol: [C:03+2] airflow: upggrade base image to include krenew [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146504 (https://phabricator.wikimedia.org/T394293) (owner: 10Brouberol) [14:33:53] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1073.eqiad.wmnet with reason: host reimage [14:34:08] (03CR) 10Eevans: [C:03+2] cassandra: configurable local_system_data_file_directory [puppet] - 
10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [14:37:04] (03PS4) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) [14:37:15] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1071.eqiad.wmnet with reason: host reimage [14:37:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:37:58] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1073.eqiad.wmnet with reason: host reimage [14:39:50] (03PS5) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) [14:39:50] stevemunene@cumin1002 reimage (PID 2253166) is awaiting input [14:39:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826758 (10VRiley-WMF) [14:40:01] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:40:36] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1071.eqiad.wmnet with reason: host reimage [14:42:08] (03PS1) 10DCausse: Revert "Make weighted tags no longer be WMF-specific" [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146643 [14:42:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826770 (10VRiley-WMF) [14:43:40] (03PS4) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) [14:45:09] (03CR) 10Jsn.sherman: "Thank you! I missed `euwiki` and also the whole `composer manage-dblist update` step." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman) [14:46:39] (03PS5) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) [14:47:20] (03PS6) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) [14:47:29] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:47:35] (03CR) 10Jsn.sherman: "...and I see fawiki was in fact enabled; nevermind!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman) [14:48:24] (03CR) 10CI reject: [V:04-1] Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:48:41] jouncebot: nowandnext [14:48:41] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [14:48:41] In 0 hour(s) and 11 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1500) [14:49:06] (03CR) 10CDobbins: "I just wanted to ask for additional clarification on this, since it's been a while and there's been no activity. 
While we could merge this" [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [14:49:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [14:50:35] (03PS7) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) [14:51:53] (03PS1) 10Eevans: cassandra_dev: actually put system keyspaces on RAID [puppet] - 10https://gerrit.wikimedia.org/r/1146646 (https://phabricator.wikimedia.org/T391544) [14:53:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:54:06] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [14:55:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10826849 (10klausman) [14:56:45] (03CR) 10Andrew Bogott: [C:03+2] Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [14:57:12] vriley@cumin1002 reimage (PID 2265179) is awaiting input [14:57:56] (03CR) 10Elukey: [C:03+1] Bump the version numbers for Java images based on latest security releases [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146638 (owner: 10Muehlenhoff) [14:58:00] !log vriley@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [14:58:01] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1073.eqiad.wmnet with OS bookworm [14:58:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm completed... [14:58:37] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [14:58:51] !log disable puppet on A:cp to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144620 (T393927) [14:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:54] T393927: Deploy geoip lookup script on 2 hosts - https://phabricator.wikimedia.org/T393927 [15:00:04] jnuche and jeena: Time to snap out of that daydream and deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1500). 
[15:00:35] (03CR) 10Michael Große: [C:03+1] Revert "Make weighted tags no longer be WMF-specific" [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146643 (owner: 10DCausse) [15:01:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [15:01:46] vriley@cumin1002 reimage (PID 2265364) is awaiting input [15:02:29] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [15:02:29] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1071.eqiad.wmnet with OS bookworm [15:02:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm completed... 
[15:02:49] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3073.esams.wmnet [15:02:57] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3081.esams.wmnet [15:03:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826876 (10VRiley-WMF) 05Open→03Resolved [15:03:48] (03CR) 10Fabfur: [C:03+2] cache: lua lookup experiment [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur) [15:04:37] jnuche and jeena: would you be able to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1146643 as part of train-log-triage? this change in the train broke a lot of things [15:05:25] See https://phabricator.wikimedia.org/T394416 and conversation in #wikimedia-search for context [15:06:06] * Lucas_WMDE is also around if needed [15:06:09] (03CR) 10Kamila Součková: [C:03+1] "Acknowledged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert) [15:06:13] MichaelG_WMF: I can backport it [15:06:35] jnuche: thank you! [15:06:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [15:06:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:07] (03CR) 10Clément Goubert: [C:03+2] "Thanks! 
I've created https://phabricator.wikimedia.org/T394423 for discussion of `startingDeadlineSeconds`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert) [15:08:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy1003 using scap backport" [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146643 (owner: 10DCausse) [15:08:53] MichaelG_WMF, jnuche thanks for backporting this! [15:09:47] (03Merged) 10jenkins-bot: Revert "Make weighted tags no longer be WMF-specific" [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146643 (owner: 10DCausse) [15:09:58] (03Merged) 10jenkins-bot: mediawiki: Add mwcron.suspended_jobs list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert) [15:10:04] !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1146643|Revert "Make weighted tags no longer be WMF-specific"]] [15:15:00] !log jnuche@deploy1003 dcausse, jnuche: Backport for [[gerrit:1146643|Revert "Make weighted tags no longer be WMF-specific"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:15:00] MichaelG_WMF, dcausse: the revert is on the test servers, is it possible for you to check there that the problem is gone? otherwise I'm fine with continuing the backport [15:15:01] jnuche: yes, it should be possible to check, one moment [15:15:03] I confirm it works on the test servers, at least for my use cases [15:15:11] jnuche: all good [15:15:14] ty [15:15:40] !log jnuche@deploy1003 dcausse, jnuche: Continuing with sync [15:15:42] * MichaelG_WMF jnuche: looks good from my side too! 
[15:38:45] jouncebot: nowandnext [15:38:45] For the next 0 hour(s) and 21 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1500) [15:38:45] In 0 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1600) [15:39:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [15:40:34] !log reenabling puppet on A:cp (T393927) [15:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:34] T393927: Deploy geoip lookup script on 2 hosts - https://phabricator.wikimedia.org/T393927 [15:41:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:03] (03PS1) 10Majavah: ssh: Do not shell out for root SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/1146661 (https://phabricator.wikimedia.org/T394283) [15:45:13] (03CR) 10Eevans: [C:03+2] cassandra_dev: actually put system keyspaces on RAID [puppet] - 10https://gerrit.wikimedia.org/r/1146646 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [15:45:13] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_esams - > [15:45:18] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_esams - > [15:48:33] !log dancy@deploy1003 Installing scap version "4.168.0" for 2 host(s) [15:48:49] (03CR) 
10BCornwall: [C:03+2] admin: SSH key rotation for cmassaro [puppet] - 10https://gerrit.wikimedia.org/r/1146033 (https://phabricator.wikimedia.org/T393140) (owner: 10BCornwall) [15:49:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [15:49:59] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [15:50:11] (03PS1) 10Cathal Mooney: Add EBGP between codfw row A-D spines and row E/F spines [homer/public] - 10https://gerrit.wikimedia.org/r/1146662 (https://phabricator.wikimedia.org/T394021) [15:50:21] !log dancy@deploy1003 Installation of scap version "4.168.0" completed for 2 hosts [15:50:51] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update SSH key for apine - https://phabricator.wikimedia.org/T393140#10827128 (10BCornwall) 05In progress→03Resolved Hi, @cmassaro! Your key has been rotated. Feel free to re-open if anything was missed. Thank you! 
[15:51:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[23] - https://phabricator.wikimedia.org/T393948#10827139 (10RobH) [15:52:41] RESOLVED: [2x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:53:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10827152 (10RobH) [15:53:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10827161 (10RobH) [15:54:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [15:55:11] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3081.esams.wmnet [15:55:16] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet [15:56:17] !log Starting patch deployment for T394393 [15:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:33] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5569/console" [puppet] - 10https://gerrit.wikimedia.org/r/1145949 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz) [15:58:51] 10ops-codfw, 06SRE, 06DC-Ops: lsw1-c6-codfw: PEM 0 Not Powered - https://phabricator.wikimedia.org/T394261#10827184 (10Jhancock.wm) 
05Open→03Resolved a:03Jhancock.wm [15:59:17] (03CR) 10Scott French: "No objections in principle, though this needs rebased to reflect I7e9c97537327a4de42a0d8013971beec4da6cb83 and may benefit from tuning the" [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:00:05] jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1600). [16:00:05] Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:49] Dreamy_Jazz: o/ just running PCC and then I can merge [16:00:53] Thanks! [16:00:55] doesn't look like you'll need to test anything? [16:01:35] The only thing I'll be able to test is that when it's been deployed we stop seeing LogicException errors being thrown on beta logstash. [16:01:45] 👍 [16:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:36] The logs I'll be checking is at https://beta-logs.wmcloud.org/goto/7dacc8512956b14f79255206bf05187e [16:03:01] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5570/console" [puppet] - 10https://gerrit.wikimedia.org/r/1145949 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz) [16:03:03] PCC noops on mwmaint (for the old-timey systemd job) and deploy (for the sparkly new k8s one), going ahead [16:03:12] (03CR) 10RLazarus: [V:03+1 C:03+2] MediaModeration: Only running scanning scripts on production [puppet] - 10https://gerrit.wikimedia.org/r/1145949 
(https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz) [16:03:19] (03PS1) 10SBassett: Update deployment image for security.wikimedia.org site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146665 (https://phabricator.wikimedia.org/T392098) [16:04:41] Dreamy_Jazz: merged and deployed to prod puppetmasters, I haven't touched anything in beta but feel free to take it from there [16:04:57] thank you for flying puppet request window, please ensure you have all your personal belongings and watch your step as you exit [16:04:59] (03PS2) 10SBassett: Update deployment image for security.wikimedia.org site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146665 (https://phabricator.wikimedia.org/T392098) [16:05:07] Is there anything I'd need to deploy on the beta wikis specifically? [16:05:15] Or is it an automatic thing? [16:05:22] (03PS1) 10Fabfur: Revert "cache: lua lookup experiment" [puppet] - 10https://gerrit.wikimedia.org/r/1146667 [16:05:26] (03CR) 10Scott French: [C:03+1] "Thank you both. Moving ahead with plumbing this in, since we'll presumably want to use it in some form anyway, but without using it quite " [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:05:48] (03PS2) 10Fabfur: Revert "cache: lua lookup experiment" [puppet] - 10https://gerrit.wikimedia.org/r/1146667 [16:06:23] it should happen automatically, I don't know the exact timing; in prod I would say wait 30 minutes max [16:06:33] Thanks! 
[16:06:47] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate growthexperiments-updateIsActiveFlagForMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146566 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [16:07:48] (03CR) 10Cwhite: [C:03+2] graphite: remove access to port 2003 tcp/udp [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi) [16:07:54] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [16:07:58] !log mszabo Deployed security patch for T394393 [16:08:14] (03PS1) 10Hnowlan: sre:api-gateway: bump alerting threshold for elevated error [alerts] - 10https://gerrit.wikimedia.org/r/1146668 [16:08:30] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate growthexperiments-refreshPraiseworthyMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146569 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [16:09:30] (03CR) 10CI reject: [V:04-1] sre:api-gateway: bump alerting threshold for elevated error [alerts] - 10https://gerrit.wikimedia.org/r/1146668 (owner: 10Hnowlan) [16:10:06] (03CR) 10Fabfur: [C:03+2] Revert "cache: lua lookup experiment" [puppet] - 10https://gerrit.wikimedia.org/r/1146667 (owner: 10Fabfur) [16:11:30] 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211-0-5] - https://phabricator.wikimedia.org/T394432#10827258 (10RobH) p:05Triage→03Medium [16:12:54] 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211-0-5] - https://phabricator.wikimedia.org/T394432#10827263 (10RobH) a:05RobH→03bking @RKemper or @bking: Can you advise which of these cirrusseach hosts would be most easily put into maint/offline... 
[16:12:56] 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211-0-5] - https://phabricator.wikimedia.org/T394432#10827265 (10RobH) [16:13:13] !log mszabo Deployed security patch for T394393 [16:13:17] 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10827266 (10RobH) [16:14:04] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3073.esams.wmnet [16:14:11] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3081.esams.wmnet [16:14:16] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146638 (owner: 10Muehlenhoff) [16:14:28] (03CR) 10SBassett: [C:03+2] Update deployment image for security.wikimedia.org site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146665 (https://phabricator.wikimedia.org/T392098) (owner: 10SBassett) [16:14:35] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10827267 (10RobH) I've created sub-task T392935 to track cirrussearch maint windows, likely should have just done that to start but was hoping one was just easily kicked offline for test... 
[16:16:10] (03CR) 10Cwhite: [C:03+2] logstash: create partition for ml logs [puppet] - 10https://gerrit.wikimedia.org/r/1145339 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [16:16:12] (03Merged) 10jenkins-bot: Update deployment image for security.wikimedia.org site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146665 (https://phabricator.wikimedia.org/T392098) (owner: 10SBassett) [16:16:32] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3081.esams.wmnet [16:16:37] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet [16:17:04] !log sbassett@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:17:25] (03PS8) 10Brouberol: airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999) [16:17:25] (03PS1) 10Brouberol: airflow: use the devenv.db.name in the PG URI instead of /app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146670 (https://phabricator.wikimedia.org/T393999) [16:17:25] (03PS1) 10Brouberol: airflow: rely on krenew instead of 'airflow kerberos' to renew the kerberos ticket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146671 (https://phabricator.wikimedia.org/T393999) [16:17:28] (03PS1) 10Brouberol: airflow: define an airflow-dev values file, containing the devenv default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146672 (https://phabricator.wikimedia.org/T393999) [16:17:31] (03PS1) 10Brouberol: airflow: don't define OAUTH-related configs in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146673 (https://phabricator.wikimedia.org/T393999) [16:17:32] (03PS1) 10Brouberol: airflow: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146674 (https://phabricator.wikimedia.org/T393999) [16:18:09] (03CR) 10Scott French: [C:03+1] 
"This indeed does what it says on the tin, so +1 in that regard. As discussed elsewhere, we'll want to wait until the `startingDeadlineSeco" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019) (owner: 10Clément Goubert) [16:19:51] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Data-Platform-SRE, 06Discovery-Search: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10827290 (10bd808) [16:20:29] 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Data-Platform-SRE, 06Discovery-Search: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10827295 (10bd808) >>! In T394430#10827113, @dancy wrote: > The failing... [16:26:20] Trying to do a helmfile -e staging -i apply --context 5 for miscweb but it seems to be hanging on research-landing-page. Should probably just ctrl+z? [16:27:13] !log sbassett@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:27:15] !log sbassett@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:27:26] heh, n/m [16:31:57] FIRING: HelmReleaseBadStatus: Helm release miscweb/design-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:34:25] (03PS1) 10Bvibber: Enable Chart extension on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518) [16:35:27] (03CR) 10Scott French: "Thanks, Dan! I think this looks good, aside from the missing bullseye update." 
[puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) (owner: 10Dduvall) [16:35:30] !log add bgp peerings from codfw row A-D switches to new spines in rows E/F T394021 [16:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:34] T394021: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021 [16:35:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518) (owner: 10Bvibber) [16:36:49] !log helmfile [staging] HALTED helmfile.d/services/miscweb: apply [16:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:48] (03PS1) 10Andrew Bogott: Dynamic proxy: pin python3-flask package [puppet] - 10https://gerrit.wikimedia.org/r/1146680 [16:38:54] (03CR) 10CI reject: [V:04-1] Dynamic proxy: pin python3-flask package [puppet] - 10https://gerrit.wikimedia.org/r/1146680 (owner: 10Andrew Bogott) [16:39:17] herron@cumin1002 roll-restart-reboot-brokers (PID 2287058) is awaiting input [16:39:32] (03PS2) 10Andrew Bogott: Dynamic proxy: pin python3-flask package [puppet] - 10https://gerrit.wikimedia.org/r/1146680 [16:40:14] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [16:40:39] FIRING: CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (10.192.253.193) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-a1-codfw:9804&var-bgp_group=core&var-bgp_neighbor=ssw1-e1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:41:40] FIRING: 
[3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:44:17] (03CR) 10Andrew Bogott: [C:03+2] Dynamic proxy: pin python3-flask package [puppet] - 10https://gerrit.wikimedia.org/r/1146680 (owner: 10Andrew Bogott) [16:45:39] RESOLVED: CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (10.192.253.193) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-a1-codfw:9804&var-bgp_group=core&var-bgp_neighbor=ssw1-e1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:46:57] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:49] (03PS3) 10Dduvall: aptrepo: Provide thirdparty/docker component with upstream packages [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) [16:49:46] (03CR) 10Dduvall: "Thanks for the review, Scott!" [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) (owner: 10Dduvall) [16:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [17:00:04] bd808: It is that lovely time of the day again! You are hereby commanded to deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1700) [17:02:16] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-05-15-122256-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146686 [17:02:51] (03PS2) 10Hnowlan: sre:api-gateway: bump alerting threshold for elevated error [alerts] - 10https://gerrit.wikimedia.org/r/1146668 [17:03:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:23] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-05-15-122256-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146686 (owner: 10BryanDavis) [17:04:57] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [17:05:50] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-05-15-122256-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146686 (owner: 10BryanDavis) [17:08:59] (03PS2) 10Cathal Mooney: Add EBGP between codfw row A-D spines and row E/F spines [homer/public] - 10https://gerrit.wikimedia.org/r/1146662 (https://phabricator.wikimedia.org/T394021) [17:09:01] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:10:29] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:10:41] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:11:15] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:11:28] !log bd808@deploy1003 helmfile [codfw] 
START helmfile.d/services/developer-portal: apply [17:12:01] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:12:08] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:12:40] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:15:15] developer portal looks good. There were some helm changes from T391333 that rode along with the container update I was intending to push. [17:15:16] T391333: Revisit default envoy histogram buckets - https://phabricator.wikimedia.org/T391333 [17:15:43] * bd808 is done with deploying during this window. [17:20:32] (03CR) 10Cathal Mooney: [C:03+2] Add EBGP between codfw row A-D spines and row E/F spines [homer/public] - 10https://gerrit.wikimedia.org/r/1146662 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [17:21:04] (03Merged) 10jenkins-bot: Add EBGP between codfw row A-D spines and row E/F spines [homer/public] - 10https://gerrit.wikimedia.org/r/1146662 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [17:22:16] (03CR) 10Dwisehaupt: [C:03+1] "This changes looks ok from a notification standpoint. 
There is a concern about sending alerts to us that are not directly actionable by us" [puppet] - 10https://gerrit.wikimedia.org/r/1145169 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [17:23:16] !log add remaining bgp peerings from codfw row A-D switches to new spines in rows E/F T394021 [17:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:22] T394021: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021 [17:24:25] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [17:32:50] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:41:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [17:43:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:46:35] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [17:46:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - 
https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [17:50:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:55:15] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:37] (03PS1) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) [17:59:16] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [17:59:40] (03CR) 10CI reject: [V:04-1] Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [18:00:05] jnuche and jeena: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1800). 
[18:00:47] (03PS2) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) [18:01:46] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [18:03:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:07:19] (03PS2) 10Brouberol: airflow: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146674 (https://phabricator.wikimedia.org/T393999) [18:07:20] (03PS1) 10Brouberol: airflow: include an ENVOY_SERVICE_NAME env var pointing to the envoy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146693 (https://phabricator.wikimedia.org/T393999) [18:10:12] (03PS3) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) [18:12:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [18:14:34] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1146027 (https://phabricator.wikimedia.org/T394308) (owner: 10BCornwall) [18:14:43] (03PS4) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) [18:14:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [18:19:40] (03CR) 10BCornwall: [C:03+2] admin: Add jtweed to 
analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1146027 (https://phabricator.wikimedia.org/T394308) (owner: 10BCornwall) [18:20:54] (03PS5) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) [18:21:04] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [18:21:07] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Jonathan Tweed - https://phabricator.wikimedia.org/T394308#10827849 (10BCornwall) 05In progress→03Resolved This access has been granted. It'll be up to an hour before it will be... [18:23:15] (03PS6) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) [18:23:22] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [18:25:22] (03PS1) 10TChin: [eventgate-analytics-external] bump version v1.13.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146695 (https://phabricator.wikimedia.org/T391959) [18:25:41] (03PS7) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) [18:25:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [18:27:41] (03CR) 10Dr0ptp4kt: [C:03+2] [eventgate-analytics-external] bump version v1.13.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146695 (https://phabricator.wikimedia.org/T391959) (owner: 10TChin) [18:28:09] (03PS8) 10Andrew Bogott: Octavia: new attempt at 
health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) [18:29:02] (03Merged) 10jenkins-bot: [eventgate-analytics-external] bump version v1.13.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146695 (https://phabricator.wikimedia.org/T391959) (owner: 10TChin) [18:32:27] (03CR) 10Andrew Bogott: [C:03+2] Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott) [18:33:53] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10827898 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03WMDECyn [18:34:27] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10827905 (10BCornwall) L3/NDA is indeed valid, but the approval needs to happen still. @WMDECyn, Can you please comment here with your approva... 
[18:34:51] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [18:35:14] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [18:36:13] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [18:36:14] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10827909 (10BCornwall) [18:36:55] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [18:40:54] (03PS2) 10Bvibber: Enable Chart extension on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518) [18:40:59] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [18:41:43] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [18:49:11] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_esams - > [18:53:35] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_esams - > [18:55:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10827952 (10Jclark-ctr) @MatthewVernon The BOSS card did not appear in the boot order initially. Under NVMe settings, I changed the BIOS NVMe Driver setting to "All Drives" inst... 
[18:55:52] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqsin - > [18:55:53] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqsin - > [18:58:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:00:35] (03PS1) 10Eevans: cassandra: create storage directory for local keyspaces [puppet] - 10https://gerrit.wikimedia.org/r/1146705 (https://phabricator.wikimedia.org/T391544) [19:01:30] (03PS2) 10Eevans: cassandra: create storage directory for local keyspaces [puppet] - 10https://gerrit.wikimedia.org/r/1146705 (https://phabricator.wikimedia.org/T391544) [19:03:07] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146705 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [19:03:15] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:06:28] !log dancy@deploy1003 Installing scap version "4.168.1" for 2 host(s) [19:06:30] (03CR) 10Eevans: [C:03+2] cassandra: create storage directory for local keyspaces [puppet] - 10https://gerrit.wikimedia.org/r/1146705 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [19:08:15] !log dancy@deploy1003 Installation of scap version "4.168.1" completed for 2 hosts [19:11:56] (03PS1) 10Andrew Bogott: Octavia health checks: open firewall to UDP [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099) [19:12:24] (03PS2) 10Andrew 
Bogott: Octavia health checks: open firewall to UDP [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099) [19:13:09] (03PS3) 10Andrew Bogott: Octavia health checks: open firewall to UDP [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099) [19:13:19] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [19:13:26] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10827984 (10BCornwall) 05Open→03Resolved I'm not seeing any errors in the kernel log, anomalies in the graphs, or outputs in `getsel`. I'll go ahead and resolve this. Thanks... [19:16:36] (03CR) 10Andrew Bogott: [C:03+2] Octavia health checks: open firewall to UDP [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [19:18:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10827991 (10BCornwall) [19:24:00] (03PS3) 10LD: frwiki: Enable the NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) [19:25:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [19:28:31] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828031 (10Eevans) cassandra-dev2001 
has been reimaged and configured for JBOD. I used the following script to setup the addit... [19:29:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:30:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [19:31:39] (03CR) 10Pppery: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: 10LD) [19:32:12] (03CR) 10Pppery: frwiki: Enable the NewUserMessage extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: 10LD) [19:34:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:38:20] (03PS4) 10LD: frwiki: Enable the NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) [19:46:16] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:50:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: 10LD) [19:51:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:52:42] Jenkins might need to recheck 1146707. [19:54:03] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye [19:54:14] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828103 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host... [19:54:15] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 3 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10828104 (10ArthurPSmith) Confirming this works for me now - https://www.wikidata.... 
[19:54:39] (03CR) 10AntiCompositeNumber: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: 10LD) [19:55:32] thanks AntiComposite ;) [19:55:35] np [19:59:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T2000). [20:00:05] danisztls, bvibber, and LD: A patch you scheduled for UTC late backport window / Backport Party!Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] o/ [20:00:23] ohai [20:00:38] hi \O/ [20:00:56] o/ [20:00:59] we are deployment partying :) [20:01:57] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:02:24] looks like we're missing a danisztls bvibber you up for spiderpigging your change? 
[20:02:33] sure :D [20:02:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518) (owner: 10Bvibber) [20:02:58] * thcipriani watches :) [20:03:17] i love how there's a link right from the deployment schedule to spiderpig :D [20:03:23] (03CR) 10Dreamy Jazz: "(I'm guessing this needs updating given that it depends on an abandoned patch)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [20:03:46] (03Merged) 10jenkins-bot: Enable Chart extension on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518) (owner: 10Bvibber) [20:04:02] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1146679|Enable Chart extension on phase 2 wikis (T393518)]] [20:04:06] T393518: Enable Charts for Phase 2 wikis - https://phabricator.wikimedia.org/T393518 [20:04:15] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:05:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:05:45] bvibber: blame bd808 and dancy for the link, soon: deploying a bunch together! [20:07:56] btw bvibber I've heard that fr wiktionary was interested in having Chart extension. Do you think it could be ok? 
if so I'll open a ticket later ;) [20:08:25] sure open a ticket and you can jump the line :) [20:08:37] non-wikipedias will be phase 4 rollout [20:08:45] or if someone asks [20:09:48] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage [20:09:56] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1146679|Enable Chart extension on phase 2 wikis (T393518)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:09:58] testing... [20:09:59] T393518: Enable Charts for Phase 2 wikis - https://phabricator.wikimedia.org/T393518 [20:10:15] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:16] phase 4 is TBD :') [20:10:33] looks good [20:10:39] !log bvibber@deploy1003 bvibber: Continuing with sync [20:13:11] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage [20:17:18] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146679|Enable Chart extension on phase 2 wikis (T393518)]] (duration: 13m 15s) [20:17:21] T393518: Enable Charts for Phase 2 wikis - https://phabricator.wikimedia.org/T393518 [20:17:28] finished! [20:17:45] shall i do the other two or someone else want to take those? 
[20:17:58] awesome thanks bvibber I can take others [20:18:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:18:24] ok :D [20:18:38] danisztls: I think I saw you enter chat, ready for your patch? [20:19:02] thcipriani: yes [20:19:12] cool, I'll get that going. [20:20:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146599 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [20:21:18] (03Merged) 10jenkins-bot: Design Research participant recruitment survey on eswiki: Deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146599 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza) [20:21:32] !log thcipriani@deploy1003 Started scap sync-world: Backport for [[gerrit:1146599|Design Research participant recruitment survey on eswiki: Deploy (T394315)]] [20:21:36] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [20:22:40] (03PS1) 10Eevans: cassandra-dev2002: configure for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1146723 (https://phabricator.wikimedia.org/T391544) [20:23:16] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:23:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:24:09] (03CR) 10Eevans: [C:03+2] cassandra-dev2002: configure for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1146723 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [20:25:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:26:36] thcipriani: as my patch only increases the survey coverage I don't see a practicable way to test it [20:26:59] danisztls: ack, I'll send it on once it prompts me [20:27:04] !log thcipriani@deploy1003 thcipriani, dani: Backport for [[gerrit:1146599|Design Research participant recruitment survey on eswiki: Deploy (T394315)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:27:07] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [20:27:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:28:37] !log thcipriani@deploy1003 thcipriani, dani: Continuing with sync [20:29:01] ^ danisztls going live everywhere now [20:30:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:31:57] FIRING: HelmReleaseBadStatus: Helm release miscweb/design-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:32:08] (03PS1) 10Greg Grossmeier: admin: update gjg's production ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1146725 [20:32:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:33:37] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS bullseye [20:33:44] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - 
https://phabricator.wikimedia.org/T391544#10828175 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cass... [20:35:18] !log thcipriani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146599|Design Research participant recruitment survey on eswiki: Deploy (T394315)]] (duration: 13m 46s) [20:35:23] T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315 [20:35:59] LD: you're up! [20:36:07] lets go :p [20:36:47] as the previous patch, it can't really be tested, thats config stuff [20:37:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:37:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: 10LD) [20:38:39] (03Merged) 10jenkins-bot: frwiki: Enable the NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: 10LD) [20:38:51] !log thcipriani@deploy1003 Started scap sync-world: Backport for [[gerrit:1146707|frwiki: Enable the NewUserMessage extension (T382199)]] [20:38:55] T382199: Enable Extension NewUserMessage on fr.wikipedia - https://phabricator.wikimedia.org/T382199 [20:39:16] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:39:16] thanks for the party! 
[20:40:17] LD: there's no party like a deployment party :) [20:41:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:42:09] I thought a s club 7 party, was the superior party? [20:44:05] (03PS1) 10Jdrewniak: styles: Set override also to former value of `line-height-small` token [skins/Vector] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146726 (https://phabricator.wikimedia.org/T389900) [20:44:16] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:44:45] !log thcipriani@deploy1003 thcipriani, wpld: Backport for [[gerrit:1146707|frwiki: Enable the NewUserMessage extension (T382199)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:44:49] T382199: Enable Extension NewUserMessage on fr.wikipedia - https://phabricator.wikimedia.org/T382199 [20:45:44] p858snake|cloud: lies [20:45:46] :) [20:46:21] LD: your change is up on test wikis, I can confirm using WikimediaDebug that I now see the extension in Special:Version now [20:46:27] anything else to test? 
[20:46:38] not really :') [20:46:53] okie doke, going live for realz [20:47:02] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:47:02] !log thcipriani@deploy1003 thcipriani, wpld: Continuing with sync [20:49:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:52:08] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:52:17] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10828215 (10RobH) [20:53:36] !log thcipriani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146707|frwiki: Enable the NewUserMessage extension (T382199)]] (duration: 14m 44s) [20:53:40] T382199: Enable Extension NewUserMessage on fr.wikipedia - https://phabricator.wikimedia.org/T382199 [20:53:47] ^ LD all done! [20:54:06] thanks again for the party :) [20:54:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:16] thanks for attending! 
[20:55:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:58:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1003 using scap backport" [skins/Vector] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146726 (https://phabricator.wikimedia.org/T389900) (owner: 10Jdrewniak) [20:58:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [21:00:03] (03Merged) 10jenkins-bot: styles: Set override also to former value of `line-height-small` token [skins/Vector] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146726 (https://phabricator.wikimedia.org/T389900) (owner: 10Jdrewniak) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T2100) [21:00:15] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:32] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1146726|styles: Set override also to former value of `line-height-small` token (T389900 T394305)]] [21:00:36] T389900: Font modes: Resolve line-height token discrepancies downstream - 
https://phabricator.wikimedia.org/T389900 [21:00:36] T394305: 1.45.0-wmf.1: When setting font size to "small", line-height is absolute, making lines with larger font-size cramped - https://phabricator.wikimedia.org/T394305 [21:03:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:04:04] (03PS1) 10Cwhite: logstash: nest curator configuration to support multiple jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146728 (https://phabricator.wikimedia.org/T377018) [21:04:19] (03CR) 10BryanDavis: [C:03+1] Do not show thumbnails or descriptions on Wikitech search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146491 (owner: 10Majavah) [21:06:09] !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1146726|styles: Set override also to former value of `line-height-small` token (T389900 T394305)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:06:13] T389900: Font modes: Resolve line-height token discrepancies downstream - https://phabricator.wikimedia.org/T389900 [21:06:13] T394305: 1.45.0-wmf.1: When setting font size to "small", line-height is absolute, making lines with larger font-size cramped - https://phabricator.wikimedia.org/T394305 [21:06:46] (03PS2) 10Cwhite: logstash: nest curator configuration to support multiple jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146728 (https://phabricator.wikimedia.org/T377018) [21:07:55] (03PS1) 10Clare Ming: Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146729 [21:08:15] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:44] (03CR) 10Dr0ptp4kt: [C:03+2] Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146729 (owner: 10Clare Ming) [21:10:47] (03PS1) 10Eevans: cassandra-dev2002: use custom d-i preseed (JBOD) [puppet] - 10https://gerrit.wikimedia.org/r/1146730 (https://phabricator.wikimedia.org/T391544) [21:11:05] (03Merged) 10jenkins-bot: Experimentation Lab: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146729 (owner: 10Clare Ming) [21:12:40] !log jdrewniak@deploy1003 jdrewniak: Continuing with sync [21:12:59] (03CR) 10Cwhite: [C:03+2] "PCC OK: no changes to host https://puppet-compiler.wmflabs.org/output/1146728/5571/" [puppet] - 
10https://gerrit.wikimedia.org/r/1146728 (https://phabricator.wikimedia.org/T377018) (owner: 10Cwhite) [21:13:55] (03CR) 10Eevans: [C:03+2] cassandra-dev2002: use custom d-i preseed (JBOD) [puppet] - 10https://gerrit.wikimedia.org/r/1146730 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [21:14:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:16:07] (03PS1) 10Clare Ming: Experimentation Lab: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146731 [21:16:14] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye [21:16:27] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828298 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host cassandra-dev2002.... 
[21:18:09] (03CR) 10LD: frwiki: Enable the NewUserMessage extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: 10LD) [21:18:36] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [21:19:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:19:18] !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146726|styles: Set override also to former value of `line-height-small` token (T389900 T394305)]] (duration: 18m 45s) [21:19:22] T389900: Font modes: Resolve line-height token discrepancies downstream - https://phabricator.wikimedia.org/T389900 [21:19:22] T394305: 1.45.0-wmf.1: When setting font size to "small", line-height is absolute, making lines with larger font-size cramped - https://phabricator.wikimedia.org/T394305 [21:20:47] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [21:21:36] FIRING: GatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mobileapps_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [21:23:13] o/ [21:23:25] !incidents [21:23:26] 6128 (UNACKED) GatewayBackendErrorsHigh sre (mobileapps_cluster rest-gateway eqiad) [21:23:26] 6124 (RESOLVED) Host db1187 (paged) [21:23:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - 
https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[21:24:14] I need to leave for an appointment, but this might be a natural consequence of the shift from restbase -> PCS to rest-gateway -> PCS
[21:24:47] i.e. the pre-existing 5xxs moved from restbase being the client to rest-gateway, and thus are subject to this alert
[21:25:07] I was chatting with h.nowlan earlier today about this
[21:25:10] hmm interesting
[21:25:24] so are alerting thresholds may need adjustment?
[21:25:31] *our
[21:25:47] yes, and Hugh was already considering doing that for the non-paging variant of the alert that's been firing intermittently
[21:26:03] this is the first time it's got above the paging threshold, though
[21:26:36] RESOLVED: GatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mobileapps_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[21:26:59] okay, I suppose I'll just leave things as is for now then
[21:27:22] since it's self-resolving, then yeah - that sounds good
[21:27:41] if we see more of these transient blips, we might want to silence until Hugh can take a look Friday
[21:27:56] ref: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1146668 is the patch for the non-paging alert
[21:28:05] * swfrench-wmf out
[21:31:41] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[21:32:39] !log dancy@deploy1003 Installing scap version "4.169.0" for 2 host(s)
[21:32:50] FIRING:
PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:34:26] !log dancy@deploy1003 Installation of scap version "4.169.0" completed for 2 hosts [21:35:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage [21:40:00] !log brett@cumin2002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-upload_eqsin - > [21:44:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:47:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:49:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:50:36] FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [21:51:04] (03PS1) 10Cwhite: logstash: add forcemerge job [puppet] - 
10https://gerrit.wikimedia.org/r/1146736 (https://phabricator.wikimedia.org/T377018) [21:51:05] (03PS1) 10Cwhite: logstash: add job schedule parameter [puppet] - 10https://gerrit.wikimedia.org/r/1146737 (https://phabricator.wikimedia.org/T377018) [21:55:33] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp503[1-2].eqsin.wmnet} and A:cp - > [21:57:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:59:16] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:00:42] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS bullseye [22:00:49] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828360 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cass... 
[22:02:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:28] (03PS1) 10BCornwall: cdn: Fix args reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 [22:05:36] RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [22:07:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:09:39] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:10:20] (03CR) 10CI reject: [V:04-1] cdn: Fix args reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 (owner: 10BCornwall) [22:11:07] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_eqsin - > [22:12:15] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes 
(http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:14:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:20:34] (03CR) 10Dr0ptp4kt: [C:03+2] Experimentation Lab: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146731 (owner: 10Clare Ming) [22:21:53] (03Merged) 10jenkins-bot: Experimentation Lab: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146731 (owner: 10Clare Ming) [22:23:39] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [22:27:00] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [22:27:52] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp503[1-2].eqsin.wmnet} and A:cp - > [22:38:42] FIRING: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:43:43] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10828422 (10thcipriani) >>! In T393723#10813970, @Jdlrobson-WMF wrote: >> @Jdlrobson-WMF this seems like an odd question after all this time, but have you signed L3 Acknowledgement of Wikimed... 
[23:03:40] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10828450 (10Jhancock.wm) a:03Jhancock.wm [23:04:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10828452 (10Jhancock.wm) a:03Jhancock.wm [23:31:39] (03PS1) 10Andrew Bogott: Octavia health manager: listen on [puppet] - 10https://gerrit.wikimedia.org/r/1146770 (https://phabricator.wikimedia.org/T394099) [23:31:44] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146770 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [23:33:54] (03PS2) 10Andrew Bogott: Octavia health manager: listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/1146770 (https://phabricator.wikimedia.org/T394099) [23:35:08] (03CR) 10Andrew Bogott: [C:03+2] Octavia health manager: listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/1146770 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott) [23:38:56] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1146777 [23:38:57] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1146777 (owner: 10TrainBranchBot) [23:49:53] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1146777 (owner: 10TrainBranchBot) [23:59:07] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye [23:59:16] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828513 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host...