[00:01:57] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:24] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:27] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:06:41] <jinxer-wm>	 FIRING: [13x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:08:42] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1146107
[00:08:42] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1146107 (owner: TrainBranchBot)
[00:10:43] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:41] <jinxer-wm>	 FIRING: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:23:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10824694 (Jhancock.wm)
[00:24:41] <wikibugs>	 (CR) Cwhite: [C:+1] "I'll roll this out tomorrow." [puppet] - https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: Filippo Giunchedi)
[00:29:43] <icinga-wm>	 RECOVERY - Disk space on arclamp1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops
[00:31:25] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1146107 (owner: TrainBranchBot)
[00:41:40] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:46:57] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:52:08] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[00:53:15] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:53:15] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:54:15] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/2bb59e518169dc32b3a7791729a47586865fb87b42b3ddd914701d94b9555aef/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:07:33] <icinga-wm>	 RECOVERY - Disk space on arclamp2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops
[01:08:08] <cwhite>	 !log clear up some space on arclamp2001 to allow arclamp_compress_logs to complete
[01:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:15] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:32:38] <wikibugs>	 (PS1) Andrew Bogott: Octavia: change hiera port to 9876 [puppet] - https://gerrit.wikimedia.org/r/1146117 (https://phabricator.wikimedia.org/T393783)
[01:32:39] <wikibugs>	 (PS1) Andrew Bogott: cloudlb: add octavia endpoint in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1146118 (https://phabricator.wikimedia.org/T393783)
[01:32:45] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146118 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[01:32:50] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:36:24] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Octavia: change hiera port to 9876 [puppet] - https://gerrit.wikimedia.org/r/1146117 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[01:36:26] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudlb: add octavia endpoint in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1146118 (https://phabricator.wikimedia.org/T393783) (owner: Andrew Bogott)
[01:36:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[01:41:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[01:50:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[01:55:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[02:21:53] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 193878288 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:22:53] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 48192 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:46:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:51:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[02:56:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:00:43] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 1.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:01:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:07:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[03:12:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[03:16:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:21:45] <jinxer-wm>	 RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[03:49:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[03:54:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[04:01:57] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:11:41] <jinxer-wm>	 FIRING: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:41:40] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:46:57] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:52:08] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:53:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1256 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76157 and previous config saved to /var/cache/conftool/dbconfig/20250515-045345-ladsgroup.json
[04:53:49] <stashbot>	 T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820
[04:56:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1192 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76158 and previous config saved to /var/cache/conftool/dbconfig/20250515-045631-ladsgroup.json
[04:56:52] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet,es1043.eqiad.wmnet with reason: Maintenance
[04:56:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1043 es2041 T391921', diff saved to https://phabricator.wikimedia.org/P76159 and previous config saved to /var/cache/conftool/dbconfig/20250515-045658-marostegui.json
[04:57:01] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[04:57:45] <wikibugs>	 (PS1) Marostegui: es1043: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146171 (https://phabricator.wikimedia.org/T391921)
[04:59:50] <wikibugs>	 (CR) Marostegui: [C:+2] es1043: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146171 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:01:52] <wikibugs>	 (PS1) Marostegui: es2041: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146174 (https://phabricator.wikimedia.org/T391921)
[05:03:00] <wikibugs>	 (CR) Marostegui: [C:+2] es2041: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146174 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:06:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76160 and previous config saved to /var/cache/conftool/dbconfig/20250515-050607-root.json
[05:06:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76161 and previous config saved to /var/cache/conftool/dbconfig/20250515-050620-root.json
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc7 T394260', diff saved to https://phabricator.wikimedia.org/P76162 and previous config saved to /var/cache/conftool/dbconfig/20250515-050724-marostegui.json
[05:07:27] <stashbot>	 T394260: Productionize pc8 - https://phabricator.wikimedia.org/T394260
[05:08:07] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc1017.eqiad.wmnet with reason: Maintenance
[05:08:25] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc2017.codfw.wmnet with reason: Maintenance
[05:10:38] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance
[05:12:24] <wikibugs>	 (PS1) Marostegui: dbconfig.schema: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146175 (https://phabricator.wikimedia.org/T394260)
[05:15:31] <wikibugs>	 (CR) Marostegui: [C:+2] dbconfig.schema: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146175 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:20:57] <wikibugs>	 (PS1) Marostegui: valid_section.pp: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146176 (https://phabricator.wikimedia.org/T394260)
[05:21:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76163 and previous config saved to /var/cache/conftool/dbconfig/20250515-052113-root.json
[05:21:24] <wikibugs>	 (CR) Ladsgroup: [C:+1] valid_section.pp: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146176 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:21:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76164 and previous config saved to /var/cache/conftool/dbconfig/20250515-052126-root.json
[05:25:22] <wikibugs>	 (CR) Marostegui: [C:+2] valid_section.pp: Add pc8 [puppet] - https://gerrit.wikimedia.org/r/1146176 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:28:36] <wikibugs>	 (PS1) Marostegui: pc1018: Productionize [puppet] - https://gerrit.wikimedia.org/r/1146184 (https://phabricator.wikimedia.org/T394260)
[05:29:07] <wikibugs>	 (CR) Ladsgroup: [C:+1] "<3" [puppet] - https://gerrit.wikimedia.org/r/1146184 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:29:23] <wikibugs>	 (PS2) Marostegui: pc1018: Productionize [puppet] - https://gerrit.wikimedia.org/r/1146184 (https://phabricator.wikimedia.org/T394260)
[05:30:40] <wikibugs>	 (CR) Marostegui: [C:+2] pc1018: Productionize [puppet] - https://gerrit.wikimedia.org/r/1146184 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[05:32:50] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:36:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76165 and previous config saved to /var/cache/conftool/dbconfig/20250515-053618-root.json
[05:36:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76166 and previous config saved to /var/cache/conftool/dbconfig/20250515-053631-root.json
[05:39:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1042 and es2042 to es4 masters T391921', diff saved to https://phabricator.wikimedia.org/P76167 and previous config saved to /var/cache/conftool/dbconfig/20250515-053958-marostegui.json
[05:40:02] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[05:41:05] <wikibugs>	 (PS1) Marostegui: wmnet: Update es4-master [dns] - https://gerrit.wikimedia.org/r/1146190 (https://phabricator.wikimedia.org/T391921)
[05:41:21] <wikibugs>	 (CR) Marostegui: "This is a noop" [dns] - https://gerrit.wikimedia.org/r/1146190 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:41:44] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update es4-master [dns] - https://gerrit.wikimedia.org/r/1146190 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[05:41:47] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[05:43:02] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[05:50:20] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10824887 (Ladsgroup)
[05:51:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76168 and previous config saved to /var/cache/conftool/dbconfig/20250515-055124-root.json
[05:51:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76169 and previous config saved to /var/cache/conftool/dbconfig/20250515-055137-root.json
[05:53:39] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1138905 (https://phabricator.wikimedia.org/T392629) (owner: JHathaway)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0600)
[06:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:06:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76170 and previous config saved to /var/cache/conftool/dbconfig/20250515-060629-root.json
[06:06:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76171 and previous config saved to /var/cache/conftool/dbconfig/20250515-060643-root.json
[06:16:56] <icinga-wm>	 PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:19:02] <icinga-wm>	 RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:21:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76172 and previous config saved to /var/cache/conftool/dbconfig/20250515-062135-root.json
[06:21:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76173 and previous config saved to /var/cache/conftool/dbconfig/20250515-062149-root.json
[06:23:56] <icinga-wm>	 PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:24:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:25:02] <icinga-wm>	 RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:34:02] <kart_>	 Deploying cxserver..
[06:34:42] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-05-14-005542-production [deployment-charts] - https://gerrit.wikimedia.org/r/1145456 (https://phabricator.wikimedia.org/T394008) (owner: KartikMistry)
[06:36:14] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-05-14-005542-production [deployment-charts] - https://gerrit.wikimedia.org/r/1145456 (https://phabricator.wikimedia.org/T394008) (owner: KartikMistry)
[06:36:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76174 and previous config saved to /var/cache/conftool/dbconfig/20250515-063641-root.json
[06:36:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76175 and previous config saved to /var/cache/conftool/dbconfig/20250515-063655-root.json
[06:38:14] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:38:36] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:40:30] <wikibugs>	 (CR) JMeybohm: "Deploying and testing should be possible without service catalog entry. So usually the entry is created the way the service is supposed to" [puppet] - https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[06:43:21] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:43:53] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:46:06] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:46:38] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:47:10] <wikibugs>	 (PS1) Muehlenhoff: Bitu: When approving a permission request mention the need for re-login [software/bitu] - https://gerrit.wikimedia.org/r/1146446 (https://phabricator.wikimedia.org/T393724)
[06:47:15] <wikibugs>	 (CR) JMeybohm: [C:+2] Reapply "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - https://gerrit.wikimedia.org/r/1145857 (owner: JMeybohm)
[06:49:57] <kart_>	 !log Updated cxserver to 2025-05-14-005542-production (T394008, T392499)
[06:50:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:01] <stashbot>	 T394008: CXServer doesn't support section suggestions for "be-tarask" language code - https://phabricator.wikimedia.org/T394008
[06:50:01] <stashbot>	 T392499: Post-creation work for rkiwiki - https://phabricator.wikimedia.org/T392499
[06:50:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1045 es2045 T391921', diff saved to https://phabricator.wikimedia.org/P76176 and previous config saved to /var/cache/conftool/dbconfig/20250515-065039-marostegui.json
[06:50:43] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[06:51:07] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2045.codfw.wmnet,es1045.eqiad.wmnet with reason: Maintenance
[06:51:22] <wikibugs>	 (PS1) Marostegui: es1045: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146447 (https://phabricator.wikimedia.org/T391921)
[06:51:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76177 and previous config saved to /var/cache/conftool/dbconfig/20250515-065147-root.json
[06:52:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76178 and previous config saved to /var/cache/conftool/dbconfig/20250515-065200-root.json
[06:52:51] <wikibugs>	 (CR) Marostegui: [C:+2] es1045: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146447 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:55:06] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: enable vk monitoring in magru to actually remove it [puppet] - https://gerrit.wikimedia.org/r/1146021 (https://phabricator.wikimedia.org/T393772) (owner: Fabfur)
[06:55:09] <wikibugs>	 (CR) Slyngshede: [C:+1] admin: SSH key rotation for cmassaro [puppet] - https://gerrit.wikimedia.org/r/1146033 (https://phabricator.wikimedia.org/T393140) (owner: BCornwall)
[06:56:12] <wikibugs>	 (PS1) Marostegui: es2045: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146448 (https://phabricator.wikimedia.org/T391921)
[06:56:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76179 and previous config saved to /var/cache/conftool/dbconfig/20250515-065613-root.json
[06:57:26] <wikibugs>	 (CR) Marostegui: [C:+2] es2045: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146448 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:57:37] <wikibugs>	 (CR) Slyngshede: [C:+1] "Looks good." [software/bitu] - https://gerrit.wikimedia.org/r/1146446 (https://phabricator.wikimedia.org/T393724) (owner: Muehlenhoff)
[06:59:24] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0700).
[07:00:05] <jouncebot>	 MichaelG_WMF: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:03:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[07:04:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76180 and previous config saved to /var/cache/conftool/dbconfig/20250515-070433-root.json
[07:05:30] <wikibugs>	 (Merged) jenkins-bot: Reapply "Update admin_ng fixtures to reflect puppet changes" [deployment-charts] - https://gerrit.wikimedia.org/r/1145857 (owner: JMeybohm)
[07:05:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Bitu: When approving a permission request mention the need for re-login [software/bitu] - https://gerrit.wikimedia.org/r/1146446 (https://phabricator.wikimedia.org/T393724) (owner: Muehlenhoff)
[07:06:30] <godog>	 !log add 70G to arclamp /srv
[07:06:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76181 and previous config saved to /var/cache/conftool/dbconfig/20250515-070653-root.json
[07:07:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76182 and previous config saved to /var/cache/conftool/dbconfig/20250515-070706-root.json
[07:07:48] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for esanders - https://phabricator.wikimedia.org/T393724#10825034 (MoritzMuehlenhoff) >>! In T393724#10823734, @thcipriani wrote: >>>! In T393724#10823444, @Esanders wrote: >> |cn |[Esanders] >> |mail |[esanders@wikimedia...
[07:11:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76183 and previous config saved to /var/cache/conftool/dbconfig/20250515-071119-root.json
[07:13:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[07:16:17] <MichaelG_WMF>	 hi
[07:16:33] <MichaelG_WMF>	 sorry for being late - network issues...
[07:17:01] <MichaelG_WMF>	 jouncebot: nowandnext
[07:17:01] <jouncebot>	 For the next 0 hour(s) and 42 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0700)
[07:17:01] <jouncebot>	 In 0 hour(s) and 42 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0800)
[07:18:32] <moritzm>	 !log installing nginx security updates
[07:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:35] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] python-webapp: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145226 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:19:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76184 and previous config saved to /var/cache/conftool/dbconfig/20250515-071939-root.json
[07:24:35] <wikibugs>	 (CR) Brouberol: [C:+1] Revert "hdfs: Exclude rack F3 hosts from analytics cluster" [puppet] - https://gerrit.wikimedia.org/r/1145943 (https://phabricator.wikimedia.org/T390171) (owner: Stevemunene)
[07:24:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[07:24:53] <wikibugs>	 (CR) Brouberol: [C:+1] spark-history: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145229 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:25:07] <wikibugs>	 (CR) Brouberol: [C:+1] superset: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145230 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:26:25] <wikibugs>	 (CR) Elukey: [C:+2] python-webapp: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145226 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:26:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76185 and previous config saved to /var/cache/conftool/dbconfig/20250515-072625-root.json
[07:26:30] <wikibugs>	 (PS1) JMeybohm: CI test change - do not merge [deployment-charts] - https://gerrit.wikimedia.org/r/1146465
[07:26:33] <wikibugs>	 (CR) Elukey: [C:+2] recommendation-api: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145227 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:26:39] <wikibugs>	 (CR) Elukey: [C:+2] shellbox: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145228 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:26:48] <wikibugs>	 (CR) Elukey: [C:+2] spark-history: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145229 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:26:56] <wikibugs>	 (CR) Elukey: [C:+2] superset: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145230 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:27:03] <wikibugs>	 (CR) Elukey: [C:+2] tegola-vector-tiles: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145231 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:27:10] <wikibugs>	 (CR) Elukey: [C:+2] termbox: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145232 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:27:16] <wikibugs>	 (CR) Elukey: [C:+2] thumbor: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1145233 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:27:18] <wikibugs>	 (CR) Joal: [C:+1] "LGTM! Thank you :)" [alerts] - https://gerrit.wikimedia.org/r/1136383 (https://phabricator.wikimedia.org/T391810) (owner: Fabfur)
[07:27:30] <wikibugs>	 (PS11) Stevemunene: airflow: cleanup deployment charts [deployment-charts] - https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359)
[07:29:22] <wikibugs>	 (CR) Brouberol: "Almost all good! Just a minor not on `airflow-main/values-production.yaml`" [deployment-charts] - https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: Stevemunene)
[07:29:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[07:29:46] <wikibugs>	 (PS1) Elukey: toolhub: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146483 (https://phabricator.wikimedia.org/T391333)
[07:29:48] <wikibugs>	 (PS1) Elukey: wikifeeds: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146484 (https://phabricator.wikimedia.org/T391333)
[07:29:49] <wikibugs>	 (PS1) Elukey: zotero: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146485 (https://phabricator.wikimedia.org/T391333)
[07:29:57] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1146465 (owner: JMeybohm)
[07:30:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1041 es2043 T391921', diff saved to https://phabricator.wikimedia.org/P76186 and previous config saved to /var/cache/conftool/dbconfig/20250515-073033-marostegui.json
[07:30:37] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[07:31:03] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2043.codfw.wmnet,es1041.eqiad.wmnet with reason: Maintenance
[07:31:30] <wikibugs>	 (PS1) Marostegui: es1041: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146490 (https://phabricator.wikimedia.org/T391921)
[07:31:37] <wikibugs>	 (PS1) Majavah: Do not show thumbnails or descriptions on Wikitech search [mediawiki-config] - https://gerrit.wikimedia.org/r/1146491
[07:32:52] <wikibugs>	 (CR) Marostegui: [C:+2] es1041: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146490 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[07:33:24] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[07:34:04] <wikibugs>	 (PS1) Elukey: growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333)
[07:34:16] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[07:34:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76187 and previous config saved to /var/cache/conftool/dbconfig/20250515-073445-root.json
[07:35:04] <wikibugs>	 (PS1) Marostegui: es2043: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146494 (https://phabricator.wikimedia.org/T391921)
[07:35:30] <wikibugs>	 (CR) CI reject: [V:-1] growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[07:35:51] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[07:36:12] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[07:37:04] <wikibugs>	 (CR) Marostegui: [C:+2] es2043: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1146494 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[07:37:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76188 and previous config saved to /var/cache/conftool/dbconfig/20250515-073723-root.json
[07:38:58] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[07:40:29] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply
[07:41:06] <wikibugs>	 (PS2) Elukey: growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333)
[07:41:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76189 and previous config saved to /var/cache/conftool/dbconfig/20250515-074131-root.json
[07:41:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76190 and previous config saved to /var/cache/conftool/dbconfig/20250515-074142-root.json
[07:49:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76191 and previous config saved to /var/cache/conftool/dbconfig/20250515-074950-root.json
[07:50:43] <wikibugs>	 (CR) Federico Ceratto: "Thanks for the check. The configuration has been updated with more help from @cgoubert@wikimedia.org and should be ok now:" [puppet] - https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto)
[07:52:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76192 and previous config saved to /var/cache/conftool/dbconfig/20250515-075228-root.json
[07:53:06] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1146465 (owner: JMeybohm)
[07:56:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76193 and previous config saved to /var/cache/conftool/dbconfig/20250515-075636-root.json
[07:56:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76194 and previous config saved to /var/cache/conftool/dbconfig/20250515-075648-root.json
[07:58:09] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1145981 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[07:58:20] <wikibugs>	 (CR) JMeybohm: [C:+1] "This LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1145981 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[07:59:17] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1146465 (owner: JMeybohm)
[08:00:04] <jouncebot>	 jnuche and jeena: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0800)
[08:00:26] <jnuche>	 hi, I'll be rolling out the train in a few minutes
[08:00:28] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] toolhub: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146483 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[08:01:57] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:02:13] <wikibugs>	 (PS1) Brouberol: Copy app.generic to make the subsequent diff easier to review [deployment-charts] - https://gerrit.wikimedia.org/r/1146499 (https://phabricator.wikimedia.org/T391333)
[08:02:14] <wikibugs>	 (PS1) Brouberol: modules/app/generic: allow the definition of app env vars from a configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1146500 (https://phabricator.wikimedia.org/T391333)
[08:02:15] <wikibugs>	 (PS1) Brouberol: spark-history: re-introduce environment variable injection from a configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1146501 (https://phabricator.wikimedia.org/T391333)
[08:03:06] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.45.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1146502 (https://phabricator.wikimedia.org/T392171)
[08:03:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.45.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1146502 (https://phabricator.wikimedia.org/T392171) (owner: TrainBranchBot)
[08:03:30] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[08:03:35] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] zotero: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146485 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[08:03:39] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] wikifeeds: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146484 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[08:03:59] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.45.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1146502 (https://phabricator.wikimedia.org/T392171) (owner: TrainBranchBot)
[08:04:23] <wikibugs>	 (CR) Brouberol: [C:+1] growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[08:04:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76195 and previous config saved to /var/cache/conftool/dbconfig/20250515-080456-root.json
[08:05:58] <wikibugs>	 SRE-OnFire, SRE-swift-storage, Sustainability: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913#10825162 (Jelto) I'm adding #Sustainability (Incident Followup) and #SRE-OnFire tags here because this task was mentioned during one of the last swi...
[08:06:02] <wikibugs>	 (CR) Brouberol: "This patch cannot be rebased due to conflicts" [puppet] - https://gerrit.wikimedia.org/r/1142712 (https://phabricator.wikimedia.org/T390556) (owner: Aleksandar Mastilovic)
[08:07:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76196 and previous config saved to /var/cache/conftool/dbconfig/20250515-080733-root.json
[08:08:31] <wikibugs>	 (CR) Elukey: [C:+1] Copy app.generic to make the subsequent diff easier to review [deployment-charts] - https://gerrit.wikimedia.org/r/1146499 (https://phabricator.wikimedia.org/T391333) (owner: Brouberol)
[08:08:45] <wikibugs>	 (CR) Elukey: [C:+1] modules/app/generic: allow the definition of app env vars from a configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1146500 (https://phabricator.wikimedia.org/T391333) (owner: Brouberol)
[08:09:08] <wikibugs>	 (CR) Fabfur: [C:+2] data-engineering: duplicating varnishkafka alerts [alerts] - https://gerrit.wikimedia.org/r/1136383 (https://phabricator.wikimedia.org/T391810) (owner: Fabfur)
[08:09:10] <wikibugs>	 (CR) Elukey: [C:+1] spark-history: re-introduce environment variable injection from a configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1146501 (https://phabricator.wikimedia.org/T391333) (owner: Brouberol)
[08:09:20] <wikibugs>	 (CR) Brouberol: [C:+2] Copy app.generic to make the subsequent diff easier to review [deployment-charts] - https://gerrit.wikimedia.org/r/1146499 (https://phabricator.wikimedia.org/T391333) (owner: Brouberol)
[08:09:23] <wikibugs>	 (CR) Brouberol: [C:+2] modules/app/generic: allow the definition of app env vars from a configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1146500 (https://phabricator.wikimedia.org/T391333) (owner: Brouberol)
[08:09:26] <wikibugs>	 (CR) Brouberol: [C:+2] spark-history: re-introduce environment variable injection from a configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1146501 (https://phabricator.wikimedia.org/T391333) (owner: Brouberol)
[08:09:52] <wikibugs>	 (CR) AOkoth: "Ack. Okay, I'll merge this later then." [puppet] - https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[08:10:37] <wikibugs>	 (Merged) jenkins-bot: Copy app.generic to make the subsequent diff easier to review [deployment-charts] - https://gerrit.wikimedia.org/r/1146499 (https://phabricator.wikimedia.org/T391333) (owner: Brouberol)
[08:10:46] <wikibugs>	 (Merged) jenkins-bot: modules/app/generic: allow the definition of app env vars from a configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1146500 (https://phabricator.wikimedia.org/T391333) (owner: Brouberol)
[08:11:10] <wikibugs>	 (Merged) jenkins-bot: spark-history: re-introduce environment variable injection from a configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1146501 (https://phabricator.wikimedia.org/T391333) (owner: Brouberol)
[08:11:41] <jinxer-wm>	 FIRING: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:11:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76197 and previous config saved to /var/cache/conftool/dbconfig/20250515-081141-root.json
[08:11:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P76198 and previous config saved to /var/cache/conftool/dbconfig/20250515-081153-root.json
[08:12:03] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[08:12:40] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply
[08:13:59] <wikibugs>	 (CR) Elukey: [C:+2] toolhub: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146483 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[08:14:05] <wikibugs>	 (CR) Elukey: [C:+2] wikifeeds: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146484 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[08:14:13] <wikibugs>	 (CR) Elukey: [C:+2] zotero: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146485 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[08:14:21] <wikibugs>	 (CR) Elukey: [C:+2] growthbook: upgrade to mesh.configuration 1.13 [deployment-charts] - https://gerrit.wikimedia.org/r/1146493 (https://phabricator.wikimedia.org/T391333) (owner: Elukey)
[08:14:26] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[08:15:03] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply
[08:17:04] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.1  refs T392171
[08:17:07] <stashbot>	 T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171
[08:20:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76200 and previous config saved to /var/cache/conftool/dbconfig/20250515-082002-root.json
[08:20:55] <wikibugs>	 (CR) Vgutierrez: [C:+1] cache: lua lookup experiment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: Fabfur)
[08:21:36] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 2517 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:21:46] <wikibugs>	 (PS1) Brouberol: airflow: upggrade base image to include krenew [deployment-charts] - https://gerrit.wikimedia.org/r/1146504 (https://phabricator.wikimedia.org/T394293)
[08:21:56] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.45.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1146505 (https://phabricator.wikimedia.org/T392171)
[08:21:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.45.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1146505 (https://phabricator.wikimedia.org/T392171) (owner: TrainBranchBot)
[08:22:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76201 and previous config saved to /var/cache/conftool/dbconfig/20250515-082238-root.json
[08:22:51] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.45.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1146505 (https://phabricator.wikimedia.org/T392171) (owner: TrainBranchBot)
[08:22:59] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] icinga: remove HOSTOUTPUT from vo-host-notify-by-email [puppet] - https://gerrit.wikimedia.org/r/1145902 (https://phabricator.wikimedia.org/T264016) (owner: Filippo Giunchedi)
[08:23:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1187 for testing T264016', diff saved to https://phabricator.wikimedia.org/P76202 and previous config saved to /var/cache/conftool/dbconfig/20250515-082333-marostegui.json
[08:23:37] <stashbot>	 T264016: Host page did not auto-resolve in VO - https://phabricator.wikimedia.org/T264016
[08:26:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76203 and previous config saved to /var/cache/conftool/dbconfig/20250515-082659-root.json
[08:30:34] <marostegui>	 pages about db1187 are expected
[08:31:41] <wikibugs>	 (CR) MVernon: [C:+1] "Looks good to me - feel free to ignore the typo I spotted, but it'll make me happy if you do fix it :-)" [puppet] - https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) (owner: Eevans)
[08:31:41] <jinxer-wm>	 RESOLVED: [111x] ConfdResourceFailed: confd resource _var_netmapper_public_clouds.json.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:31:56] <icinga-wm>	 PROBLEM - Host db1187 #page is DOWN: PING CRITICAL - Packet loss = 100%
[08:31:59] <volans>	 !incidents
[08:31:59] <sirenbot>	 6124 (UNACKED)  Host db1187 (paged)
[08:31:59] <sirenbot>	 6123 (RESOLVED)  ProbeDown sre (10.2.2.30 ip4 probes/service eqiad)
[08:32:00] <sirenbot>	 6122 (RESOLVED)  ProbeDown sre (10.2.2.30 ip4 search-psi-https:9643 probes/service http_search-psi-https_ip4 eqiad)
[08:32:03] <volans>	 !ack 6124
[08:32:04] <sirenbot>	 6124 (ACKED)  Host db1187 (paged)
[08:32:13] <wikibugs>	 (PS1) MVernon: Thanos: add new thanos-fe100[5-7] nodes [puppet] - https://gerrit.wikimedia.org/r/1146511 (https://phabricator.wikimedia.org/T389635)
[08:33:48] <icinga-wm>	 RECOVERY - Host db1187 #page is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[08:34:49] <marostegui>	 !incidents
[08:34:50] <sirenbot>	 6124 (RESOLVED)  Host db1187 (paged)
[08:34:50] <sirenbot>	 6123 (RESOLVED)  ProbeDown sre (10.2.2.30 ip4 probes/service eqiad)
[08:34:50] <sirenbot>	 6122 (RESOLVED)  ProbeDown sre (10.2.2.30 ip4 search-psi-https:9643 probes/service http_search-psi-https_ip4 eqiad)
[08:35:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76204 and previous config saved to /var/cache/conftool/dbconfig/20250515-083540-root.json
[08:37:23] <marostegui>	 No more pages about db1187 are expected
[08:37:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76205 and previous config saved to /var/cache/conftool/dbconfig/20250515-083744-root.json
[08:38:49] <wikibugs>	 (CR) Marostegui: [C:+1] Thanos: add new thanos-fe100[5-7] nodes [puppet] - https://gerrit.wikimedia.org/r/1146511 (https://phabricator.wikimedia.org/T389635) (owner: MVernon)
[08:39:53] <wikibugs>	 (CR) MVernon: [C:+2] Thanos: add new thanos-fe100[5-7] nodes [puppet] - https://gerrit.wikimedia.org/r/1146511 (https://phabricator.wikimedia.org/T389635) (owner: MVernon)
[08:40:02] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudgw: codfw: introduce support for octavia-lb-mgmt-net [puppet] - https://gerrit.wikimedia.org/r/1146515 (https://phabricator.wikimedia.org/T394099)
[08:40:34] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1146465 (owner: JMeybohm)
[08:41:40] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:41:42] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146515 (https://phabricator.wikimedia.org/T394099) (owner: Arturo Borrero Gonzalez)
[08:41:46] <wikibugs>	 (PS1) Fabfur: Remove unused varnishkafka configuration [alerts] - https://gerrit.wikimedia.org/r/1146516 (https://phabricator.wikimedia.org/T391810)
[08:42:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76206 and previous config saved to /var/cache/conftool/dbconfig/20250515-084204-root.json
[08:42:19] <wikibugs>	 (CR) Volans: "I haven't tested but the code looks ok. I've left some optional nits that might simplify some bits, no blocker." [cookbooks] - https://gerrit.wikimedia.org/r/1141895 (https://phabricator.wikimedia.org/T391581) (owner: Federico Ceratto)
[08:44:32] <wikibugs>	 (CR) Vgutierrez: [C:-1] hiera: Add zarcillo k8s service on traffic server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto)
[08:46:57] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:15] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1177.eqiad.wmnet
[08:49:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825309 (ops-monitoring-bot) Host rebooted by stevemunene@cumin1002 with reason: Rebooting afte...
[08:50:39] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29762 bytes in 0.438 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:50:40] <dhinus>	 !log wikitech-static: rm -rf /srv/mediawiki/images/wikitech/archive/* (T338520)
[08:50:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76207 and previous config saved to /var/cache/conftool/dbconfig/20250515-085045-root.json
[08:50:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:50] <stashbot>	 T338520: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520
[08:52:08] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:52:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc7 T394260', diff saved to https://phabricator.wikimedia.org/P76208 and previous config saved to /var/cache/conftool/dbconfig/20250515-085256-marostegui.json
[08:53:00] <stashbot>	 T394260: Productionize pc8 - https://phabricator.wikimedia.org/T394260
[08:53:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76209 and previous config saved to /var/cache/conftool/dbconfig/20250515-085303-root.json
[08:54:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825319 (10Stevemunene) an-worker1177 seems stuck booting with the error [18134415.076569] system...
[08:54:48] <wikibugs>	 (03PS1) 10Muehlenhoff: imposm-initial-import: Enable the osmupdater DB permissions earlier [puppet] - 10https://gerrit.wikimedia.org/r/1146524 (https://phabricator.wikimedia.org/T381565)
[08:57:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76210 and previous config saved to /var/cache/conftool/dbconfig/20250515-085710-root.json
[09:04:35] <wikibugs>	 (03Abandoned) 10Vgutierrez: varnish: Allow /beacon/v2/event to hit origin servers [puppet] - 10https://gerrit.wikimedia.org/r/1143474 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[09:05:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76211 and previous config saved to /var/cache/conftool/dbconfig/20250515-090551-root.json
[09:07:55] <Emperor>	 !log reboot thanos-fe100[5-7] prior to bringing into service T391352
[09:07:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:58] <stashbot>	 T391352: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352
[09:08:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76212 and previous config saved to /var/cache/conftool/dbconfig/20250515-090808-root.json
[09:10:15] <icinga-wm>	 PROBLEM - Host thanos-fe1006 is DOWN: PING CRITICAL - Packet loss = 100%
[09:10:27] <icinga-wm>	 PROBLEM - Host thanos-fe1005 is DOWN: PING CRITICAL - Packet loss = 100%
[09:10:29] <icinga-wm>	 PROBLEM - Host thanos-fe1007 is DOWN: PING CRITICAL - Packet loss = 100%
[09:11:29] <icinga-wm>	 RECOVERY - Host thanos-fe1006 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms
[09:11:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "IIRC alert files should be removed post-deploy, please verify after deploy and a puppet run in /srv/alerts/ops/team-data-engineering_* on " [alerts] - 10https://gerrit.wikimedia.org/r/1146516 (https://phabricator.wikimedia.org/T391810) (owner: 10Fabfur)
[09:11:57] <icinga-wm>	 RECOVERY - Host thanos-fe1005 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[09:11:57] <icinga-wm>	 RECOVERY - Host thanos-fe1007 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[09:12:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76213 and previous config saved to /var/cache/conftool/dbconfig/20250515-091216-root.json
[09:14:04] <wikibugs>	 (03PS6) 10Vgutierrez: trafficserver: Send /evt-103e/v2/event to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411)
[09:14:20] <wikibugs>	 (03PS1) 10Zabe: FlaggablePageView: don't call getId() on null [extensions/FlaggedRevs] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146528 (https://phabricator.wikimedia.org/T394381)
[09:15:54] <wikibugs>	 (03CR) 10Alexandros Kosiaris: partman: Add a kubernetes-node-containerd-efi recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris)
[09:17:15] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: partman: Add a kubernetes-node-containerd-efi recipe [puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053)
[09:17:15] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: preseed: Use EFI recipes for aux-k8s-worker[12]00[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1144627 (https://phabricator.wikimedia.org/T393053)
[09:17:54] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[09:19:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:19:07] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:19:07] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:19:09] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:19:17] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:19:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:19:17] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:19:52] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:19:57] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:19:57] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:19:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:20:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:20:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:20:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:20:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76214 and previous config saved to /var/cache/conftool/dbconfig/20250515-092056-root.json
[09:22:00] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.1  refs T392171
[09:22:03] <stashbot>	 T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171
[09:22:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good! (To the extent that Partman recipes can look good)" [puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris)
[09:23:07] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe
[09:23:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76215 and previous config saved to /var/cache/conftool/dbconfig/20250515-092314-root.json
[09:25:08] <Dreamy_Jazz>	 jouncebot: nowandnext
[09:25:08] <jouncebot>	 For the next 0 hour(s) and 34 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T0800)
[09:25:09] <jouncebot>	 In 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1000)
[09:26:06] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] imposm-initial-import: Enable the osmupdater DB permissions earlier [puppet] - 10https://gerrit.wikimedia.org/r/1146524 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[09:26:56] <zabe>	 Dreamy_Jazz: feel free to backport
[09:26:56] <Dreamy_Jazz>	 zabe: Do you want to backport the UBN fix? If not, I'm happy to do that.
[09:27:01] <Dreamy_Jazz>	 Thanks. Will do.
[09:27:05] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/weight=100; selector: name=thanos-fe1005.eqiad.wmnet
[09:27:10] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/weight=100; selector: name=thanos-fe1006.eqiad.wmnet
[09:27:10] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] FlaggablePageView: don't call getId() on null [extensions/FlaggedRevs] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146528 (https://phabricator.wikimedia.org/T394381) (owner: 10Zabe)
[09:27:15] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/weight=100; selector: name=thanos-fe1007.eqiad.wmnet
[09:27:15] <zabe>	 is "feel free" a thing in english?
[09:27:20] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: name=thanos-fe1005.eqiad.wmnet
[09:27:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76216 and previous config saved to /var/cache/conftool/dbconfig/20250515-092721-root.json
[09:27:24] <zabe>	 or am I just doing a bad translation?
[09:27:25] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: name=thanos-fe1006.eqiad.wmnet
[09:27:29] <logmsgbot>	 !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: name=thanos-fe1007.eqiad.wmnet
[09:27:35] <NovemLinguae>	 yeap. it's a thing. sounds fluent
[09:27:38] <Dreamy_Jazz>	 "Feel free" is a thing in english
[09:28:38] <wikibugs>	 (03Merged) 10jenkins-bot: FlaggablePageView: don't call getId() on null [extensions/FlaggedRevs] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146528 (https://phabricator.wikimedia.org/T394381) (owner: 10Zabe)
[09:28:44] <codders>	 hi train people. We have a CI blocker on Wikibase - is there anything that speaks against me +2'ing a patch to the zuul config and redeploying it right now? (https://gerrit.wikimedia.org/r/c/integration/config/+/1146520)
[09:28:46] <zabe>	 nice
[09:29:19] <taavi>	 codders: zuul config seems like a #wikimedia-releng question
[09:29:26] <codders>	 yeah. bit quiet over there
[09:29:36] <codders>	 just wanted to make sure it wouldn't interfere with operations
[09:29:53] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1146528|FlaggablePageView: don't call getId() on null (T394381)]]
[09:29:56] <stashbot>	 T394381: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T394381
[09:30:25] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[09:30:45] <wikibugs>	 (03Abandoned) 10Hnowlan: mw::maintenance: migrate refreshLinkRecommendations s1 shard to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143528 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[09:30:51] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1177.eqiad.wmnet
[09:30:55] <jnuche>	 codders: that shouldn't affect the train
[09:31:03] <codders>	 (y) thanks!
[09:31:43] <wikibugs>	 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10825378 (10MatthewVernon)
[09:32:50] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:34:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] "lol, agreed! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1143817 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris)
[09:34:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] preseed: Use EFI recipes for aux-k8s-worker[12]00[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1144627 (https://phabricator.wikimedia.org/T393053) (owner: 10Alexandros Kosiaris)
[09:36:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76217 and previous config saved to /var/cache/conftool/dbconfig/20250515-093602-root.json
[09:36:36] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz, zabe: Backport for [[gerrit:1146528|FlaggablePageView: don't call getId() on null (T394381)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:36:39] <stashbot>	 T394381: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T394381
[09:37:11] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz, zabe: Continuing with sync
[09:37:47] <Dreamy_Jazz>	 https://test2.wikipedia.org/wiki/Testpage1 no longer has a fatal error. Couldn't reproduce with the `action=veedit` so maybe you have to press save for that case.
[09:38:06] <Dreamy_Jazz>	 *`veaction=edit`
[09:39:02] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:44:02] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:44:24] <logmsgbot>	 !log isaranto@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:45:54] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146528|FlaggablePageView: don't call getId() on null (T394381)]] (duration: 16m 00s)
[09:45:57] <stashbot>	 T394381: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T394381
[09:50:22] <jnuche>	 Dreamy_Jazz: I'm seeing the bug on a test server one minute after the backport synchronized there
[09:50:37] <jnuche>	 is it possible the bug is still present?
[09:50:49] <Dreamy_Jazz>	 Hmm. I was testing using https://test2.wikipedia.org/wiki/Testpage1
[09:51:04] <jnuche>	 https://usercontent.irccloud-cdn.com/file/S6Euohjg/image.png
[09:51:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76218 and previous config saved to /var/cache/conftool/dbconfig/20250515-095108-root.json
[09:51:13] <Dreamy_Jazz>	 I can't seem to reproduce the error now.
[09:51:17] <Dreamy_Jazz>	 Using that URL
[09:51:58] <jnuche>	 Dreamy_Jazz: sounds good, maybe it was some lag when generating the logstash timestamp
[09:52:07] <jnuche>	 thank you!
[09:53:07] <jnuche>	 ok, rolling forward the train
[09:53:29] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146535 (https://phabricator.wikimedia.org/T392171)
[09:53:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146535 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot)
[09:54:20] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146535 (https://phabricator.wikimedia.org/T392171) (owner: 10TrainBranchBot)
[09:57:14] <wikibugs>	 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10825452 (10taavi) The site at http://ec2-54-81-201-239.compute-1.amazonaws.com/ seems to embed images from `upload.wikimedia.org`, for pages like n...
[09:58:53] <wikibugs>	 (03PS6) 10Hnowlan: mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782)
[09:59:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1000)
[10:00:15] <jnuche>	 please be aware the train is still running
[10:04:46] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] cloudgw: codfw: introduce support for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146515 (https://phabricator.wikimedia.org/T394099) (owner: 10Arturo Borrero Gonzalez)
[10:05:04] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Enable link-protection on OSPF links on EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[10:05:45] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1074 is CRITICAL: CRITICAL - cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), nnwiki_content_1727897783[0](2025-05-12T14:48:52.049Z), enwikiquote_content_1727930976[0](2025-05-12T14:50:54.219Z), ruwiki_content_1727993503[6](2025-05-12T15:10:07.603Z) https://wikitech.wikimedia.org/wiki/Search%23Administrati
[10:05:49] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9400 on cirrussearch1074 is CRITICAL: CRITICAL - azwikiquote_content_1727944408[0](2025-05-09T20:16:25.362Z), skwiktionary_archive_1717364159[0](2025-05-09T20:16:16.114Z), mlwikiquote_content_1728089481[0](2025-05-12T17:12:12.133Z), id_internalwikimedia_content_1717526458[0](2025-05-12T14:44:10.676Z), urwiktionary_content_1728117663[0](2025-05-12T17:12:25.526Z), sdwiki_content_1728047554[0](20
[10:05:49] <icinga-wm>	 T17:12:52.192Z), fiwikibooks_content_1728060458[0](2025-05-12T14:44:16.723Z), newiktionary_content_1728013854[0](2025-05-12T14:44:11.709Z), ukwiktionary_content_1728125590[0](2025-05-12T14:44:51.053Z), kabwiki_content_1727944513[0](2025-05-12T17:12:24.299Z), ocwiktionary_content_1728036052[0](2025-05-12T17:12:45.626Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:05:52] <wikibugs>	 (03Merged) 10jenkins-bot: Enable link-protection on OSPF links on EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/1145977 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[10:07:14] <logmsgbot>	 !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.1  refs T392171
[10:07:18] <stashbot>	 T392171: 1.45.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T392171
[10:08:26] <Emperor>	 !log depool thanos-fe100[1-3] prior to decom T391352
[10:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:30] <stashbot>	 T391352: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352
[10:10:11] <effie>	 jouncebot: now
[10:10:12] <jouncebot>	 For the next 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1000)
[10:10:22] <effie>	 jnuche: train is still running?
[10:10:49] <effie>	 ah yes, please ping me when you are done :)
[10:11:54] <wikibugs>	 (03CR) 10Fabfur: [C:04-1] "You mean after disabling varnishkafka everywhere? I'm ok with that, I'll flag this with a -1 as reminder" [alerts] - 10https://gerrit.wikimedia.org/r/1146516 (https://phabricator.wikimedia.org/T391810) (owner: 10Fabfur)
[10:12:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.538s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:12:37] <jnuche>	 effie: just finished and things look healthy enough
[10:12:41] <jnuche>	 please go ahead :)
[10:13:13] <effie>	 cheers!
[10:14:35] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mcrouter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli)
[10:14:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:15:10] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ad
[10:15:24] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ad
[10:16:04] <wikibugs>	 (03Merged) 10jenkins-bot: mcrouter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1141282 (owner: 10Effie Mouzeli)
[10:17:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:19:34] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow: cleanup deployment charts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene)
[10:19:43] <wikibugs>	 (03CR) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene)
[10:19:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:19:49] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
[10:20:54] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ad
[10:20:55] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local-public.ad
[10:21:08] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1145093 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah)
[10:21:31] <effie>	 !log mw-mcrouter minor update, memcached errors are expected
[10:21:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:48] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1145094 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah)
[10:21:55] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: route mobileapps apis for zhwiki via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1146544 (https://phabricator.wikimedia.org/T393591)
[10:23:46] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye
[10:23:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825606 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cu...
[10:25:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[10:26:45] <effie>	 ^ expected
[10:27:54] <icinga-wm>	 PROBLEM - OpenSearch unassigned shard check - 9200 on cirrussearch1075 is CRITICAL: CRITICAL - enwikibooks_archive_1717258160[0](2025-05-09T20:16:16.046Z), cebwiki_content_1741288099[4](2025-05-09T20:59:37.634Z), nnwiki_content_1727897783[0](2025-05-12T14:48:52.049Z), ruwiki_content_1727993503[6](2025-05-12T15:10:07.603Z), enwikiquote_content_1727930976[0](2025-05-12T14:50:54.219Z) https://wikitech.wikimedia.org/wiki/Search%23Administrati
[10:29:20] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] trafficserver: route mobileapps apis for zhwiki via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1146544 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan)
[10:29:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:29:53] <hnowlan>	 jouncebot: nowandnext
[10:29:54] <jouncebot>	 For the next 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1000)
[10:29:54] <jouncebot>	 In 1 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1200)
[10:29:56] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ad
[10:29:57] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local-public.ad
[10:30:04] <wikibugs>	 (03CR) 10Elukey: [C:03+1] imposm-initial-import: Enable the osmupdater DB permissions earlier [puppet] - 10https://gerrit.wikimedia.org/r/1146524 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[10:30:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[10:30:50] <effie>	 hnowlan: I am deploying mcrouter
[10:31:02] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: route mobileapps apis for zhwiki via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1146544 (https://phabricator.wikimedia.org/T393591) (owner: 10Hnowlan)
[10:32:13] <hnowlan>	 effie: my change shouldn't interfere
[10:32:15] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.ad
[10:32:23] <wikibugs>	 (03PS7) 10Vgutierrez: trafficserver: Send /evt-103e/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411)
[10:32:37] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.ad
[10:33:36] <effie>	 hnowlan: excellent!
[10:34:01] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:34:03] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mw::periodic_job: add concurrency parameter to k8s jobs [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[10:34:24] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.f2
[10:34:58] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.f2
[10:35:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[10:35:47] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: keystone: Update ACLs for cloud-private v6 [puppet] - 10https://gerrit.wikimedia.org/r/1145093 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah)
[10:35:54] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: rabbitmq: Add cloud-private v6 nets to firewall [puppet] - 10https://gerrit.wikimedia.org/r/1145094 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah)
[10:36:07] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cd
[10:36:09] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local-public.cd
[10:36:35] <logmsgbot>	 !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
[10:37:26] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cd
[10:38:01] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cd
[10:38:40] <wikibugs>	 (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-05-15-103617-production [puppet] - 10https://gerrit.wikimedia.org/r/1146546
[10:39:01] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:40:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[10:41:04] <wikibugs>	 (PS1) Stevemunene: hdfs: add an-worker1177 to in retup role [puppet] - https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171)
[10:43:30] <icinga-wm>	 RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (32816 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:44:34] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Update striker-toolsbeta to 2025-05-15-103617-production [puppet] - https://gerrit.wikimedia.org/r/1146546 (owner: Majavah)
[10:44:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:45:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:47:33] <wikibugs>	 (PS1) Gkyziridis: ml-services: edit-check cpu/gpu deployment experimental staging [deployment-charts] - https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154)
[10:47:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825707 (Stevemunene) Host is still stuck, checking the partman recipe and trying the reimage....
[10:48:08] <wikibugs>	 (CR) Btullis: hdfs: add an-worker1177 to in retup role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171) (owner: Stevemunene)
[10:48:45] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cd
[10:49:02] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-public.cd
[10:49:25] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-public.cde
[10:49:26] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local-public.cde
[10:49:30] <wikibugs>	 (CR) Stevemunene: hdfs: add an-worker1177 to in retup role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171) (owner: Stevemunene)
[10:49:45] <jinxer-wm>	 RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:50:42] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395 (Neslihan_Turan_WMDE) NEW
[10:53:05] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[10:53:09] <wikibugs>	 (PS1) Btullis: dumps: Add the addschanges.conf file [deployment-charts] - https://gerrit.wikimedia.org/r/1146549 (https://phabricator.wikimedia.org/T394389)
[10:53:17] <logmsgbot>	 !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[10:53:29] <wikibugs>	 (CR) Btullis: [C:+1] hdfs: add an-worker1177 to in retup role [puppet] - https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171) (owner: Stevemunene)
[10:53:51] <wikibugs>	 (CR) Stevemunene: [C:+2] Revert "hdfs: Exclude rack F3 hosts from analytics cluster" [puppet] - https://gerrit.wikimedia.org/r/1145943 (https://phabricator.wikimedia.org/T390171) (owner: Stevemunene)
[10:56:31] <wikibugs>	 SRE, Traffic: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10825755 (Vgutierrez) Open→In progress p:Triage→Unbreak! Let's encrypt already stopped including OCSP urls in new certificates and it's already caus...
[10:56:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152319 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:57:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[10:58:05] <wikibugs>	 (CR) Stevemunene: [C:+2] hdfs: add an-worker1177 to in retup role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1146547 (https://phabricator.wikimedia.org/T390171) (owner: Stevemunene)
[10:59:21] <effie>	 memcached errors are expected 
[11:01:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:02:12] <wikibugs>	 (PS1) Vgutierrez: profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs [puppet] - https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821)
[11:02:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[11:02:32] <wikibugs>	 (PS1) Fabfur: submodule update for deploy [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1146551
[11:02:54] <wikibugs>	 (PS2) Vgutierrez: profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs [puppet] - https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821)
[11:02:57] <wikibugs>	 (CR) AikoChou: ml-services: edit-check cpu/gpu deployment experimental staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) (owner: Gkyziridis)
[11:02:59] <wikibugs>	 (PS9) MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - https://gerrit.wikimedia.org/r/1146007
[11:03:44] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[11:04:35] <logmsgbot>	 !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1177.eqiad.wmnet with OS bullseye
[11:04:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825773 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1...
[11:05:11] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "It looks like the setting only became unused in wmf.1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: Sbisson)
[11:05:13] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1156.eqiad.wmnet
[11:05:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825774 (ops-monitoring-bot) Host rebooted by stevemunene@cumin1002 with reason: Rebooting afte...
[11:05:59] <wikibugs>	 (Abandoned) Fabfur: submodule update for deploy [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1146551 (owner: Fabfur)
[11:06:03] <wikibugs>	 (PS1) Vgutierrez: ncredir: Stop using OCSP stapling [puppet] - https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821)
[11:06:40] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: Vgutierrez)
[11:06:51] <wikibugs>	 (PS2) Vgutierrez: ncredir: Stop using OCSP stapling [puppet] - https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821)
[11:07:02] <wikibugs>	 (PS1) Fabfur: New deploy for last modification [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1146553
[11:07:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[11:08:01] <wikibugs>	 (CR) Fabfur: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821) (owner: Vgutierrez)
[11:08:02] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821) (owner: Vgutierrez)
[11:08:40] <wikibugs>	 (CR) Fabfur: [V:+2 C:+2] New deploy for last modification [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1146553 (owner: Fabfur)
[11:09:04] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:09:58] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Minor template modification - fabfur@cumin1002"
[11:10:00] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Minor template modification - fabfur@cumin1002
[11:10:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[11:10:33] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Minor template modification - fabfur@cumin1002
[11:10:34] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Minor template modification - fabfur@cumin1002"
[11:11:01] <wikibugs>	 (PS1) Muehlenhoff: apt: Remove OCSP stapling [puppet] - https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821)
[11:11:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:11:20] <wikibugs>	 (PS1) Vgutierrez: wikidough: Stop using OCSP [puppet] - https://gerrit.wikimedia.org/r/1146556 (https://phabricator.wikimedia.org/T370821)
[11:11:44] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821) (owner: Muehlenhoff)
[11:12:02] <wikibugs>	 (CR) Vgutierrez: [C:+2] ncredir: Stop using OCSP stapling [puppet] - https://gerrit.wikimedia.org/r/1146552 (https://phabricator.wikimedia.org/T370821) (owner: Vgutierrez)
[11:12:28] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1156.eqiad.wmnet
[11:13:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (2001:df5:b800:bb00::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr2-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBG
[11:14:04] <wikibugs>	 (PS2) Gkyziridis: ml-services: edit-check cpu/gpu deployment experimental staging [deployment-charts] - https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154)
[11:14:17] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146556 (https://phabricator.wikimedia.org/T370821) (owner: Vgutierrez)
[11:15:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[11:15:19] <wikibugs>	 (CR) Vgutierrez: [C:+1] apt: Remove OCSP stapling [puppet] - https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821) (owner: Muehlenhoff)
[11:16:01] <wikibugs>	 (CR) Ssingh: [C:+1] wikidough: Stop using OCSP [puppet] - https://gerrit.wikimedia.org/r/1146556 (https://phabricator.wikimedia.org/T370821) (owner: Vgutierrez)
[11:16:08] <wikibugs>	 (CR) Fabfur: [C:+1] apt: Remove OCSP stapling [puppet] - https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821) (owner: Muehlenhoff)
[11:16:36] <wikibugs>	 (CR) Muehlenhoff: [C:+2] apt: Remove OCSP stapling [puppet] - https://gerrit.wikimedia.org/r/1146555 (https://phabricator.wikimedia.org/T370821) (owner: Muehlenhoff)
[11:17:10] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:12] <wikibugs>	 (CR) Vgutierrez: [C:+2] wikidough: Stop using OCSP [puppet] - https://gerrit.wikimedia.org/r/1146556 (https://phabricator.wikimedia.org/T370821) (owner: Vgutierrez)
[11:17:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:34] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:40] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:40] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:50] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:54] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:17:54] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:18:00] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir4001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:18:00] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:18:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:18:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:18:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:18:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:18:16] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir6002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:18:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825823 (Stevemunene) Host an-worker1156 is getting onboarded to the cluster {F60011424}
[11:18:39] <jinxer-wm>	 RESOLVED: TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (2001:df5:b800:bb00::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr2-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransit
[11:18:47] <wikibugs>	 (CR) Gkyziridis: ml-services: edit-check cpu/gpu deployment experimental staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) (owner: Gkyziridis)
[11:19:16] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:16] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:16] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:16] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:23] <wikibugs>	 (CR) Stevemunene: [C:+1] dumps: Add the addschanges.conf file [deployment-charts] - https://gerrit.wikimedia.org/r/1146549 (https://phabricator.wikimedia.org/T394389) (owner: Btullis)
[11:19:52] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:52] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:52] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:52] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:52] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:53] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:53] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:54] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:19:54] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:10] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:10] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:16] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:18] <vgutierrez>	 sigh... sorry about the flood
[11:20:38] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:38] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:38] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:40] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:20:43] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:21:02] <sukhe>	 !log sudo cumin -b1 -s10 "A:wikidough" "run-puppet-agent": T370821
[11:21:04] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:21:04] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:21:04] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:21:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:08] <stashbot>	 T370821: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821
[11:21:23] <wikibugs>	 (PS3) Gkyziridis: ml-services: edit-check cpu/gpu deployment experimental staging [deployment-charts] - https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154)
[11:21:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:21:54] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:00] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:04] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:04] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:04] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:05] <wikibugs>	 (CR) Ssingh: profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: Vgutierrez)
[11:22:10] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:38] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:38] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:40] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:22:53] <volans>	 sukhe: is your patch fixing the above alerts?
[11:23:07] <vgutierrez>	 volans: yes
[11:23:09] <sukhe>	 volans: vg's patch is going to fix that but we are still missing one thing
[11:23:13] <sukhe>	 (on it)
[11:23:13] <volans>	 found the answer in the backlog :D
[11:23:15] <volans>	 thx
[11:23:25] <volans>	 was hidden in the flood :D
[11:23:29] <sukhe>	 sorry for the noise.
[11:23:36] <volans>	 no worries
[11:23:37] <sukhe>	 caught us by surprise :)
[11:23:41] <wikibugs>	 (03CR) 10Vgutierrez: profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez)
[11:23:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:23:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:23:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:23:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:23:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:23:42] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:23:46] <sukhe>	 I am going to silence
[11:23:49] <volans>	 k
[11:24:10] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:24:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:24:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:24:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:24:19] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs [puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez)
[11:24:26] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir6001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:25:39] <wikibugs>	 (03PS1) 10Vgutierrez: ncredir: Stop requiring OCSP on ssl monitor [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821)
[11:26:08] <wikibugs>	 (03PS2) 10Vgutierrez: ncredir: Stop requiring OCSP on ssl monitor [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821)
[11:26:20] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] "Thanks for taking care of this issue!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis)
[11:26:41] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez)
[11:27:00] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:02] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:02] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:06] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:06] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:26] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] profile,cache: Stop monitoring OCSP freshness for acme-chief managed certs [puppet] - 10https://gerrit.wikimedia.org/r/1146550 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez)
[11:27:38] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:46] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:54] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] ml-services: edit-check cpu/gpu deployment experimental staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146548 (https://phabricator.wikimedia.org/T393154) (owner: 10Gkyziridis)
[11:27:54] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:54] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:27:54] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:28:00] <icinga-wm>	 PROBLEM - HTTPS on apt1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/APT_repository
[11:28:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:28:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:28:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:28:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:28:14] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir3004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:28:32] <logmsgbot>	 !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 14 hosts with reason: monitoring alerts
[11:28:36] <sukhe>	 it's a blanket downtime, but it's controlling the flood
[11:28:43] <sukhe>	 will monitor and remove individually
[11:30:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] ncredir: Stop requiring OCSP on ssl monitor [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez)
[11:30:39] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] ncredir: Stop requiring OCSP on ssl monitor [puppet] - 10https://gerrit.wikimedia.org/r/1146559 (https://phabricator.wikimedia.org/T370821) (owner: 10Vgutierrez)
[11:30:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix apt.wikimedia.org health check now that OCSP is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1146562
[11:31:15] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Fix apt.wikimedia.org health check now that OCSP is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1146562 (owner: 10Muehlenhoff)
[11:31:44] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145506 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:31:55] <logmsgbot>	 !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:33:08] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10825867 (10Vgutierrez) p:05Unbreak!→03High
[11:35:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Fix apt.wikimedia.org health check now that OCSP is disabled [puppet] - 10https://gerrit.wikimedia.org/r/1146562 (owner: 10Muehlenhoff)
[11:35:45] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10825871 (10Jclark-ctr) @MatthewVernon  Nvme for os drives require uefi booting
[11:39:29] <wikibugs>	 (03CR) 10Btullis: [C:03+2] dumps: Add the addschanges.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146549 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis)
[11:40:37] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host apus-be1004.eqiad.wmnet with OS bookworm
[11:40:43] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10825899 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm
[11:40:58] <wikibugs>	 (03Merged) 10jenkins-bot: dumps: Add the addschanges.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146549 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis)
[11:41:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152319 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:41:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove now unused and obsolete LE OCSP health check [puppet] - 10https://gerrit.wikimedia.org/r/1146563 (https://phabricator.wikimedia.org/T370821)
[11:41:55] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-6 on ncredir3004 is OK: SSL OK - Certificate wikipedia.fi valid until 2025-06-27 04:31:45 +0000 (expires in 42 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:42:13] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-11 on ncredir3004 is OK: SSL OK - Certificate weekipedia.com valid until 2025-08-03 15:53:03 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:42:13] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-8 on ncredir3004 is OK: SSL OK - Certificate wikimediacommons.uk valid until 2025-07-15 15:17:13 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:42:13] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-4 on ncredir3004 is OK: SSL OK - Certificate *.wikispecies.net valid until 2025-07-19 04:44:00 +0000 (expires in 64 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:42:15] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-2 on ncredir3004 is OK: SSL OK - Certificate *.wikimania.com valid until 2025-07-19 06:44:30 +0000 (expires in 64 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:42:15] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-9 on ncredir3004 is OK: SSL OK - Certificate wikipediashop.com valid until 2025-07-22 18:14:58 +0000 (expires in 68 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:42:37] <wikibugs>	 (03PS1) 10Kamila Součková: mw::maintenance: migrate growthexperiments-updateIsActiveFlagForMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146566 (https://phabricator.wikimedia.org/T385782)
[11:42:51] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146566 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková)
[11:43:17] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Makes sense, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1146563 (https://phabricator.wikimedia.org/T370821) (owner: 10Muehlenhoff)
[11:43:36] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye
[11:43:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10825905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1...
[11:44:08] <wikibugs>	 (03PS1) 10Dreamy Jazz: CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist [extensions/AbuseFilter] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146568 (https://phabricator.wikimedia.org/T394267)
[11:44:27] <Dreamy_Jazz>	 jouncebot: nowandnext
[11:44:27] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 15 minute(s)
[11:44:27] <jouncebot>	 In 0 hour(s) and 15 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1200)
[11:44:34] <sukhe>	 ncredir alerts should be clearing up
[11:44:44] <sukhe>	 I am removing the downtime so that we get alerted about other stuff. please ignore the noise for a bit.
[11:44:47] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for 14 hosts
[11:44:53] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts
[11:44:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:58] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:58] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-8 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:59] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:44:59] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:00] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:00] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:01] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:01] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-10 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:02] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-7 on ncredir5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:03] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist [extensions/AbuseFilter] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146568 (https://phabricator.wikimedia.org/T394267) (owner: 10Dreamy Jazz)
[11:45:04] <sukhe>	 !log removing downtime on A:ncredir
[11:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:11] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:45:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/AbuseFilter] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146568 (https://phabricator.wikimedia.org/T394267) (owner: 10Dreamy Jazz)
[11:45:41] <wikibugs>	 (03PS1) 10Kamila Součková: mw::maintenance: migrate growthexperiments-refreshPraiseworthyMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146569 (https://phabricator.wikimedia.org/T385782)
[11:45:55] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146569 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková)
[11:45:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-9 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:45:57] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-11 on ncredir4002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/Ncredir
[11:46:12] <sukhe>	 ^ agent is running so these should clear up
[11:46:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove krb1001 from the list of KDCs presented to clients [puppet] - 10https://gerrit.wikimedia.org/r/1146570 (https://phabricator.wikimedia.org/T390863)
[11:48:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 18.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:48:58] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146566 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková)
[11:50:17] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:51:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[11:52:17] <icinga-wm>	 RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 422, down: 5, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:53:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 16.77% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:54:57] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-1 on ncredir2001 is OK: SSL OK - Certificate wikipedia.com valid until 2025-07-28 21:32:43 +0000 (expires in 74 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:54:57] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-6 on ncredir2001 is OK: SSL OK - Certificate wikipedia.fi valid until 2025-06-27 04:31:45 +0000 (expires in 42 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:54:57] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-8 on ncredir2001 is OK: SSL OK - Certificate wikimediacommons.uk valid until 2025-07-15 15:17:13 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:54:57] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-6 on ncredir2002 is OK: SSL OK - Certificate wikipedia.fi valid until 2025-06-27 04:31:45 +0000 (expires in 42 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:54:57] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-10 on ncredir2001 is OK: SSL OK - Certificate wikipediya.org valid until 2025-08-04 16:51:59 +0000 (expires in 81 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:54:57] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-7 on ncredir2001 is OK: SSL OK - Certificate wikipedia.ro valid until 2025-07-01 19:44:46 +0000 (expires in 47 days) https://wikitech.wikimedia.org/wiki/Ncredir
[11:56:10] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10825944 (10MatthewVernon) @Jclark-ctr EFI booting is fine (I thought I'd said as much on a previous ticket, but may have missed it); I don't want the OS on the NVME drive, the O...
[11:56:39] <jinxer-wm>	 RESOLVED: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[11:56:49] <wikibugs>	 (03Merged) 10jenkins-bot: CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist [extensions/AbuseFilter] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146568 (https://phabricator.wikimedia.org/T394267) (owner: 10Dreamy Jazz)
[11:56:57] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10825945 (10Jclark-ctr) The boss card is 2x m2 nvme drives
[11:57:05] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1146568|CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist (T394267)]]
[11:57:08] <stashbot>	 T394267: PHP Deprecated: Use of MediaWiki\Extension\AbuseFilter\BlockedDomains\CustomBlockedDomainStorage::validateDomain was deprecated in MediaWiki 1.44. [Called from MediaWiki\Extension\VisualEditor\EditCheck\ApiEditCheckReferenceUrl - https://phabricator.wikimedia.org/T394267
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1200)
[12:00:09] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[12:01:57] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:02:27] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[12:02:37] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[12:03:02] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[12:03:04] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[12:03:41] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1146568|CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist (T394267)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:03:44] <stashbot>	 T394267: PHP Deprecated: Use of MediaWiki\Extension\AbuseFilter\BlockedDomains\CustomBlockedDomainStorage::validateDomain was deprecated in MediaWiki 1.44. [Called from MediaWiki\Extension\VisualEditor\EditCheck\ApiEditCheckReferenceUrl - https://phabricator.wikimedia.org/T394267
[12:03:47] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[12:05:09] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[12:07:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152319 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:09:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] imposm-initial-import: Enable the osmupdater DB permissions earlier [puppet] - 10https://gerrit.wikimedia.org/r/1146524 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[12:09:16] <logmsgbot>	 jclark@cumin1002 reimage (PID 2241989) is awaiting input
[12:10:13] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[12:10:35] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146568|CustomBlockedDomainStorage::validateDomain: Undo hard-deprecation whilst prod callers exist (T394267)]] (duration: 13m 30s)
[12:10:39] <stashbot>	 T394267: PHP Deprecated: Use of MediaWiki\Extension\AbuseFilter\BlockedDomains\CustomBlockedDomainStorage::validateDomain was deprecated in MediaWiki 1.44. [Called from MediaWiki\Extension\VisualEditor\EditCheck\ApiEditCheckReferenceUrl - https://phabricator.wikimedia.org/T394267
[12:11:02] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) (owner: 10Stevemunene)
[12:31:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2070:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2070 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:32:44] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10826027 (10MatthewVernon) Oh, right, yes, we want the OS on that (which I thought was going to be presented to the OS as a single device, doing RAID-1 in hardware), sorry.
[12:35:11] <wikibugs>	 (03CR) 10Sbisson: [C:04-2] "Yes, I was going to re-evaluate this morning and indeed it's too early. I'll consider it again early next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson)
[12:38:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove now unused and obsolete LE OCSP health check [puppet] - 10https://gerrit.wikimedia.org/r/1146563 (https://phabricator.wikimedia.org/T370821) (owner: 10Muehlenhoff)
[12:40:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2070:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2070 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:41:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1146579 (owner: 10Slyngshede)
[12:41:40] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[12:45:22] <wikibugs>	 (03PS1) 10Klausman: preseed: Switch soon-to-arrive ML GPU hosts to using EFI [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948)
[12:45:34] <wikibugs>	 (03CR) 10Klausman: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman)
[12:46:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[12:46:57] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:50:14] <wikibugs>	 (03PS1) 10Andrew Bogott: network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099)
[12:50:35] <wikibugs>	 (03PS1) 10DDesouza: Design Research participant recruitment survey on eswiki: Deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146599 (https://phabricator.wikimedia.org/T394315)
[12:51:03] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Jonathan Tweed - https://phabricator.wikimedia.org/T394308#10826075 (10Bmueller) Approved, thanks!
[12:51:19] <wikibugs>	 (03CR) 10Andrew Bogott: "equivalent netbox change is done:  https://netbox.wikimedia.org/search/?q=octavia-lb-mgmt" [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[12:52:08] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:52:10] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146599 (https://phabricator.wikimedia.org/T394315) (owner: 10DDesouza)
[12:52:16] <logmsgbot>	 jhancock@cumin2002 netbox (PID 3706992) is awaiting input
[12:53:54] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host apus-be1004.eqiad.wmnet with OS bookworm
[12:54:04] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10826078 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host apus-be1004.eqiad.wmnet with OS bookworm executed with errors: - apus-be...
[12:55:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generic update - jhancock@cumin2002"
[12:55:09] <wikibugs>	 (03PS1) 10Clément Goubert: python-webapp: Include base.networkpolicy.egress.mariadb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146601
[12:55:09] <wikibugs>	 (03PS1) 10Clément Goubert: zarcillo: Fix ingress and egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146602
[12:55:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2070:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2070 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:55:58] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: generic update - jhancock@cumin2002"
[12:55:58] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:55:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman)
[12:57:33] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove krb1001 from the list of KDCs presented to clients [puppet] - 10https://gerrit.wikimedia.org/r/1146570 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[12:58:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826090 (10VRiley-WMF)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1300).
[13:00:05] <jouncebot>	 MichaelG_WMF: A patch you scheduled for UTC afternoon backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:47] <MichaelG_WMF>	 Hey hey :)
[13:05:24] <wikibugs>	 (03CR) 10Volans: "I left some suggestions inline that should simplify a bit the approach, but there is no blocker beside the limited check in case of passin" [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon)
[13:06:08] <wikibugs>	 (03PS2) 10Andrew Bogott: network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099)
[13:10:46] <wikibugs>	 (03PS3) 10Andrew Bogott: network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099)
[13:14:13] <Lucas_WMDE>	 o/
[13:14:17] <Lucas_WMDE>	 I can deploy!
[13:14:51] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye
[13:15:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10826146 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-wo...
[13:15:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145184 (https://phabricator.wikimedia.org/T392869) (owner: 10Urbanecm)
[13:15:31] <wikibugs>	 (03CR) 10Clément Goubert: "Without `startingDeadlineSeconds`, as I understand it, it'll "miss" a scheduling every 10s, so if the job overruns its next scheduling tim" [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[13:15:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove krb1001 from the list of KDCs presented to clients [puppet] - 10https://gerrit.wikimedia.org/r/1146570 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[13:16:12] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Lower geodns TTLs for dyna.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10826148 (10Vgutierrez) p:05Triage→03Medium
[13:16:33] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] eswiki: Bump mentorship to 70% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1145184 (https://phabricator.wikimedia.org/T392869) (owner: 10Urbanecm)
[13:16:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1145184|[Growth] eswiki: Bump mentorship to 70% of users (T392869)]]
[13:16:50] <stashbot>	 T392869: Incrementally increase mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T392869
[13:18:25] <wikibugs>	 (03CR) 10Hnowlan: "Fair point! Given the amount of nuance here I might just remove this setting for this job as part of this change for the time being. There" [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[13:18:33] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10826152 (10ssingh)
[13:18:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr2-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[13:19:52] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1072.eqiad.wmnet with OS bookworm
[13:19:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826158 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1072.eqiad.wmnet with OS bookworm
[13:20:47] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1074.eqiad.wmnet with OS bookworm
[13:20:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1074.eqiad.wmnet with OS bookworm
[13:22:41] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1071.eqiad.wmnet with OS bookworm
[13:22:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826169 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm
[13:22:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 urbanecm, lucaswerkmeister-wmde: Backport for [[gerrit:1145184|[Growth] eswiki: Bump mentorship to 70% of users (T392869)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:22:57] <stashbot>	 T392869: Incrementally increase mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T392869
[13:23:19] <Lucas_WMDE>	 MichaelG_WMF: please test :)
[13:23:39] <jinxer-wm>	 FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[13:23:46] <MichaelG_WMF>	 Lucas_WMDE: sorry, I missed your earlier message
[13:23:53] * MichaelG_WMF is looking
[13:24:26] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[13:24:35] <wikibugs>	 (03PS2) 10Klausman: preseed: Switch soon-to-arrive ML GPU hosts to using EFI [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948)
[13:24:43] <wikibugs>	 (03CR) 10Klausman: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman)
[13:25:20] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] network data: expand cloud-instances-octavia-lb-mgmt-net to v6 [puppet] - 10https://gerrit.wikimedia.org/r/1146598 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[13:26:01] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1073.eqiad.wmnet with OS bookworm
[13:26:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826186 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm
[13:26:39] <wikibugs>	 (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5557/co" [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman)
[13:27:40] <wikibugs>	 (03PS3) 10AOkoth: wmnet: create os-reports record [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794)
[13:29:19] <wikibugs>	 (03PS1) 10Andrew Bogott: Octavia: upgrade amphora boot/mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/1146622 (https://phabricator.wikimedia.org/T394099)
[13:29:38] <wikibugs>	 (03CR) 10Klausman: [V:03+1 C:03+2] preseed: Switch soon-to-arrive ML GPU hosts to using EFI [puppet] - 10https://gerrit.wikimedia.org/r/1146596 (https://phabricator.wikimedia.org/T393948) (owner: 10Klausman)
[13:30:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudgw: codfw: introduce support for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146515 (https://phabricator.wikimedia.org/T394099) (owner: 10Arturo Borrero Gonzalez)
[13:30:35] <MichaelG_WMF>	 @Lucas_WMDE not seeing any errors, though I got myself blocked trying to create an account on Spanish Wikipedia
[13:30:41] <Lucas_WMDE>	 :(
[13:30:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 urbanecm, lucaswerkmeister-wmde: Continuing with sync
[13:31:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Octavia: upgrade amphora boot/mgmt network [puppet] - 10https://gerrit.wikimedia.org/r/1146622 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[13:31:32] <MichaelG_WMF>	 this is not really something that I expect to fail, and it is just changing the percentage of new users that might get a mentor
[13:31:57] <Lucas_WMDE>	 yeah
[13:32:38] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] wmnet: create os-reports record [dns] - 10https://gerrit.wikimedia.org/r/1145191 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[13:32:50] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:33:46] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] add os-reports to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1145241 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[13:34:09] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] trafficserver: update os-reports replacment url [puppet] - 10https://gerrit.wikimedia.org/r/1145192 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[13:34:27] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] trafficserver: update os-reports replacment url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1145192 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[13:34:56] <logmsgbot>	 !log aokoth@dns1004 START - running authdns-update
[13:35:27] <wikibugs>	 (03PS1) 10Clément Goubert: mediawiki: Add mwcron.suspended_jobs list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409)
[13:35:28] <wikibugs>	 (03PS1) 10Clément Goubert: mw-cron: Suspend growthexperiments-listtaskcounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019)
[13:36:16] <logmsgbot>	 !log aokoth@dns1004 END - running authdns-update
[13:36:34] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1072.eqiad.wmnet with reason: host reimage
[13:36:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki: Add mwcron.suspended_jobs list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert)
[13:36:53] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1074.eqiad.wmnet with reason: host reimage
[13:37:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw-cron: Suspend growthexperiments-listtaskcounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019) (owner: 10Clément Goubert)
[13:37:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145184|[Growth] eswiki: Bump mentorship to 70% of users (T392869)]] (duration: 20m 39s)
[13:37:29] <stashbot>	 T392869: Incrementally increase mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T392869
[13:37:38] <wikibugs>	 (03PS1) 10Mhorsey: release CampaignEvents to cbk-zam wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146628 (https://phabricator.wikimedia.org/T393604)
[13:38:31] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:38:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:45] <Lucas_WMDE>	 also found a SpiderPig bug (T394411), yay
[13:38:45] <stashbot>	 T394411: “Show sensitive information” checkbox broken, suspends terminal - https://phabricator.wikimedia.org/T394411
[13:38:54] <wikibugs>	 (03PS2) 10Clément Goubert: mediawiki: Add mwcron.suspended_jobs list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409)
[13:38:54] <wikibugs>	 (03PS2) 10Clément Goubert: mw-cron: Suspend growthexperiments-listtaskcounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019)
[13:39:11] <wikibugs>	 (03PS1) 10Majavah: P:wmcs: cloudgw: Remove internet access for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099)
[13:39:57] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5559/co" [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah)
[13:40:08] <MichaelG_WMF>	 @Lucas_WMDE Thank you for running the window! 🙏
[13:40:22] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1072.eqiad.wmnet with reason: host reimage
[13:41:16] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5560/co" [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah)
[13:41:29] <Lucas_WMDE>	 np :)
[13:41:34] <wikibugs>	 (03PS2) 10Majavah: P:wmcs: cloudgw: Remove internet access for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099)
[13:42:45] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah)
[13:43:39] <jinxer-wm>	 FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[13:43:41] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1074.eqiad.wmnet with reason: host reimage
[13:44:43] <wikibugs>	 (03CR) 10Kamila Součková: "From k8s docs:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert)
[13:45:40] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1073.eqiad.wmnet with OS bookworm
[13:45:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826301 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm executed...
[13:46:18] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1073.eqiad.wmnet with OS bookworm
[13:46:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm
[13:46:34] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146628 (https://phabricator.wikimedia.org/T393604) (owner: 10Mhorsey)
[13:49:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826313 (10VRiley-WMF)
[13:49:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[13:50:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] P:wmcs: cloudgw: Remove internet access for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah)
[13:51:22] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs: cloudgw: Remove internet access for octavia-lb-mgmt-net [puppet] - 10https://gerrit.wikimedia.org/r/1146629 (https://phabricator.wikimedia.org/T394099) (owner: 10Majavah)
[13:52:14] <wikibugs>	 (03CR) 10Clément Goubert: "It's even worse than that, without `startingDeadlineSeconds`, if the CronJob has been suspended for more than 100 scheduled executions, th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert)
[13:54:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[13:56:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10826364 (10MoritzMuehlenhoff)
[13:57:03] <moritzm>	 !log installing openjdk-8 security updates
[13:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:16] <wikibugs>	 (03PS1) 10Jforrester: Merge remote-tracking branch 'origin/master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146631 (https://phabricator.wikimedia.org/T341775)
[13:58:04] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[13:58:34] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[13:58:35] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1072.eqiad.wmnet with OS bookworm
[13:58:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826396 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1072.eqiad.wmnet with OS bookworm completed...
[13:59:15] <wikibugs>	 (03PS1) 10Jforrester: Stabilization: convert deprecated Xml methods to Html [extensions/FlaggedRevs] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146634 (https://phabricator.wikimedia.org/T394403)
[14:00:22] <wikibugs>	 06SRE, 10Observability-Metrics: Rework the Pyrra list dashboard - https://phabricator.wikimedia.org/T394415 (10elukey) 03NEW
[14:01:21] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[14:01:35] <logmsgbot>	 !log pfischer@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[14:02:05] <wikibugs>	 06SRE, 10Observability-Metrics: Rework the Pyrra list dashboard - https://phabricator.wikimedia.org/T394415#10826456 (10elukey)
[14:03:26] <logmsgbot>	 !log pfischer@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:03:33] <logmsgbot>	 vriley@cumin1002 reimage (PID 2257605) is awaiting input
[14:04:09] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[14:04:10] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1074.eqiad.wmnet with OS bookworm
[14:04:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1074.eqiad.wmnet with OS bookworm completed...
[14:04:36] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1073.eqiad.wmnet with OS bookworm
[14:04:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826491 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm executed...
[14:05:32] <wikibugs>	 (03CR) 10BBlack: [C:03+1] "Seems low-risk and beneficial at this point!" [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh)
[14:05:44] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1073.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:06:03] <wikibugs>	 (03PS1) 10Andrew Bogott: Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099)
[14:06:25] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] templates: lower TTLs for dyna.wm.org and upload.wm.org to 240s [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh)
[14:06:32] <wikibugs>	 (03PS2) 10Andrew Bogott: Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099)
[14:06:35] <wikibugs>	 (03PS2) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 240s [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312)
[14:07:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:08:18] <wikibugs>	 (03PS1) 10Andrew Bogott: octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783)
[14:08:27] <wikibugs>	 (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh)
[14:08:43] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1071.eqiad.wmnet with OS bookworm
[14:08:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826511 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm executed...
[14:09:05] <wikibugs>	 (03PS3) 10Andrew Bogott: Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099)
[14:09:05] <wikibugs>	 (03PS2) 10Andrew Bogott: octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783)
[14:09:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[14:09:41] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudvirt1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:10:10] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:10:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[14:10:21] <wikibugs>	 (03PS3) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 240s [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312)
[14:10:57] <wikibugs>	 (03PS3) 10Andrew Bogott: octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783)
[14:11:28] <wikibugs>	 (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1146028 (https://phabricator.wikimedia.org/T394312) (owner: 10Ssingh)
[14:12:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Octavia: open firewall for amphora health checks [puppet] - 10https://gerrit.wikimedia.org/r/1146635 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:12:42] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] octavia: refresh services if config changes [puppet] - 10https://gerrit.wikimedia.org/r/1146636 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[14:12:48] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:13:25] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:13:31] <sukhe>	 !log finished running lowering of dyna/upload TTL to 240s: T394312
[14:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:36] <stashbot>	 T394312: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312
[14:17:32] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1073.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:17:59] <wikibugs>	 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10826585 (10elukey) 05Open→03Resolved a:03elukey I think that the purpose of this task is completed, We should follow up on the subtasks.
[14:18:18] <wikibugs>	 (03PS2) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103)
[14:18:47] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1073.eqiad.wmnet with OS bookworm
[14:18:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826598 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm
[14:21:25] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1071.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[14:21:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Bump the version numbers for Java images based on latest security releases [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146638
[14:22:00] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1071.eqiad.wmnet with OS bookworm
[14:22:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm
[14:24:32] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10826667 (10ssingh)
[14:24:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[14:24:44] <wikibugs>	 (03PS3) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103)
[14:24:51] <wikibugs>	 (03PS1) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099)
[14:25:47] <wikibugs>	 (03PS5) 10Eevans: cassandra: configurable local_system_data_file_directory [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544)
[14:26:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman)
[14:26:53] <wikibugs>	 (03PS2) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099)
[14:26:59] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:30:30] <wikibugs>	 (03PS3) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099)
[14:30:52] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:33:17] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: upggrade base image to include krenew [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146504 (https://phabricator.wikimedia.org/T394293) (owner: 10Brouberol)
[14:33:53] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1073.eqiad.wmnet with reason: host reimage
[14:34:08] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra: configurable local_system_data_file_directory [puppet] - 10https://gerrit.wikimedia.org/r/1146102 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[14:37:04] <wikibugs>	 (03PS4) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099)
[14:37:15] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1071.eqiad.wmnet with reason: host reimage
[14:37:16] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:37:58] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1073.eqiad.wmnet with reason: host reimage
[14:39:50] <wikibugs>	 (03PS5) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099)
[14:39:50] <logmsgbot>	 stevemunene@cumin1002 reimage (PID 2253166) is awaiting input
[14:39:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826758 (10VRiley-WMF)
[14:40:01] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:40:36] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1071.eqiad.wmnet with reason: host reimage
[14:42:08] <wikibugs>	 (03PS1) 10DCausse: Revert "Make weighted tags no longer be WMF-specific" [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146643
[14:42:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826770 (10VRiley-WMF)
[14:43:40] <wikibugs>	 (03PS4) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103)
[14:45:09] <wikibugs>	 (03CR) 10Jsn.sherman: "Thank you! I missed `euwiki` and also the whole `composer manage-dblist update` step." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman)
[14:46:39] <wikibugs>	 (03PS5) 10Jsn.sherman: Create dblist for ores extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103)
[14:47:20] <wikibugs>	 (03PS6) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099)
[14:47:29] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:47:35] <wikibugs>	 (03CR) 10Jsn.sherman: "...and I see fawiki was in fact enabled; nevermind!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143638 (https://phabricator.wikimedia.org/T391103) (owner: 10Jsn.sherman)
[14:48:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:48:41] <dcausse>	 jouncebot: nowandnext
[14:48:41] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 11 minute(s)
[14:48:41] <jouncebot>	 In 0 hour(s) and 11 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1500)
[14:49:06] <wikibugs>	 (03CR) 10CDobbins: "I just wanted to ask for additional clarification on this, since it's been a while and there's been no activity. While we could merge this" [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins)
[14:49:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[14:50:35] <wikibugs>	 (03PS7) 10Andrew Bogott: Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099)
[14:51:53] <wikibugs>	 (03PS1) 10Eevans: cassandra_dev: actually put system keyspaces on RAID [puppet] - 10https://gerrit.wikimedia.org/r/1146646 (https://phabricator.wikimedia.org/T391544)
[14:53:50] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:54:06] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[14:55:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10826849 (10klausman)
[14:56:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Octavia: allow amphora health checks over ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/1146641 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[14:57:12] <logmsgbot>	 vriley@cumin1002 reimage (PID 2265179) is awaiting input
[14:57:56] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Bump the version numbers for Java images based on latest security releases [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146638 (owner: 10Muehlenhoff)
[14:58:00] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[14:58:01] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1073.eqiad.wmnet with OS bookworm
[14:58:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826854 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm completed...
[14:58:37] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[14:58:51] <fabfur>	 !log disable puppet on A:cp to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144620 (T393927)
[14:58:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:54] <stashbot>	 T393927: Deploy geoip lookup script on 2 hosts - https://phabricator.wikimedia.org/T393927
[15:00:04] <jouncebot>	 jnuche and jeena: Time to snap out of that daydream and deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1500).
[15:00:35] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] Revert "Make weighted tags no longer be WMF-specific" [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146643 (owner: 10DCausse)
[15:01:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[15:01:46] <logmsgbot>	 vriley@cumin1002 reimage (PID 2265364) is awaiting input
[15:02:29] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[15:02:29] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1071.eqiad.wmnet with OS bookworm
[15:02:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm completed...
[15:02:49] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3073.esams.wmnet
[15:02:57] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3081.esams.wmnet
[15:03:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10826876 (10VRiley-WMF) 05Open→03Resolved
[15:03:48] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache: lua lookup experiment [puppet] - 10https://gerrit.wikimedia.org/r/1144620 (https://phabricator.wikimedia.org/T393927) (owner: 10Fabfur)
[15:04:37] <MichaelG_WMF>	 jnuche and jeena: would you be able to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1146643 as part of train-log-triage? this change in the train broke a lot of things
[15:05:25] <MichaelG_WMF>	 See https://phabricator.wikimedia.org/T394416 and conversation in #wikimedia-search for context
[15:06:06] * Lucas_WMDE is also around if needed
[15:06:09] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "Acknowledged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert)
[15:06:13] <jnuche>	 MichaelG_WMF: I can backport it
[15:06:35] <MichaelG_WMF>	 jnuche: thank you!
[15:06:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[15:06:43] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:07] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] "Thanks! I've created https://phabricator.wikimedia.org/T394423 for discussion of `startingDeadlineSeconds`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert)
[15:08:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy1003 using scap backport" [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146643 (owner: 10DCausse)
[15:08:53] <dcausse>	 MichaelG_WMF, jnuche thanks for backporting this!
[15:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Make weighted tags no longer be WMF-specific" [extensions/CirrusSearch] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146643 (owner: 10DCausse)
[15:09:58] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Add mwcron.suspended_jobs list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146626 (https://phabricator.wikimedia.org/T394409) (owner: 10Clément Goubert)
[15:10:04] <logmsgbot>	 !log jnuche@deploy1003 Started scap sync-world: Backport for [[gerrit:1146643|Revert "Make weighted tags no longer be WMF-specific"]]
[15:15:00] <logmsgbot>	 !log jnuche@deploy1003 dcausse, jnuche: Backport for [[gerrit:1146643|Revert "Make weighted tags no longer be WMF-specific"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:15:00] <jnuche>	 MichaelG_WMF, dcausse: the revert is on the test servers, is it possible for you to check there that the problem is gone? otherwise I'm fine with continuing the backport
[15:15:01] <MichaelG_WMF>	 jnuche: yes, it should be possible to check, one moment
[15:15:03] <stephanebisson>	 I confirm it works on the test servers, at least for my use cases
[15:15:11] <dcausse>	 jnuche: all good
[15:15:14] <jnuche>	 ty
[15:15:40] <logmsgbot>	 !log jnuche@deploy1003 dcausse, jnuche: Continuing with sync
[15:15:42] * MichaelG_WMF jnuche: looks good from my side too!
[15:38:45] <mszabo>	 jouncebot: nowandnext
[15:38:45] <jouncebot>	 For the next 0 hour(s) and 21 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1500)
[15:38:45] <jouncebot>	 In 0 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1600)
[15:39:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[15:40:34] <fabfur>	 !log reenabling puppet on A:cp (T393927)
[15:40:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:34] <stashbot>	 T393927: Deploy geoip lookup script on 2 hosts - https://phabricator.wikimedia.org/T393927
[15:41:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:44:03] <wikibugs>	 (03PS1) 10Majavah: ssh: Do not shell out for root SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/1146661 (https://phabricator.wikimedia.org/T394283)
[15:45:13] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra_dev: actually put system keyspaces on RAID [puppet] - 10https://gerrit.wikimedia.org/r/1146646 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[15:45:13] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_esams - <bound method SREBatchRunnerBase._reason of <cookbooks.sre.cdn.roll-upgrade-varnish.RollUpgradeVarnishRunner object at 0x7fd386623c70>>
[15:45:18] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_esams - <bound method SREBatchRunnerBase._reason of <cookbooks.sre.cdn.roll-upgrade-varnish.RollUpgradeVarnishRunner object at 0x7f58099c1b50>>
[15:48:33] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.168.0" for 2 host(s)
[15:48:49] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] admin: SSH key rotation for cmassaro [puppet] - 10https://gerrit.wikimedia.org/r/1146033 (https://phabricator.wikimedia.org/T393140) (owner: 10BCornwall)
[15:49:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[15:49:59] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough
[15:50:11] <wikibugs>	 (03PS1) 10Cathal Mooney: Add EBGP between codfw row A-D spines and row E/F spines [homer/public] - 10https://gerrit.wikimedia.org/r/1146662 (https://phabricator.wikimedia.org/T394021)
[15:50:21] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.168.0" completed for 2 hosts
[15:50:51] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update SSH key for apine - https://phabricator.wikimedia.org/T393140#10827128 (10BCornwall) 05In progress→03Resolved Hi, @cmassaro! Your key has been rotated. Feel free to re-open if anything was missed.  Thank you!
[15:51:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[23] - https://phabricator.wikimedia.org/T393948#10827139 (10RobH)
[15:52:41] <jinxer-wm>	 RESOLVED: [2x] ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:53:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10827152 (10RobH)
[15:53:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10827161 (10RobH)
[15:54:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[15:55:11] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3081.esams.wmnet
[15:55:16] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet
[15:56:17] <mszabo>	 !log Starting patch deployment for T394393
[15:56:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:33] <wikibugs>	 (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5569/console" [puppet] - 10https://gerrit.wikimedia.org/r/1145949 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz)
[15:58:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: lsw1-c6-codfw: PEM 0 Not Powered - https://phabricator.wikimedia.org/T394261#10827184 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[15:59:17] <wikibugs>	 (03CR) 10Scott French: "No objections in principle, though this needs rebased to reflect I7e9c97537327a4de42a0d8013971beec4da6cb83 and may benefit from tuning the" [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[16:00:05] <jouncebot>	 jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1600).
[16:00:05] <jouncebot>	 Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:49] <rzl>	 Dreamy_Jazz: o/ just running PCC and then I can merge
[16:00:53] <Dreamy_Jazz>	 Thanks!
[16:00:55] <rzl>	 doesn't look like you'll need to test anything?
[16:01:35] <Dreamy_Jazz>	 The only thing I'll be able to test is that when it's been deployed we stop seeing LogicException errors being thrown on beta logstash.
[16:01:45] <rzl>	 👍
[16:01:57] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:02:36] <Dreamy_Jazz>	 The logs I'll be checking are at https://beta-logs.wmcloud.org/goto/7dacc8512956b14f79255206bf05187e
[16:03:01] <wikibugs>	 (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5570/console" [puppet] - 10https://gerrit.wikimedia.org/r/1145949 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz)
[16:03:03] <rzl>	 PCC noops on mwmaint (for the old-timey systemd job) and deploy (for the sparkly new k8s one), going ahead
[16:03:12] <wikibugs>	 (03CR) 10RLazarus: [V:03+1 C:03+2] MediaModeration: Only running scanning scripts on production [puppet] - 10https://gerrit.wikimedia.org/r/1145949 (https://phabricator.wikimedia.org/T394299) (owner: 10Dreamy Jazz)
[16:03:19] <wikibugs>	 (03PS1) 10SBassett: Update deployment image for security.wikimedia.org site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146665 (https://phabricator.wikimedia.org/T392098)
[16:04:41] <rzl>	 Dreamy_Jazz: merged and deployed to prod puppetmasters, I haven't touched anything in beta but feel free to take it from there
[16:04:57] <rzl>	 thank you for flying puppet request window, please ensure you have all your personal belongings and watch your step as you exit
[16:04:59] <wikibugs>	 (03PS2) 10SBassett: Update deployment image for security.wikimedia.org site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146665 (https://phabricator.wikimedia.org/T392098)
[16:05:07] <Dreamy_Jazz>	 Is there anything I'd need to deploy on the beta wikis specifically?
[16:05:15] <Dreamy_Jazz>	 Or is it an automatic thing?
[16:05:22] <wikibugs>	 (03PS1) 10Fabfur: Revert "cache: lua lookup experiment" [puppet] - 10https://gerrit.wikimedia.org/r/1146667
[16:05:26] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thank you both. Moving ahead with plumbing this in, since we'll presumably want to use it in some form anyway, but without using it quite " [puppet] - 10https://gerrit.wikimedia.org/r/1146010 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[16:05:48] <wikibugs>	 (03PS2) 10Fabfur: Revert "cache: lua lookup experiment" [puppet] - 10https://gerrit.wikimedia.org/r/1146667
[16:06:23] <rzl>	 it should happen automatically, I don't know the exact timing; in prod I would say wait 30 minutes max
[16:06:33] <Dreamy_Jazz>	 Thanks!
[16:06:47] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate growthexperiments-updateIsActiveFlagForMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146566 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková)
[16:07:48] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] graphite: remove access to port 2003 tcp/udp [puppet] - 10https://gerrit.wikimedia.org/r/1144555 (https://phabricator.wikimedia.org/T228380) (owner: 10Filippo Giunchedi)
[16:07:54] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough
[16:07:58] <logmsgbot>	 !log mszabo Deployed security patch for T394393
[16:08:14] <wikibugs>	 (03PS1) 10Hnowlan: sre:api-gateway: bump alerting threshold for elevated error [alerts] - 10https://gerrit.wikimedia.org/r/1146668
[16:08:30] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate growthexperiments-refreshPraiseworthyMentees [puppet] - 10https://gerrit.wikimedia.org/r/1146569 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková)
[16:09:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre:api-gateway: bump alerting threshold for elevated error [alerts] - 10https://gerrit.wikimedia.org/r/1146668 (owner: 10Hnowlan)
[16:10:06] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] Revert "cache: lua lookup experiment" [puppet] - 10https://gerrit.wikimedia.org/r/1146667 (owner: 10Fabfur)
[16:11:30] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211-0-5] - https://phabricator.wikimedia.org/T394432#10827258 (10RobH) p:05Triage→03Medium
[16:12:54] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211-0-5] - https://phabricator.wikimedia.org/T394432#10827263 (10RobH) a:05RobH→03bking @RKemper or @bking: Can you advise which of these cirrussearch hosts would be most easily put into maint/offline...
[16:12:56] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211-0-5] - https://phabricator.wikimedia.org/T394432#10827265 (10RobH)
[16:13:13] <logmsgbot>	 !log mszabo Deployed security patch for T394393
[16:13:17] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10827266 (10RobH)
[16:14:04] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3073.esams.wmnet
[16:14:11] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3081.esams.wmnet
[16:14:16] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice, thanks." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146638 (owner: 10Muehlenhoff)
[16:14:28] <wikibugs>	 (03CR) 10SBassett: [C:03+2] Update deployment image for security.wikimedia.org site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146665 (https://phabricator.wikimedia.org/T392098) (owner: 10SBassett)
[16:14:35] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10827267 (10RobH) I've created sub-task T392935 to track cirrussearch maint windows, likely should have just done that to start but was hoping one was just easily kicked offline for test...
[16:16:10] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: create partition for ml logs [puppet] - 10https://gerrit.wikimedia.org/r/1145339 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite)
[16:16:12] <wikibugs>	 (03Merged) 10jenkins-bot: Update deployment image for security.wikimedia.org site [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146665 (https://phabricator.wikimedia.org/T392098) (owner: 10SBassett)
[16:16:32] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3081.esams.wmnet
[16:16:37] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3073.esams.wmnet
[16:17:04] <logmsgbot>	 !log sbassett@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:17:25] <wikibugs>	 (03PS8) 10Brouberol: airflow: prevent resource name collisions when multiple releases are installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145200 (https://phabricator.wikimedia.org/T393999)
[16:17:25] <wikibugs>	 (03PS1) 10Brouberol: airflow: use the devenv.db.name in the PG URI instead of /app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146670 (https://phabricator.wikimedia.org/T393999)
[16:17:25] <wikibugs>	 (03PS1) 10Brouberol: airflow: rely on krenew instead of 'airflow kerberos' to renew the kerberos ticket [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146671 (https://phabricator.wikimedia.org/T393999)
[16:17:28] <wikibugs>	 (03PS1) 10Brouberol: airflow: define an airflow-dev values file, containing the devenv default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146672 (https://phabricator.wikimedia.org/T393999)
[16:17:31] <wikibugs>	 (03PS1) 10Brouberol: airflow: don't define OAUTH-related configs in devenvs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146673 (https://phabricator.wikimedia.org/T393999)
[16:17:32] <wikibugs>	 (03PS1) 10Brouberol: airflow: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146674 (https://phabricator.wikimedia.org/T393999)
[16:18:09] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "This indeed does what it says on the tin, so +1 in that regard. As discussed elsewhere, we'll want to wait until the `startingDeadlineSeco" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146627 (https://phabricator.wikimedia.org/T394019) (owner: 10Clément Goubert)
[16:19:51] <wikibugs>	 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Data-Platform-SRE, 06Discovery-Search: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10827290 (10bd808)
[16:20:29] <wikibugs>	 07Puppet, 10Beta-Cluster-Infrastructure, 10CirrusSearch, 06Data-Platform-SRE, 06Discovery-Search: Puppet failing on deployment-cirrussearch{12,13,14}.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T393924#10827295 (10bd808) >>! In T394430#10827113, @dancy wrote: > The failing...
[16:26:20] <sbassett>	 Trying to run `helmfile -e staging -i apply --context 5` for miscweb, but it seems to be hanging on research-landing-page. Should I just Ctrl+Z?
[16:27:13] <logmsgbot>	 !log sbassett@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:27:15] <logmsgbot>	 !log sbassett@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:27:26] <sbassett>	 heh, n/m
[16:31:57] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release miscweb/design-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:34:25] <wikibugs>	 (03PS1) 10Bvibber: Enable Chart extension on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518)
[16:35:27] <wikibugs>	 (03CR) 10Scott French: "Thanks, Dan! I think this looks good, aside from the missing bullseye update." [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) (owner: 10Dduvall)
[16:35:30] <topranks>	 !log add bgp peerings from codfw row A-D switches to new spines in rows E/F T394021
[16:35:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:34] <stashbot>	 T394021: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021
[16:35:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518) (owner: 10Bvibber)
[16:36:49] <sbassett>	 !log helmfile [staging] HALTED helmfile.d/services/miscweb: apply
[16:36:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:48] <wikibugs>	 (03PS1) 10Andrew Bogott: Dynamic proxy: pin python3-flask package [puppet] - 10https://gerrit.wikimedia.org/r/1146680
[16:38:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Dynamic proxy: pin python3-flask package [puppet] - 10https://gerrit.wikimedia.org/r/1146680 (owner: 10Andrew Bogott)
[16:39:17] <logmsgbot>	 herron@cumin1002 roll-restart-reboot-brokers (PID 2287058) is awaiting input
[16:39:32] <wikibugs>	 (03PS2) 10Andrew Bogott: Dynamic proxy: pin python3-flask package [puppet] - 10https://gerrit.wikimedia.org/r/1146680
[16:40:14] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad
[16:40:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (10.192.253.193) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-a1-codfw:9804&var-bgp_group=core&var-bgp_neighbor=ssw1-e1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:41:40] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[16:44:17] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Dynamic proxy: pin python3-flask package [puppet] - 10https://gerrit.wikimedia.org/r/1146680 (owner: 10Andrew Bogott)
[16:45:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (10.192.253.193) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=ssw1-a1-codfw:9804&var-bgp_group=core&var-bgp_neighbor=ssw1-e1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:46:57] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:48:49] <wikibugs>	 (03PS3) 10Dduvall: aptrepo: Provide thirdparty/docker component with upstream packages [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526)
[16:49:46] <wikibugs>	 (03CR) 10Dduvall: "Thanks for the review, Scott!" [puppet] - 10https://gerrit.wikimedia.org/r/1146091 (https://phabricator.wikimedia.org/T392526) (owner: 10Dduvall)
[16:52:08] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[17:00:04] <jouncebot>	 bd808: It is that lovely time of the day again! You are hereby commanded to deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1700)
[17:02:16] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-05-15-122256-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146686
[17:02:51] <wikibugs>	 (03PS2) 10Hnowlan: sre:api-gateway: bump alerting threshold for elevated error [alerts] - 10https://gerrit.wikimedia.org/r/1146668
[17:03:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:04:23] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-05-15-122256-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146686 (owner: 10BryanDavis)
[17:04:57] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad
[17:05:50] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-05-15-122256-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146686 (owner: 10BryanDavis)
[17:08:59] <wikibugs>	 (03PS2) 10Cathal Mooney: Add EBGP between codfw row A-D spines and row E/F spines [homer/public] - 10https://gerrit.wikimedia.org/r/1146662 (https://phabricator.wikimedia.org/T394021)
[17:09:01] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:10:29] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:10:41] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:11:15] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:11:28] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:12:01] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:12:08] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:12:40] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:15:15] <bd808>	 developer portal looks good. There were some helm changes from T391333 that rode along with the container update I was intending to push.
[17:15:16] <stashbot>	 T391333: Revisit default envoy histogram buckets - https://phabricator.wikimedia.org/T391333
[17:15:43] * bd808 is done with deploying during this window.
[17:20:32] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add EBGP between codfw row A-D spines and row E/F spines [homer/public] - 10https://gerrit.wikimedia.org/r/1146662 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[17:21:04] <wikibugs>	 (03Merged) 10jenkins-bot: Add EBGP between codfw row A-D spines and row E/F spines [homer/public] - 10https://gerrit.wikimedia.org/r/1146662 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[17:22:16] <wikibugs>	 (03CR) 10Dwisehaupt: [C:03+1] "This change looks ok from a notification standpoint. There is a concern about sending alerts to us that are not directly actionable by us" [puppet] - 10https://gerrit.wikimedia.org/r/1145169 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi)
[17:23:16] <topranks>	 !log add remaining bgp peerings from codfw row A-D switches to new spines in rows E/F T394021
[17:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:22] <stashbot>	 T394021: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021
[17:24:25] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw
[17:32:50] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:41:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[17:43:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[17:46:35] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw
[17:46:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[17:50:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:55:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:58:37] <wikibugs>	 (03PS1) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783)
[17:59:16] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[17:59:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[18:00:05] <jouncebot>	 jnuche and jeena: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T1800).
[18:00:47] <wikibugs>	 (03PS2) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783)
[18:01:46] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[18:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:07:19] <wikibugs>	 (03PS2) 10Brouberol: airflow: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146674 (https://phabricator.wikimedia.org/T393999)
[18:07:20] <wikibugs>	 (03PS1) 10Brouberol: airflow: include an ENVOY_SERVICE_NAME env var pointing to the envoy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146693 (https://phabricator.wikimedia.org/T393999)
[18:10:12] <wikibugs>	 (03PS3) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783)
[18:12:35] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[18:14:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1146027 (https://phabricator.wikimedia.org/T394308) (owner: 10BCornwall)
[18:14:43] <wikibugs>	 (03PS4) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783)
[18:14:48] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[18:19:40] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] admin: Add jtweed to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1146027 (https://phabricator.wikimedia.org/T394308) (owner: 10BCornwall)
[18:20:54] <wikibugs>	 (03PS5) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783)
[18:21:04] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[18:21:07] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users for Jonathan Tweed - https://phabricator.wikimedia.org/T394308#10827849 (10BCornwall) 05In progress→03Resolved This access has been granted. It'll be up to an hour before it will be...
[18:23:15] <wikibugs>	 (03PS6) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783)
[18:23:22] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[18:25:22] <wikibugs>	 (03PS1) 10TChin: [eventgate-analytics-external] bump version v1.13.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146695 (https://phabricator.wikimedia.org/T391959)
[18:25:41] <wikibugs>	 (03PS7) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783)
[18:25:50] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[18:27:41] <wikibugs>	 (03CR) 10Dr0ptp4kt: [C:03+2] [eventgate-analytics-external] bump version v1.13.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146695 (https://phabricator.wikimedia.org/T391959) (owner: 10TChin)
[18:28:09] <wikibugs>	 (03PS8) 10Andrew Bogott: Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783)
[18:29:02] <wikibugs>	 (03Merged) 10jenkins-bot: [eventgate-analytics-external] bump version v1.13.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146695 (https://phabricator.wikimedia.org/T391959) (owner: 10TChin)
[18:32:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Octavia: new attempt at health check ports & ips [puppet] - 10https://gerrit.wikimedia.org/r/1146689 (https://phabricator.wikimedia.org/T393783) (owner: 10Andrew Bogott)
[18:33:53] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10827898 (10BCornwall) 05Open→03In progress p:05Triage→03Medium a:03WMDECyn
[18:34:27] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10827905 (10BCornwall) L3/NDA is indeed valid, but the approval needs to happen still. @WMDECyn, Can you please comment here with your approva...
[18:34:51] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[18:35:14] <logmsgbot>	 !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[18:36:13] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[18:36:14] <wikibugs>	 06SRE, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10827909 (10BCornwall)
[18:36:55] <logmsgbot>	 !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[18:40:54] <wikibugs>	 (03PS2) 10Bvibber: Enable Chart extension on phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518)
[18:40:59] <logmsgbot>	 !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[18:41:43] <logmsgbot>	 !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[18:49:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_esams
[18:53:35] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_esams
[18:55:43] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10827952 (10Jclark-ctr) @MatthewVernon  The BOSS card did not appear in the boot order initially. Under NVMe settings, I changed the BIOS NVMe Driver setting to "All Drives" inst...
[18:55:52] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqsin
[18:55:53] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqsin
[18:58:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:00:35] <wikibugs>	 (03PS1) 10Eevans: cassandra: create storage directory for local keyspaces [puppet] - 10https://gerrit.wikimedia.org/r/1146705 (https://phabricator.wikimedia.org/T391544)
[19:01:30] <wikibugs>	 (03PS2) 10Eevans: cassandra: create storage directory for local keyspaces [puppet] - 10https://gerrit.wikimedia.org/r/1146705 (https://phabricator.wikimedia.org/T391544)
[19:03:07] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146705 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[19:03:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:06:28] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.168.1" for 2 host(s)
[19:06:30] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra: create storage directory for local keyspaces [puppet] - 10https://gerrit.wikimedia.org/r/1146705 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[19:08:15] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.168.1" completed for 2 hosts
[19:11:56] <wikibugs>	 (03PS1) 10Andrew Bogott: Octavia health checks: open firewall to UDP [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099)
[19:12:24] <wikibugs>	 (03PS2) 10Andrew Bogott: Octavia health checks: open firewall to UDP [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099)
[19:13:09] <wikibugs>	 (03PS3) 10Andrew Bogott: Octavia health checks: open firewall to UDP [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099)
[19:13:19] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[19:13:26] <wikibugs>	 10ops-esams, 06SRE, 06DC-Ops, 06Traffic: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10827984 (10BCornwall) 05Open→03Resolved I'm not seeing any errors in the kernel log, anomalies in the graphs, or outputs in `getsel`. I'll go ahead and resolve this. Thanks...
[19:16:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Octavia health checks: open firewall to UDP [puppet] - 10https://gerrit.wikimedia.org/r/1146708 (https://phabricator.wikimedia.org/T394099) (owner: 10Andrew Bogott)
[19:18:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10827991 (10BCornwall)
[19:24:00] <wikibugs>	 (03PS3) 10LD: frwiki: Enable the NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199)
[19:25:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[19:28:31] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828031 (10Eevans) cassandra-dev2001 has been reimaged and configured for JBOD.  I used the following script to setup the addit...
[19:29:16] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:30:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[19:31:39] <wikibugs>	 (03CR) 10Pppery: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: 10LD)
[19:32:12] <wikibugs>	 (03CR) 10Pppery: frwiki: Enable the NewUserMessage extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: 10LD)
[19:34:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:38:20] <wikibugs>	 (03PS4) 10LD: frwiki: Enable the NewUserMessage extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199)
[19:46:16] <jinxer-wm>	 FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:50:02] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: LD)
[19:51:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:52:42] <LD>	 Jenkins might need to recheck 1146707.
[19:54:03] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye
[19:54:14] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828103 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host...
[19:54:15] <wikibugs>	 SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 3 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10828104 (ArthurPSmith) Confirming this works for me now - https://www.wikidata....
[19:54:39] <wikibugs>	 (CR) AntiCompositeNumber: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: LD)
[19:55:32] <LD>	 thanks AntiComposite ;)
[19:55:35] <AntiComposite>	 np
[19:59:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T2000).
[20:00:05] <jouncebot>	 danisztls, bvibber, and LD: A patch you scheduled for UTC late backport window / Backport Party! Members of Release Engineering will be in #wikimedia-operations connect to share the joy of SpiderPig is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:10] <bvibber>	 o/
[20:00:23] <thcipriani>	 ohai
[20:00:38] <LD>	 hi \O/
[20:00:56] <brennen>	 o/
[20:00:59] <thcipriani>	 we are deployment partying :)
[20:01:57] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:02:24] <thcipriani>	 looks like we're missing a danisztls. bvibber: you up for spiderpigging your change?
[20:02:33] <bvibber>	 sure :D
[20:02:51] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518) (owner: Bvibber)
[20:02:58] * thcipriani watches :)
[20:03:17] <bvibber>	 i love how there's a link right from the deployment schedule to spiderpig :D
[20:03:23] <wikibugs>	 (CR) Dreamy Jazz: "(I'm guessing this needs updating given that it depends on an abandoned patch)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: Máté Szabó)
[20:03:46] <wikibugs>	 (Merged) jenkins-bot: Enable Chart extension on phase 2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1146679 (https://phabricator.wikimedia.org/T393518) (owner: Bvibber)
[20:04:02] <logmsgbot>	 !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1146679|Enable Chart extension on phase 2 wikis (T393518)]]
[20:04:06] <stashbot>	 T393518: Enable Charts for Phase 2 wikis - https://phabricator.wikimedia.org/T393518
[20:04:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:05:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:05:45] <thcipriani>	 bvibber: blame bd808 and dancy for the link, soon: deploying a bunch together!
[20:07:56] <LD>	 btw bvibber I've heard that fr wiktionary was interested in having Chart extension. Do you think it could be ok? if so I'll open a ticket later ;)
[20:08:25] <bvibber>	 sure open a ticket and you can jump the line :)
[20:08:37] <bvibber>	 non-wikipedias will be phase 4 rollout
[20:08:45] <bvibber>	 or if someone asks
[20:09:48] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[20:09:56] <logmsgbot>	 !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1146679|Enable Chart extension on phase 2 wikis (T393518)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:09:58] <bvibber>	 testing...
[20:09:59] <stashbot>	 T393518: Enable Charts for Phase 2 wikis - https://phabricator.wikimedia.org/T393518
[20:10:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:10:16] <LD>	 phase 4 is TBD :')
[20:10:33] <bvibber>	 looks good
[20:10:39] <logmsgbot>	 !log bvibber@deploy1003 bvibber: Continuing with sync
[20:13:11] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[20:17:18] <logmsgbot>	 !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146679|Enable Chart extension on phase 2 wikis (T393518)]] (duration: 13m 15s)
[20:17:21] <stashbot>	 T393518: Enable Charts for Phase 2 wikis - https://phabricator.wikimedia.org/T393518
[20:17:28] <bvibber>	 finished!
[20:17:45] <bvibber>	 shall i do the other two or someone else want to take those?
[20:17:58] <thcipriani>	 awesome thanks bvibber I can take others
[20:18:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:18:24] <bvibber>	 ok :D
[20:18:38] <thcipriani>	 danisztls: I think I saw you enter chat, ready for your patch?
[20:19:02] <danisztls>	 thcipriani: yes
[20:19:12] <thcipriani>	 cool, I'll get that going.
[20:20:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by thcipriani@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1146599 (https://phabricator.wikimedia.org/T394315) (owner: DDesouza)
[20:21:18] <wikibugs>	 (Merged) jenkins-bot: Design Research participant recruitment survey on eswiki: Deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1146599 (https://phabricator.wikimedia.org/T394315) (owner: DDesouza)
[20:21:32] <logmsgbot>	 !log thcipriani@deploy1003 Started scap sync-world: Backport for [[gerrit:1146599|Design Research participant recruitment survey on eswiki: Deploy (T394315)]]
[20:21:36] <stashbot>	 T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315
[20:22:40] <wikibugs>	 (PS1) Eevans: cassandra-dev2002: configure for JBOD [puppet] - https://gerrit.wikimedia.org/r/1146723 (https://phabricator.wikimedia.org/T391544)
[20:23:16] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:23:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[20:24:09] <wikibugs>	 (CR) Eevans: [C:+2] cassandra-dev2002: configure for JBOD [puppet] - https://gerrit.wikimedia.org/r/1146723 (https://phabricator.wikimedia.org/T391544) (owner: Eevans)
[20:25:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:26:36] <danisztls>	 thcipriani: as my patch only increases the survey coverage I don't see a practicable way to test it
[20:26:59] <thcipriani>	 danisztls: ack, I'll send it on once it prompts me
[20:27:04] <logmsgbot>	 !log thcipriani@deploy1003 thcipriani, dani: Backport for [[gerrit:1146599|Design Research participant recruitment survey on eswiki: Deploy (T394315)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:27:07] <stashbot>	 T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315
[20:27:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[20:28:37] <logmsgbot>	 !log thcipriani@deploy1003 thcipriani, dani: Continuing with sync
[20:29:01] <thcipriani>	 ^ danisztls going live everywhere now
[20:30:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:31:57] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release miscweb/design-landing-page on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=miscweb - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[20:32:08] <wikibugs>	 (PS1) Greg Grossmeier: admin: update gjg's production ssh key [puppet] - https://gerrit.wikimedia.org/r/1146725
[20:32:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:33:37] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS bullseye
[20:33:44] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828175 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cass...
[20:35:18] <logmsgbot>	 !log thcipriani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146599|Design Research participant recruitment survey on eswiki: Deploy (T394315)]] (duration: 13m 46s)
[20:35:23] <stashbot>	 T394315: ES.wiki QuickSurvey request for DR participant recruitment - https://phabricator.wikimedia.org/T394315
[20:35:59] <thcipriani>	 LD: you're up!
[20:36:07] <LD>	 lets go :p
[20:36:47] <LD>	 as with the previous patch, it can't really be tested, that's config stuff
[20:37:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:37:50] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by thcipriani@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: LD)
[20:38:39] <wikibugs>	 (Merged) jenkins-bot: frwiki: Enable the NewUserMessage extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: LD)
[20:38:51] <logmsgbot>	 !log thcipriani@deploy1003 Started scap sync-world: Backport for [[gerrit:1146707|frwiki: Enable the NewUserMessage extension (T382199)]]
[20:38:55] <stashbot>	 T382199: Enable Extension NewUserMessage on fr.wikipedia - https://phabricator.wikimedia.org/T382199
[20:39:16] <jinxer-wm>	 FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:39:16] <LD>	 thanks for the party!
[20:40:17] <thcipriani>	 LD: there's no party like a deployment party :)
[20:41:40] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:42:09] <p858snake|cloud>	 I thought an S Club 7 party was the superior party?
[20:44:05] <wikibugs>	 (PS1) Jdrewniak: styles: Set override also to former value of `line-height-small` token [skins/Vector] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1146726 (https://phabricator.wikimedia.org/T389900)
[20:44:16] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:44:45] <logmsgbot>	 !log thcipriani@deploy1003 thcipriani, wpld: Backport for [[gerrit:1146707|frwiki: Enable the NewUserMessage extension (T382199)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:44:49] <stashbot>	 T382199: Enable Extension NewUserMessage on fr.wikipedia - https://phabricator.wikimedia.org/T382199
[20:45:44] <thcipriani>	 p858snake|cloud: lies
[20:45:46] <thcipriani>	 :)
[20:46:21] <thcipriani>	 LD: your change is up on test wikis, I can confirm using WikimediaDebug that I now see the extension in Special:Version
[20:46:27] <thcipriani>	 anything else to test?
[20:46:38] <LD>	 not really :')
[20:46:53] <thcipriani>	 okie doke, going live for realz
[20:47:02] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:47:02] <logmsgbot>	 !log thcipriani@deploy1003 thcipriani, wpld: Continuing with sync
[20:49:16] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:52:08] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:52:17] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops: Dell SSD Critical Firmware Update - https://phabricator.wikimedia.org/T394348#10828215 (RobH)
[20:53:36] <logmsgbot>	 !log thcipriani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146707|frwiki: Enable the NewUserMessage extension (T382199)]] (duration: 14m 44s)
[20:53:40] <stashbot>	 T382199: Enable Extension NewUserMessage on fr.wikipedia - https://phabricator.wikimedia.org/T382199
[20:53:47] <thcipriani>	 ^ LD all done!
[20:54:06] <LD>	 thanks again for the party :)
[20:54:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:54:16] <thcipriani>	 thanks for attending!
[20:54:33] <thcipriani>	 </party><normal time>
[20:55:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:58:35] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [skins/Vector] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1146726 (https://phabricator.wikimedia.org/T389900) (owner: Jdrewniak)
[20:58:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[21:00:03] <wikibugs>	 (Merged) jenkins-bot: styles: Set override also to former value of `line-height-small` token [skins/Vector] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1146726 (https://phabricator.wikimedia.org/T389900) (owner: Jdrewniak)
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250515T2100)
[21:00:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:00:32] <logmsgbot>	 !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1146726|styles: Set override also to former value of `line-height-small` token (T389900 T394305)]]
[21:00:36] <stashbot>	 T389900: Font modes: Resolve line-height token discrepancies downstream - https://phabricator.wikimedia.org/T389900
[21:00:36] <stashbot>	 T394305: 1.45.0-wmf.1: When setting font size to "small", line-height is absolute, making lines with larger font-size cramped - https://phabricator.wikimedia.org/T394305
[21:03:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:04:04] <wikibugs>	 (PS1) Cwhite: logstash: nest curator configuration to support multiple jobs [puppet] - https://gerrit.wikimedia.org/r/1146728 (https://phabricator.wikimedia.org/T377018)
[21:04:19] <wikibugs>	 (CR) BryanDavis: [C:+1] Do not show thumbnails or descriptions on Wikitech search [mediawiki-config] - https://gerrit.wikimedia.org/r/1146491 (owner: Majavah)
[21:06:09] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Backport for [[gerrit:1146726|styles: Set override also to former value of `line-height-small` token (T389900 T394305)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:06:13] <stashbot>	 T389900: Font modes: Resolve line-height token discrepancies downstream - https://phabricator.wikimedia.org/T389900
[21:06:13] <stashbot>	 T394305: 1.45.0-wmf.1: When setting font size to "small", line-height is absolute, making lines with larger font-size cramped - https://phabricator.wikimedia.org/T394305
[21:06:46] <wikibugs>	 (PS2) Cwhite: logstash: nest curator configuration to support multiple jobs [puppet] - https://gerrit.wikimedia.org/r/1146728 (https://phabricator.wikimedia.org/T377018)
[21:07:55] <wikibugs>	 (PS1) Clare Ming: Experimentation Lab: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1146729
[21:08:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:09:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:09:44] <wikibugs>	 (CR) Dr0ptp4kt: [C:+2] Experimentation Lab: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1146729 (owner: Clare Ming)
[21:10:47] <wikibugs>	 (PS1) Eevans: cassandra-dev2002: use custom d-i preseed (JBOD) [puppet] - https://gerrit.wikimedia.org/r/1146730 (https://phabricator.wikimedia.org/T391544)
[21:11:05] <wikibugs>	 (Merged) jenkins-bot: Experimentation Lab: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1146729 (owner: Clare Ming)
[21:12:40] <logmsgbot>	 !log jdrewniak@deploy1003 jdrewniak: Continuing with sync
[21:12:59] <wikibugs>	 (CR) Cwhite: [C:+2] "PCC OK: no changes to host https://puppet-compiler.wmflabs.org/output/1146728/5571/" [puppet] - https://gerrit.wikimedia.org/r/1146728 (https://phabricator.wikimedia.org/T377018) (owner: Cwhite)
[21:13:55] <wikibugs>	 (CR) Eevans: [C:+2] cassandra-dev2002: use custom d-i preseed (JBOD) [puppet] - https://gerrit.wikimedia.org/r/1146730 (https://phabricator.wikimedia.org/T391544) (owner: Eevans)
[21:14:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:16:07] <wikibugs>	 (PS1) Clare Ming: Experimentation Lab: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1146731
[21:16:14] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye
[21:16:27] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828298 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host cassandra-dev2002....
[21:18:09] <wikibugs>	 (CR) LD: frwiki: Enable the NewUserMessage extension (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1146707 (https://phabricator.wikimedia.org/T382199) (owner: LD)
[21:18:36] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[21:19:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:19:18] <logmsgbot>	 !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146726|styles: Set override also to former value of `line-height-small` token (T389900 T394305)]] (duration: 18m 45s)
[21:19:22] <stashbot>	 T389900: Font modes: Resolve line-height token discrepancies downstream - https://phabricator.wikimedia.org/T389900
[21:19:22] <stashbot>	 T394305: 1.45.0-wmf.1: When setting font size to "small", line-height is absolute, making lines with larger font-size cramped - https://phabricator.wikimedia.org/T394305
[21:20:47] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[21:21:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mobileapps_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[21:23:13] <jhathaway>	 o/
[21:23:25] <jhathaway>	 !incidents
[21:23:26] <sirenbot>	 6128 (UNACKED)  GatewayBackendErrorsHigh sre (mobileapps_cluster rest-gateway eqiad)
[21:23:26] <sirenbot>	 6124 (RESOLVED)  Host db1187 (paged)
[21:23:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[21:24:14] <swfrench-wmf>	 I need to leave for an appointment, but this might be a natural consequence of the shift from restbase -> PCS to rest-gateway -> PCS
[21:24:47] <swfrench-wmf>	 i.e. the pre-existing 5xxs moved from restbase being the client to rest-gateway, and thus are subject to this alert
[21:25:07] <swfrench-wmf>	 I was chatting with h.nowlan earlier today about this
[21:25:10] <jhathaway>	 hmm interesting
[21:25:24] <jhathaway>	 so are alerting thresholds may need adjustment?
[21:25:31] <jhathaway>	 *our
[21:25:47] <swfrench-wmf>	 yes, and Hugh was already considering doing that for the non-paging variant of the alert that's been firing intermittently
[21:26:03] <swfrench-wmf>	 this is the first time it's got above the paging threshold, though
[21:26:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsHigh: rest-gateway: high 5xx errors from mobileapps_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
[21:26:59] <jhathaway>	 okay, I suppose I'll just leave things as is for now then
[21:27:22] <swfrench-wmf>	 since it's self-resolving, then yeah - that sounds good
[21:27:41] <swfrench-wmf>	 if we see more of these transient blips, we might want to silence until Hugh can take a look Friday
[21:27:56] <swfrench-wmf>	 ref: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1146668 is the patch for the non-paging alert
[21:28:05] * swfrench-wmf out
[21:31:41] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[21:32:39] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.169.0" for 2 host(s)
[21:32:50] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:34:26] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.169.0" completed for 2 hosts
[21:35:12] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[21:40:00] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-upload_eqsin - <bound method SREBatchRunnerBase._reason of <cookbooks.sre.cdn.roll-upgrade-varnish.RollUpgradeVarnishRunner object at 0x7fc4014eef10>>
[21:44:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:47:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[21:49:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:50:36] <jinxer-wm>	 FIRING: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[21:51:04] <wikibugs>	 (PS1) Cwhite: logstash: add forcemerge job [puppet] - https://gerrit.wikimedia.org/r/1146736 (https://phabricator.wikimedia.org/T377018)
[21:51:05] <wikibugs>	 (PS1) Cwhite: logstash: add job schedule parameter [puppet] - https://gerrit.wikimedia.org/r/1146737 (https://phabricator.wikimedia.org/T377018)
[21:55:33] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp503[1-2].eqsin.wmnet} and A:cp - <bound method SREBatchRunnerBase._reason of <cookbooks.sre.cdn.roll-upgrade-varnish.RollUpgradeVarnishRunner object at 0x7f818c5f7df0>>
[21:57:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[21:59:16] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:00:42] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS bullseye
[22:00:49] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828360 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cass...
[22:02:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:03:28] <wikibugs>	 (PS1) BCornwall: cdn: Fix args reference [cookbooks] - https://gerrit.wikimedia.org/r/1146739
[22:05:36] <jinxer-wm>	 RESOLVED: GatewayBackendErrorsElevated: rest-gateway: elevated 5xx errors from mobileapps_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated
[22:07:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:09:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:10:20] <wikibugs>	 (CR) CI reject: [V:-1] cdn: Fix args reference [cookbooks] - https://gerrit.wikimedia.org/r/1146739 (owner: BCornwall)
[22:11:07] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_eqsin - <bound method SREBatchRunnerBase._reason of <cookbooks.sre.cdn.roll-upgrade-varnish.RollUpgradeVarnishRunner object at 0x7f87783bdac0>>
[22:12:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:14:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:20:34] <wikibugs>	 (CR) Dr0ptp4kt: [C:+2] Experimentation Lab: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1146731 (owner: Clare Ming)
[22:21:53] <wikibugs>	 (Merged) jenkins-bot: Experimentation Lab: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1146731 (owner: Clare Ming)
[22:23:39] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[22:27:00] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[22:27:52] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp503[1-2].eqsin.wmnet} and A:cp - <bound method SREBatchRunnerBase._reason of <cookbooks.sre.cdn.roll-upgrade-varnish.RollUpgradeVarnishRunner object at 0x7f818c5f7df0>>
[22:38:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:43:43] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10828422 (thcipriani) >>! In T393723#10813970, @Jdlrobson-WMF wrote: >> @Jdlrobson-WMF this seems like an odd question after all this time, but have you signed L3 Acknowledgement of Wikimed...
[23:03:40] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10828450 (Jhancock.wm) a:Jhancock.wm
[23:04:43] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10828452 (Jhancock.wm) a:Jhancock.wm
[23:31:39] <wikibugs>	 (PS1) Andrew Bogott: Octavia health manager: listen on <ipaddress> [puppet] - https://gerrit.wikimedia.org/r/1146770 (https://phabricator.wikimedia.org/T394099)
[23:31:44] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1146770 (https://phabricator.wikimedia.org/T394099) (owner: Andrew Bogott)
[23:33:54] <wikibugs>	 (PS2) Andrew Bogott: Octavia health manager: listen on 0.0.0.0 [puppet] - https://gerrit.wikimedia.org/r/1146770 (https://phabricator.wikimedia.org/T394099)
[23:35:08] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Octavia health manager: listen on 0.0.0.0 [puppet] - https://gerrit.wikimedia.org/r/1146770 (https://phabricator.wikimedia.org/T394099) (owner: Andrew Bogott)
[23:38:56] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1146777
[23:38:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1146777 (owner: TrainBranchBot)
[23:49:53] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1146777 (owner: TrainBranchBot)
[23:59:07] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye
[23:59:16] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10828513 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host...