[00:02:20] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9993182 (10wiki_willy) Thanks @elukey, that sounds good! >>! In T360356#9990691, @elukey wrote: > Filed a proposal in https://gerrit.wikimedia.org/r/1054894 > > @wiki_w... [00:02:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1054955 (owner: 10TrainBranchBot) [00:04:27] (03CR) 10Dzahn: [C:03+2] crm: add gnupg to crm role [puppet] - 10https://gerrit.wikimedia.org/r/1054953 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [00:05:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T367781)', diff saved to https://phabricator.wikimedia.org/P66801 and previous config saved to /var/cache/conftool/dbconfig/20240718-000500-arnaudb.json [00:05:21] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [00:08:04] (03CR) 10Dzahn: [C:03+2] "package is now installed on crm2001" [puppet] - 10https://gerrit.wikimedia.org/r/1054953 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [00:21:25] (03CR) 10Dwisehaupt: "Wonderful. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1054953 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [00:22:58] (03CR) 10Ladsgroup: [C:03+1] "I confirm." 
[puppet] - 10https://gerrit.wikimedia.org/r/1054797 (owner: 10Marostegui) [00:35:02] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic110[0-2]* for row maint - ryankemper@cumin2002 - T348977 [00:35:05] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic110[0-2]* for row maint - ryankemper@cumin2002 - T348977 [00:35:08] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [00:44:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:18:29] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:38:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [01:43:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [02:23:57] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9993327 (10Papaul) @Jclark-ctr check first with the service owner if those servers are ready for puppet 7 if they need to added to "hieradata/hosts" with ` profile::puppet::a... [02:28:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9993333 (10Papaul) @Jclark-ctr please update task with the error you are getting and what is on the console. [02:37:57] (03PS1) 10Scott French: WIP: mediawiki-cache-warmup: support 'clone' for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1054968 [02:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:29] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9993336 (10VRiley-WMF) @Papaul Thanks for the confirmation. On July 15th I asked you at 10:18AM if it would need a change and you responded that with confirmation at 11:00AM. The timestamp and comme... 
[02:44:29] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9993337 (10Papaul) @VRiley-WMF no problem let me know if you need any help since we did move one in codfw yesterday [02:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:07:29] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:37:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:46:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993353 (10Papaul) I do agree with the 2 options however there is a possibility too that Frack will be taking a new rack if we do the codfw... [03:52:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T367856)', diff saved to https://phabricator.wikimedia.org/P66802 and previous config saved to /var/cache/conftool/dbconfig/20240718-035218-marostegui.json [03:52:26] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:01:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9993358 (10Papaul) >>! In T365167#9956217, @elukey wrote: > @Papaul I compared the BIOS settings between sretest2001 and kubernetes2054, these are the differences: > >... 
[04:04:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9993359 (10Papaul) @elukey for the pxe booting testing we can use this server first. [04:07:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P66803 and previous config saved to /var/cache/conftool/dbconfig/20240718-040725-marostegui.json [04:22:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P66804 and previous config saved to /var/cache/conftool/dbconfig/20240718-042232-marostegui.json [04:22:56] (03CR) 10Marostegui: [C:03+2] filtered_tables.txt: Remove non existing tables [puppet] - 10https://gerrit.wikimedia.org/r/1054797 (owner: 10Marostegui) [04:37:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T367856)', diff saved to https://phabricator.wikimedia.org/P66805 and previous config saved to /var/cache/conftool/dbconfig/20240718-043739-marostegui.json [04:37:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [04:37:45] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:37:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [04:37:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [04:38:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [04:38:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367856)', diff saved to https://phabricator.wikimedia.org/P66806 and previous config saved to 
/var/cache/conftool/dbconfig/20240718-043817-marostegui.json [05:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:43:54] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T0600) [06:00:05] marostegui, Amir1, and arnaudb: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T0600). 
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:01] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9993459 (10fgiunchedi) Thank you all for the clarification, I'm glad we're able to do the move without re-addressing after all! This is what we did for centrallog2002 the other day in https://phabr... [07:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:00:20] indeed, nothing to do [07:01:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:04:07] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:06:53] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:11:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:12:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:12:03] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:12:53] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:13:12] (03CR) 10Aklapper: "I see these 11 short month names alrready in https://gerrit.wikimedia.org/r/plugins/gitiles/phabricator/translations/+/refs/heads/wmf/stab" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054634 
(https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [07:14:12] (03PS1) 10Kevin Bazira: ml-services: outlink_topic_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055085 (https://phabricator.wikimedia.org/T369344) [07:15:00] (03CR) 10Aklapper: [C:03+2] "Ah, I think I get it now - merging this will allow to not rely on testcases for translation, alright." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054634 (https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [07:15:01] (03CR) 10Ayounsi: [C:03+1] Add identifiers for ESI-LAGs to legacy switches on codfw row D spines [homer/public] - 10https://gerrit.wikimedia.org/r/1054942 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [07:15:04] (03CR) 10Aklapper: [V:03+2 C:03+2] Add extra date elements for arcanist [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054634 (https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [07:23:57] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:24:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:24:03] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:29:10] (03PS1) 10Kevin Bazira: ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055145 (https://phabricator.wikimedia.org/T369344) [07:32:27] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thank you!" 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1054635 (https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [07:36:00] (03PS1) 10Slavina Stefanova: envvars backend: update endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1055146 (https://phabricator.wikimedia.org/T365014) [07:37:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:37:29] (03CR) 10Slavina Stefanova: "depends on https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/40" [puppet] - 10https://gerrit.wikimedia.org/r/1055146 (https://phabricator.wikimedia.org/T365014) (owner: 10Slavina Stefanova) [07:38:44] 06SRE, 06Infrastructure-Foundations, 10netops: Issue creating GNMI telemetry subscription to certain QFX5120 devices - https://phabricator.wikimedia.org/T370366#9993525 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks for the investigation ! Seems like the last step was : ` asw1-b3-magru> restart a... [07:41:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054914 (https://phabricator.wikimedia.org/T370097) (owner: 10Dreamrimmer) [07:44:33] 06SRE, 10Continuous-Integration-Infrastructure, 06Infrastructure-Foundations, 06Release-Engineering-Team: package_builder python-all conflicts with base::standard_packages python2.7 removal - https://phabricator.wikimedia.org/T370337#9993557 (10hashar) `base::standard_packages()` `remove_python2` parameter... [07:48:43] (03CR) 10Elukey: [V:03+1] "Thanks for the review! So in theory IIUC the dcops users would end up being part of sre-admins, that is already in always_groups. 
I do see" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [07:51:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993574 (10ayounsi) Or we could just use a IPv6 /64 and stop worrying about space :) Thinking more globally, if we were to redo the product... [07:56:25] (03CR) 10Elukey: [C:03+1] "Added a nit but the rest looks good!" [software/homer] - 10https://gerrit.wikimedia.org/r/1054543 (owner: 10Ayounsi) [07:56:47] (03CR) 10Elukey: [C:03+1] CHANGELOG: add changelogs for release v0.6.7 (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1054543 (owner: 10Ayounsi) [07:57:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:58:02] excellent [07:58:07] * volans looking [07:58:09] !incidents [07:58:10] 4879 (UNACKED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [07:58:10] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [07:58:10] 4858 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [07:58:15] !ack 4879 [07:58:15] 4879 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [07:58:41] nice, looks like yesterday's one effie [07:59:01] wait I was cooking a dashboard [07:59:12] (moving to -sre) [08:01:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:01:44] (03PS1) 10Hashar: ci: keep python2 packages on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1055147 (https://phabricator.wikimedia.org/T367544) [08:01:54] Looks like there were no backports earlier [08:02:00] so I will now start promoting group1 wikis again to 1.43.0-wmf.14 [08:03:31] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055148 (https://phabricator.wikimedia.org/T366959) [08:03:32] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055148 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [08:04:07] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:04:17] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055148 (https://phabricator.wikimedia.org/T366959) (owner: 10TrainBranchBot) [08:06:25] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:08:04] (03CR) 10Hashar: "I have cherry picked that on the integration Puppet master and I have confirmed Puppet on `integration-agent-pkgbuilder-1003.integration.e" [puppet] - 10https://gerrit.wikimedia.org/r/1055147 
(https://phabricator.wikimedia.org/T367544) (owner: 10Hashar) [08:09:44] (03PS2) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.7 [software/homer] - 10https://gerrit.wikimedia.org/r/1054543 [08:09:49] (03CR) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.7 (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1054543 (owner: 10Ayounsi) [08:13:40] !log aklapper@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.14 refs T366959 [08:13:44] T366959: 1.43.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T366959 [08:15:26] (03CR) 10Vgutierrez: Add public suffix list module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (owner: 10BCornwall) [08:17:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:19:12] (03CR) 10Brouberol: "Apologies for the slow review, I was OOO. The change looks good, but I'm not familiar with the istio config. Could we pair on the deployme" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:20:40] (03PS26) 10Arnaudb: mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) [08:21:46] (03CR) 10JMeybohm: "I'm on PTO starting next week. 
Maybe @tklausmann@wikimedia.org can help out here at I think the ml config is pretty similar to the dse one" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:22:42] (03CR) 10Btullis: [C:03+1] datahub-next: upgrade datahub to 0.13.3 (latest version) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051786 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [08:23:16] (03CR) 10Brouberol: [C:03+2] datahub-next: upgrade datahub to 0.13.3 (latest version) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051786 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [08:24:11] (03CR) 10Filippo Giunchedi: [C:03+1] mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [08:24:56] (03CR) 10Elukey: "I can help as well if needed!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:25:05] (03CR) 10Arnaudb: [C:03+2] mysqld-exporter: hotfix config for es1 to es5 [puppet] - 10https://gerrit.wikimedia.org/r/1053698 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [08:26:39] (03CR) 10Brouberol: "@ltoscano@wikimedia.org that'd be greatly appreciated m(_ _)m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:27:05] (03PS4) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [08:30:11] (03PS1) 10Elukey: profile::docker::reporter: remove unnecessary filters [puppet] - 10https://gerrit.wikimedia.org/r/1055150 (https://phabricator.wikimedia.org/T367427) [08:38:03] jouncebot: nowandnext [08:38:03] No deployments scheduled for the next 1 hour(s) and 21 minute(s) [08:38:03] In 1 hour(s) and 21 
minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1000) [08:41:59] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v8.8.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055152 [08:42:36] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: configure the split graph updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054584 (https://phabricator.wikimedia.org/T361935) (owner: 10DCausse) [08:43:29] (03Merged) 10jenkins-bot: rdf-streaming-updater: configure the split graph updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054584 (https://phabricator.wikimedia.org/T361935) (owner: 10DCausse) [08:47:00] (03CR) 10Ayounsi: CHANGELOG: add changelogs for release v8.8.0 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055152 (owner: 10Elukey) [08:47:01] (03PS3) 10Hashar: grafana: clone grafana-grizzly with default parameters [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) [08:47:02] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [08:47:06] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [08:47:24] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [08:47:59] (03PS3) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.7 [software/homer] - 10https://gerrit.wikimedia.org/r/1054543 [08:49:19] (03PS2) 10Elukey: CHANGELOG: add changelogs for release v8.8.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055152 [08:49:31] (03CR) 10Elukey: CHANGELOG: add changelogs for release v8.8.0 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055152 (owner: 10Elukey) [08:49:59] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: outlink_topic_model from src dir [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1055085 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [08:50:24] (03PS3) 10Elukey: CHANGELOG: add changelogs for release v8.8.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055152 [08:51:01] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [08:51:12] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [08:51:31] (03CR) 10Ayounsi: [C:03+1] "ship it!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055152 (owner: 10Elukey) [08:51:43] (03CR) 10Elukey: CHANGELOG: add changelogs for release v8.8.0 (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055152 (owner: 10Elukey) [08:52:49] (03PS4) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.7 [software/homer] - 10https://gerrit.wikimedia.org/r/1054543 [08:53:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9993751 (10cmooney) >>! In T370164#9993574, @ayounsi wrote: > Or we could just use a IPv6 /64 and stop worrying about space :) One day :)... 
[08:54:40] (03CR) 10Hashar: "Puppet Catalogue Compiler output at https://puppet-compiler.wmflabs.org/output/1054889/1474/an-web1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/1054889 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [08:55:48] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [08:56:01] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [08:59:18] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v8.8.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055152 (owner: 10Elukey) [09:00:48] (03CR) 10Filippo Giunchedi: [C:03+2] logstash: add auto_offset_reset to kafka input [puppet] - 10https://gerrit.wikimedia.org/r/1042917 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [09:02:09] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [09:02:33] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [09:03:28] (03PS1) 10Elukey: Upstream release v8.8.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1055154 [09:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:23] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [09:06:55] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [09:08:58] !log disabled check-private-data.timer on clouddb1021, pending decom. 
[09:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:55] (03CR) 10Elukey: [C:03+2] Upstream release v8.8.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1055154 (owner: 10Elukey) [09:13:06] are there any ongoing incidents at the moment? especially anything that might affect the job queue? [09:13:08] * Lucas_WMDE looking into https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DDelay%20injecting%20Recent%20Changes%2C%20aggregated%20across%20client%20wikis%20alert [09:18:10] (03PS1) 10Arnaudb: mariadb: reducing pt-heartbeat monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1055155 (https://phabricator.wikimedia.org/T369720) [09:18:10] (03CR) 10Arnaudb: "this fixes false positives on PrometheusMysqldExporterFailed" [alerts] - 10https://gerrit.wikimedia.org/r/1055155 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [09:20:41] (03PS1) 10Brouberol: an-launcher: use the datahub kafka registry URL instead of Karapace [puppet] - 10https://gerrit.wikimedia.org/r/1055156 (https://phabricator.wikimedia.org/T363461) [09:22:25] (03PS2) 10Brouberol: an-launcher: use the datahub kafka registry URL instead of Karapace [puppet] - 10https://gerrit.wikimedia.org/r/1055156 (https://phabricator.wikimedia.org/T363461) [09:22:27] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055156 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [09:23:14] (03CR) 10Btullis: [C:03+2] statistics: remove git::clone file mode [puppet] - 10https://gerrit.wikimedia.org/r/1054889 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [09:26:27] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [09:26:55] !log uploaded spicerack_8.8.0 to apt.wikimedia.org bullseye-wikimedia [09:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:17] (03CR) 10Btullis: 
[C:03+1] "Nice." [puppet] - 10https://gerrit.wikimedia.org/r/1055156 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [09:28:37] Lucas_WMDE: sorry I missed your message, not [09:28:40] no* [09:30:22] (03CR) 10Brouberol: [C:03+2] an-launcher: use the datahub kafka registry URL instead of Karapace [puppet] - 10https://gerrit.wikimedia.org/r/1055156 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [09:38:49] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:42:08] (03CR) 10Kevin Bazira: [C:03+2] ml-services: outlink_topic_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055085 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [09:43:04] (03Merged) 10jenkins-bot: ml-services: outlink_topic_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055085 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [09:43:38] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [09:43:54] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [09:44:33] !log upgrade spicerack to 8.8.0 on cumin2002 - testing the new release [09:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:08] !log kevinbazira@deploy1002 
helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:46:49] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:48:49] RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:52:09] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [09:56:12] effie: okay, thanks [09:56:17] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [09:56:22] guess I’ll keep looking around then (or hope it resolves itself ^^) [09:56:37] * Lucas_WMDE got sidetracked looking into the unrelated task T370396 [09:59:35] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: langid from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055145 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1000) [10:04:18] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [10:08:10] (03PS1) 10JMeybohm: ermbox: Enable mesh for termbox-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055164 (https://phabricator.wikimedia.org/T368523) [10:08:26] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2432 [10:17:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2432 [10:23:20] (FTR, I think I found the cause of my alert and it’s a metric problem, nothing to worry about) [10:28:07] !log cgoubert@cumin1002 END (ERROR) 
- Cookbook sre.hosts.convert-disks (exit_code=97) for host mw2432 [10:29:22] (03CR) 10Cathal Mooney: [C:03+2] Add identifiers for ESI-LAGs to legacy switches on codfw row D spines [homer/public] - 10https://gerrit.wikimedia.org/r/1054942 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [10:29:53] (03Merged) 10jenkins-bot: Add identifiers for ESI-LAGs to legacy switches on codfw row D spines [homer/public] - 10https://gerrit.wikimedia.org/r/1054942 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [10:38:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2432.codfw.wmnet with OS buster [10:42:38] (03PS1) 10Cathal Mooney: Add monitoring checks for codfw row D spines [puppet] - 10https://gerrit.wikimedia.org/r/1055169 (https://phabricator.wikimedia.org/T366941) [10:54:35] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [10:56:36] (03CR) 10Volans: [C:03+1] "LGTM, but I have little context on the specific problem." [puppet] - 10https://gerrit.wikimedia.org/r/1054864 (https://phabricator.wikimedia.org/T370255) (owner: 10MVernon) [10:57:12] (03CR) 10MVernon: [C:03+2] cephadm::target mask the podman-auto-update service [puppet] - 10https://gerrit.wikimedia.org/r/1054864 (https://phabricator.wikimedia.org/T370255) (owner: 10MVernon) [10:59:19] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:34] (03PS1) 10Brouberol: datahub-next: upgrade datahub to 0.13.3 (latest version) with the jettty XML 
file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055175 (https://phabricator.wikimedia.org/T363461) [11:01:18] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: podman-auto-update failures - https://phabricator.wikimedia.org/T370255#9994199 (10MatthewVernon) 05Open→03Resolved service masked on targets, so we shouldn't see this again. [11:01:51] (03PS2) 10Brouberol: datahub-next: upgrade datahub to 0.13.3 (latest version) with the jettty XML file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055175 (https://phabricator.wikimedia.org/T363461) [11:02:17] (03PS1) 10Btullis: Increase the heap for the mapreduce history service on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) [11:02:40] (03CR) 10CI reject: [V:04-1] Increase the heap for the mapreduce history service on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) (owner: 10Btullis) [11:02:53] (03CR) 10Brouberol: [C:03+2] datahub-next: upgrade datahub to 0.13.3 (latest version) with the jettty XML file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055175 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [11:03:35] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-07-09-154549 to 2024-07-17-145805 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054909 (https://phabricator.wikimedia.org/T364413) [11:03:38] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-07-09-154549 to 2024-07-17-145805 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054909 (https://phabricator.wikimedia.org/T364413) (owner: 10Jforrester) [11:03:43] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [11:04:38] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-07-09-154549 to 2024-07-17-145805 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1054909 (https://phabricator.wikimedia.org/T364413) (owner: 10Jforrester) [11:04:48] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:04:51] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:05:36] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [11:05:55] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:06:10] (03PS2) 10Btullis: Increase the heap for the mapreduce history service on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) [11:07:50] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:07:53] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:09:30] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:10:17] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [11:10:39] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:12:06] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [11:12:11] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [11:13:06] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new IRB interfaces codfw - cmooney@cumin1002" [11:14:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new IRB interfaces codfw - cmooney@cumin1002" [11:14:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:14:08] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [11:15:36] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [11:15:49] FIRING: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:18:18] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 26): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3305/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) (owner: 10Btullis) [11:20:48] (03PS3) 10Btullis: Increase the heap for the mapreduce history service on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) [11:20:49] (03PS2) 10JMeybohm: termbox: Enable mesh for termbox-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055164 (https://phabricator.wikimedia.org/T368523) [11:20:49] RESOLVED: HelmReleaseBadStatus: Helm release datahub-next/staging on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datahub-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:25:02] (03PS1) 10Michael Große: fix(editor): make PageTitleControl reliably blankable [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055182 (https://phabricator.wikimedia.org/T370326) [11:25:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 
10https://gerrit.wikimedia.org/r/1055182 (https://phabricator.wikimedia.org/T370326) (owner: 10Michael Große) [11:26:08] (03PS1) 10Cathal Mooney: Disable config for RA generation on Spines in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1055183 (https://phabricator.wikimedia.org/T366941) [11:31:36] (03CR) 10Ayounsi: [C:03+1] Add monitoring checks for codfw row D spines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055169 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [11:31:47] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 24 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) (owner: 10Btullis) [11:32:46] (03CR) 10Ayounsi: [C:03+1] Disable config for RA generation on Spines in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1055183 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [11:35:19] (03CR) 10Cathal Mooney: [C:03+2] Disable config for RA generation on Spines in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1055183 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [11:35:47] (03Merged) 10jenkins-bot: Disable config for RA generation on Spines in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1055183 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [11:36:49] (03Abandoned) 10BCornwall: ncmonitor: Set path for public suffix domain list [puppet] - 10https://gerrit.wikimedia.org/r/1054073 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [11:37:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:38:05] (03PS13) 10BCornwall: ncmonitor: Add public suffix 
list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) [11:39:55] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2432.mgmt.codfw.wmnet with reboot policy GRACEFUL [11:42:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2432.mgmt.codfw.wmnet with reboot policy GRACEFUL [11:46:28] (03CR) 10Vgutierrez: ncmonitor: Add public suffix list module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [11:47:41] (03PS1) 10Ayounsi: Netbox 4: point prod service to new servers [puppet] - 10https://gerrit.wikimedia.org/r/1055187 (https://phabricator.wikimedia.org/T336275) [11:49:40] PROBLEM - Etcd cluster health on wikikube-ctrl2002 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [11:49:54] PROBLEM - Etcd cluster health on conf2004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [11:49:54] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - toolhub_4011: Servers mw2451.codfw.wmnet, wikikube-worker2007.codfw.wmnet, mw2396.codfw.wmnet, kubernetes2058.codfw.wmnet, mw2302.codfw.wmnet, mw2376.codfw.wmnet, wikikube-worker2019.codfw.wmnet, wikikube-worker2031.codfw.wmnet, kubernetes2030.codfw.wmnet, mw2446.codfw.wmnet, mw2373.codfw.wmnet, kubernetes2035.codfw.wmnet, kubernetes2017.codfw.wmnet, [11:49:54] tes2012.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2009.codfw.wmnet, mw2449.codfw.wmnet, mw2338.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2022.codfw.wmnet, mw2374.codfw.wmnet, mw2369.codfw.wmnet, mw2382.codfw.wmnet, mw2297.codfw.wmnet, parse2018.codfw.wmnet, mw2295.codfw.wmnet, mw2370.codfw.wmnet, kubernetes2056.codfw.wmnet, mw2335.codfw.wmnet, wikikube-worker2008.codfw.wmnet, kubernetes2043.codfw.wmne [11:49:54] ube-worker2021.codfw.wmnet, 
kubernetes2019.codfw.wmnet, mw2351.codfw.wmnet, mw2385.codfw.wmnet, wikikube-worker2028.codfw.wmnet, parse2013.codfw.wmnet, parse2010.codfw.wmnet, wikikube-w https://wikitech.wikimedia.org/wiki/PyBal [11:49:54] PROBLEM - Etcd cluster health on kubestagemaster2005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [11:49:58] FIRING: [3x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:49:58] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - toolhub_4011: Servers kubernetes2060.codfw.wmnet, mw2350.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2357.codfw.wmnet, wikikube-worker2003.codfw.wmnet, mw2446.codfw.wmnet, mw2422.codfw.wmnet, wikikube-worker2024.codfw.wmnet, mw2442.codfw.wmnet, parse2019.codfw.wmnet, mw2294.codfw.wmnet, mw2414.codfw.wmnet, mw2434.codfw.wmnet, kubernetes2020.codfw [11:49:58] mw2310.codfw.wmnet, kubernetes2038.codfw.wmnet, mw2394.codfw.wmnet, mw2338.codfw.wmnet, mw2370.codfw.wmnet, kubernetes2034.codfw.wmnet, kubernetes2048.codfw.wmnet, parse2003.codfw.wmnet, mw2387.codfw.wmnet, mw2297.codfw.wmnet, wikikube-worker2014.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2355.codfw.wmnet, mw2293.codfw.wmnet, kubernetes2036.codfw.wmnet, kubernetes2029.codfw.wmnet, parse2015.codfw.wmnet, kubernetes2005.codfw.wmnet, mw2402. 
[11:49:58] net, kubernetes2008.codfw.wmnet, mw2319.codfw.wmnet, wikikube-worker2030.codfw.wmnet, mw2380.codfw.wmnet, mw2449.codfw.wmnet, mw2318.codfw.wmnet, mw2292.codfw.wmnet, kubernetes2027.codf https://wikitech.wikimedia.org/wiki/PyBal [11:50:06] PROBLEM - Etcd cluster health on ml-staging-etcd2003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [11:50:06] PROBLEM - Etcd cluster health on ml-etcd2001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [11:50:07] * volans looking [11:50:13] FIRING: [4x] ProbeDown: Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:50:13] acked [11:50:16] PROBLEM - Restbase root url on restbase2031 is CRITICAL: connect to address 10.192.32.30 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [11:50:18] PROBLEM - OpenSearch health check for shards on 9200 on logstash2030 is CRITICAL: CRITICAL - elasticsearch inactive shards 716 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 9, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 591, active_shards: 804, relocating_shards: 6, initializing_shards: 14, unassigned_shards: 702, [11:50:18] unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 52.89473684210526 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:50:20] PROBLEM - OpenSearch health check for shards on 9200 on logstash2031 is CRITICAL: CRITICAL - elasticsearch inactive shards 716 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 9, number_of_data_nodes: 6, discovered_master: 
True, discovered_cluster_manager: True, active_primary_shards: 591, active_shards: 804, relocating_shards: 6, initializing_shards: 14, unassigned_shards: 702, [11:50:20] unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 52.89473684210526 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:50:20] PROBLEM - OpenSearch health check for shards on 9200 on logstash2032 is CRITICAL: CRITICAL - elasticsearch inactive shards 716 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 9, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 591, active_shards: 804, relocating_shards: 6, initializing_shards: 14, unassigned_shards: 702, [11:50:20] unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 52.89473684210526 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:50:22] PROBLEM - Restbase root url on restbase2027 is CRITICAL: connect to address 10.192.48.16 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [11:50:22] PROBLEM - Restbase root url on restbase2026 is CRITICAL: connect to address 10.192.48.169 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [11:50:26] PROBLEM - OpenSearch health check for shards on 9200 on logstash2024 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:50:30] PROBLEM - Restbase root url on restbase2034 is CRITICAL: connect to address 10.192.48.67 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [11:50:31] here [11:50:32] ok [11:50:34] PROBLEM - Docker registry health on registry2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 235 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Docker [11:50:36] something in codfw went down [11:50:38] PROBLEM - MariaDB Replica IO: s8 on db2167 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2161.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2161.codfw.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:38] PROBLEM - MariaDB Replica IO: s2 on db2138 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2204.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2204.codfw.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:38] PROBLEM - MariaDB Replica IO: x1 on db2115 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2196.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:38] PROBLEM - MariaDB Replica IO: s3 on db2127 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2205.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2205.codfw.wmnet (-3) 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:38] PROBLEM - MariaDB Replica IO: s1 on db2130 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2203.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2203.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:40] let's move to -sre [11:50:40] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2004.codfw.wmnet:443/v2/bullseye/manifests/latest - 362 bytes in 0.129 second response time https://wikitech.wikimedia.org/wiki/Docker [11:50:41] FIRING: [8x] ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:50:42] PROBLEM - MariaDB Replica IO: s7 on db2200 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:44] PROBLEM - OpenSearch health check for shards on 9200 on logstash2025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:50:44] PROBLEM - MariaDB Replica IO: x1 on db2191 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2196.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:44] PROBLEM - MariaDB Replica IO: s5 on db2192 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2123.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:44] PROBLEM - MariaDB Replica IO: s8 on db2195 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2161.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2161.codfw.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:44] PROBLEM - MariaDB Replica IO: s3 on db2190 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2205.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2205.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:46] PROBLEM - MariaDB Replica IO: s6 on db2197 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2214.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2214.codfw.wmnet (110 Connection timed out) 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:46] PROBLEM - MariaDB Replica IO: s4 on db2199 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2179.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2179.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:46] PROBLEM - MariaDB Replica IO: es6 on es2036 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es2035.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es2035.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:46] PROBLEM - MariaDB Replica IO: s1 on db2212 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2203.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2203.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:48] PROBLEM - MariaDB Replica IO: s1 on db2216 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2203.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2203.codfw.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:52] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2003.codfw.wmnet:443/v2/bullseye/manifests/latest - 362 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Docker [11:50:52] PROBLEM - 
MariaDB Replica IO: s5 on db2213 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2123.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:54] FIRING: [100x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:50:54] PROBLEM - Docker registry health on registry2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 235 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Docker [11:50:56] PROBLEM - MariaDB Replica IO: s3 on db2139 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2205.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2205.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:57] PROBLEM - MariaDB Replica IO: s4 on db2147 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2179.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2179.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:58] PROBLEM - MariaDB Replica IO: s1 on db2170 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2203.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2203.codfw.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:58] PROBLEM - MariaDB Replica IO: s4 on db2137 is 
CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2179.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2179.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:50:58] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1305 threshold =0.2 breach: cluster_name: production-search-omega-codfw, status: red, timed_out: False, number_of_nodes: 13, number_of_data_nodes: 13, active_primary_shards: 1655, active_shards: 3662, relocating_shards: 0, initializing_shards: 22, unassigned_shards: 1283, delayed_unassigned_shards: 0, num [11:50:58] ending_tasks: 21, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 37695, active_shards_percent_as_number: 73.7265955305013 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:50:58] PROBLEM - MariaDB Replica IO: s1 on db2173 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2203.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2203.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:02] PROBLEM - MariaDB Replica IO: s3 on db2149 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2205.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2205.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:02] PROBLEM - MariaDB Replica IO: s8 on db2166 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master 
repl2024@db2161.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2161.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:02] PROBLEM - MariaDB Replica IO: s7 on db2159 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:02] PROBLEM - OpenSearch health check for shards on 9200 on logstash2026 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:04] PROBLEM - MariaDB Replica IO: s1 on db2174 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2203.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2203.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:04] PROBLEM - MariaDB Replica IO: s6 on db2158 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2214.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2214.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:04] PROBLEM - MariaDB Replica IO: s8 on db2181 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2161.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 
message: Cant connect to server on db2161.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:04] PROBLEM - MariaDB Replica IO: s5 on db2171 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2123.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:04] PROBLEM - MariaDB Replica IO: x1 on db2131 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2196.codfw.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:05] PROBLEM - MariaDB Replica IO: s6 on db2124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2214.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2214.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:05] PROBLEM - MariaDB Replica IO: s8 on db2152 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2161.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2161.codfw.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:06] PROBLEM - MariaDB Replica IO: s5 on db2128 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2123.codfw.wmnet 
(110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:06] PROBLEM - MariaDB Replica IO: s7 on db2121 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:07] PROBLEM - MariaDB Replica IO: s7 on db2122 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:07] PROBLEM - MariaDB Replica IO: s1 on db2116 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2203.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2203.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:07] FIRING: [12x] ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:08] PROBLEM - MariaDB Replica IO: s2 on db2126 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2204.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2204.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:08] PROBLEM 
- OpenSearch health check for shards on 9200 on logstash2023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:09] PROBLEM - MariaDB Replica IO: s3 on db2209 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2205.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2205.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:09] PROBLEM - MariaDB Replica IO: s2 on db2207 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2204.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2204.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:10] PROBLEM - MariaDB Replica IO: x1 on db2215 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2196.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2196.codfw.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:10] PROBLEM - MariaDB Replica IO: es6 on es2037 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@es2035.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es2035.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:11] PROBLEM - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - 
elasticsearch https://search.svc.codfw.wmnet:9643/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9643): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:11] PROBLEM - MariaDB Replica IO: s7 on db2198 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2218.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2218.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:12] PROBLEM - MariaDB Replica IO: s3 on db2194 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2205.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2205.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:12] PROBLEM - MariaDB Replica IO: s5 on db2211 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db2123.codfw.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:16] PROBLEM - OpenSearch health check for shards on 9200 on logstash2036 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:16] PROBLEM - OpenSearch health check for shards on 9200 on logstash2034 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:16] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd2001 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:17] PROBLEM - OpenSearch health check for shards on 9200 on logstash2027 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:17] PROBLEM - OpenSearch health check for shards on 9200 on logstash2033 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:36] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 22.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:51:36] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd2003 is CRITICAL: CRITICAL - elasticsearch inactive shards 715 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 9, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 591, active_shards: 805, relocating_shards: 5, initializing_shards: 15, unassigned_shards: 700 [11:51:36] d_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 52.960526315789465 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:36] PROBLEM - OpenSearch health check for shards on 9200 on logstash2028 is CRITICAL: CRITICAL - elasticsearch inactive shards 715 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 9, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 591, active_shards: 805, relocating_shards: 5, initializing_shards: 15, unassigned_shards: 700, [11:51:37] unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 52.960526315789465 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:37] PROBLEM - MariaDB Replica IO: s2 on db2125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2204.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2204.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:37] PROBLEM - MariaDB Replica IO: m5 on db2160 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2135.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2135.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:51:54] PROBLEM - OpenSearch health check for shards on 9200 on logstash2029 is CRITICAL: CRITICAL - elasticsearch inactive shards 715 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 9, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 591, active_shards: 805, 
relocating_shards: 5, initializing_shards: 15, unassigned_shards: 700, [11:51:54] unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 52.960526315789465 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:51:58] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: red, timed_out: False, number_of_nodes: 13, number_of_data_nodes: 13, active_primary_shards: 1655, active_shards: 4463, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 496, delayed_unassigned_shards: 0, number_of_pending_task [11:51:58] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 97557, active_shards_percent_as_number: 89.85302999798671 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:52:02] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1722 threshold =0.2 breach: cluster_name: production-search-codfw, status: red, timed_out: False, number_of_nodes: 29, number_of_data_nodes: 29, active_primary_shards: 1325, active_shards: 2351, relocating_shards: 0, initializing_shards: 114, unassigned_shards: 1608, delayed_unassigned_shards: 0, number_o [11:52:02] g_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 57.72158114411982 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:52:04] RECOVERY - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: yellow, timed_out: False, number_of_nodes: 15, number_of_data_nodes: 15, active_primary_shards: 1642, active_shards: 4894, relocating_shards: 0, initializing_shards: 2, 
unassigned_shards: 29, delayed_unassigned_shards: 0, number_of_pending_tasks: [11:52:04] er_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.37055837563452 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:52:06] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [11:52:08] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd2002 is CRITICAL: CRITICAL - elasticsearch inactive shards 715 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 9, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 591, active_shards: 805, relocating_shards: 5, initializing_shards: 15, unassigned_shards: 700 [11:52:08] d_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 52.960526315789465 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:52:08] PROBLEM - OpenSearch health check for shards on 9200 on logstash2037 is CRITICAL: CRITICAL - elasticsearch inactive shards 715 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 9, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 591, active_shards: 805, relocating_shards: 5, initializing_shards: 15, unassigned_shards: 700, [11:52:08] unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 52.960526315789465 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:52:08] PROBLEM - OpenSearch health check for shards on 9200 on logstash2035 is CRITICAL: CRITICAL - elasticsearch inactive shards 715 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 9, number_of_data_nodes: 6, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 591, active_shards: 805, relocating_shards: 5, initializing_shards: 15, unassigned_shards: 700, [11:52:09] unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 52.960526315789465 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:52:28] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:38] FIRING: [4x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:47] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 21.73% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:52:58] PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [11:53:31] FIRING: ZookeeperQuorumLost: Zookeeper cluster main-codfw has lost quorum - https://wikitech.wikimedia.org/wiki/Zookeeper - https://grafana.wikimedia.org/d/000000261/zookeeper?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DZookeeperQuorumLost [11:53:35] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:53:36] jouncebot: nowandnext [11:53:37] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [11:53:37] In 0 hour(s) and 6 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1200) [11:53:39] FIRING: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [11:53:40] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [11:53:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:53:49] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:53:52] !incidents [11:53:52] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw) [11:53:52] 4881 (UNACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [11:53:53] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [11:53:53] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [11:53:55] !ack 4881 [11:53:56] 4881 (ACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [11:54:04] Dreamy_Jazz: Not the moment I'm afraid [11:54:14] (PS1) Volans: Emergency depool of codfw [dns] - https://gerrit.wikimedia.org/r/1055189 [11:54:21] Understood. [11:54:29] Do you think things will be alright by the window? [11:54:36] (CR) BCornwall: [C:+1] Emergency depool of codfw [dns] - https://gerrit.wikimedia.org/r/1055189 (owner: Volans) [11:54:52] (CR) Arnaudb: [C:+1] Emergency depool of codfw [dns] - https://gerrit.wikimedia.org/r/1055189 (owner: Volans) [11:54:56] Dreamy_Jazz: no idea what's happening yet, we're in reactive mode rn [11:54:59] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:55:04] FIRING: [2x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:55:04] (PS1) Effie Mouzeli: depool codfw as it looks like it is in trouble [dns] - 
https://gerrit.wikimedia.org/r/1055190 [11:55:08] FIRING: [39x] KubernetesCalicoDown: kubernetes2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:55:13] (CR) CI reject: [V:-1] depool codfw as it looks like it is in trouble [dns] - https://gerrit.wikimedia.org/r/1055190 (owner: Effie Mouzeli) [11:55:24] (CR) Volans: [C:+2] Emergency depool of codfw [dns] - https://gerrit.wikimedia.org/r/1055189 (owner: Volans) [11:55:36] Dreamy_Jazz: check on #wikimedia-sre, here will be impossible to follow with the alert spam [11:55:37] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:55:42] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [11:55:53] FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:55:57] FIRING: [73x] KubernetesCalicoDown: kubernetes2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:56:05] FIRING: [10x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:56:16] FIRING: WdqsStreamingUpdaterFlinkJobNotRunning: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [11:56:30] FIRING: [13x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:56:39] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:56:52] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:55] FIRING: [288x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:00] FIRING: [32x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:57:00] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [11:57:08] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.codfw.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9443): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:57:15] FIRING: CalicoTyphaDown: Too few (1) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [11:57:27] FIRING: [3x] KubernetesAPINotScrapable: k8s@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [11:57:30] FIRING: [24x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:38] !incidents [11:57:39] PROBLEM - MariaDB Replica Lag: s4 on db2137 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.77 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:39] PROBLEM - MariaDB Replica Lag: s5 on db2171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:39] PROBLEM - MariaDB Replica Lag: x1 on db2131 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:39] PROBLEM - MariaDB Replica Lag: s1 on db2174 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:39] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw) [11:57:39] 4881 (ACKED) EtcdReplicationDown etcd sre 
(conf2005:8000 etcdmirror codfw) [11:57:39] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [11:57:39] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [11:57:39] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [11:57:40] PROBLEM - MariaDB Replica Lag: s5 on db2128 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:40] PROBLEM - MariaDB Replica Lag: s8 on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:40] PROBLEM - MariaDB Replica Lag: s2 on db2126 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:40] PROBLEM - MariaDB Replica Lag: s6 on db2124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:40] PROBLEM - MariaDB Replica Lag: s8 on db2181 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:41] PROBLEM - MariaDB Replica Lag: s1 on db2130 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:41] PROBLEM - MariaDB Replica Lag: s1 on db2116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:42] PROBLEM - MariaDB Replica Lag: s3 on db2194 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.09 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:43] FIRING: [29x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:46] PROBLEM - MariaDB Replica Lag: s4 on db2199 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 607.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:46] PROBLEM - MariaDB Replica Lag: s5 on db2211 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 608.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:46] PROBLEM - MariaDB Replica Lag: x1 on db2215 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 608.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:46] PROBLEM - MariaDB Replica Lag: s3 on db2190 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:47] PROBLEM - MariaDB Replica Lag: s1 on db2212 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:47] PROBLEM - MariaDB Replica Lag: s8 on db2195 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:47] PROBLEM - MariaDB Replica Lag: s1 on db2216 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:48] PROBLEM - MariaDB Replica Lag: s5 on db2213 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:48] !incidents [11:57:49] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw) 
[11:57:49] 4881 (ACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [11:57:49] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [11:57:49] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [11:57:54] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:57:58] PROBLEM - MariaDB Replica Lag: m5 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 619.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:58] PROBLEM - MariaDB Replica Lag: s4 on db2147 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:58] PROBLEM - MariaDB Replica Lag: s1 on db2170 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:57:58] PROBLEM - MariaDB Replica Lag: s8 on db2167 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:00] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: red, timed_out: False, number_of_nodes: 13, number_of_data_nodes: 13, active_primary_shards: 1655, active_shards: 4965, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of_pending_tasks: [11:58:00] er_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.95973424602376 
https://wikitech.wikimedia.org/wiki/Search%23Administration [11:58:04] PROBLEM - MariaDB Replica Lag: s7 on db2159 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:04] PROBLEM - MariaDB Replica Lag: s8 on db2166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:04] PROBLEM - MariaDB Replica Lag: s6 on db2158 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:04] PROBLEM - MariaDB Replica Lag: s2 on db2138 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:04] PROBLEM - MariaDB Replica Lag: s3 on db2149 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.84 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:05] PROBLEM - MariaDB Replica Lag: x2 #page on db2144 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:05] PROBLEM - MariaDB Replica Lag: x1 on db2115 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:06] PROBLEM - MariaDB Replica Lag: s7 on db2122 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:06] PROBLEM - MariaDB Replica Lag: s3 on db2127 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:07] PROBLEM - MariaDB Replica Lag: s7 on db2121 is CRITICAL: CRITICAL 
slave_sql_lag Replication lag: 626.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:07] FIRING: [18x] ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:08] PROBLEM - MariaDB Replica Lag: s3 on db2209 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 629.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:08] PROBLEM - MariaDB Replica Lag: s2 on db2207 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:08] PROBLEM - MariaDB Replica Lag: es6 on es2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:09] PROBLEM - MariaDB Replica Lag: x1 on db2191 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:09] PROBLEM - MariaDB Replica Lag: es6 on es2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:10] PROBLEM - MariaDB Replica Lag: s7 on db2200 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:10] PROBLEM - MariaDB Replica Lag: s5 on db2192 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:10] !incidents [11:58:10] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw) [11:58:11] 4881 (ACKED) EtcdReplicationDown etcd sre (conf2005:8000 
etcdmirror codfw) [11:58:11] 4882 (UNACKED) db2144 (paged)/MariaDB Replica Lag: x2 (paged) [11:58:11] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [11:58:11] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [11:58:12] FIRING: [70x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:58:15] !ack 4882 [11:58:15] FIRING: [170x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:16] 4882 (ACKED) db2144 (paged)/MariaDB Replica Lag: x2 (paged) [11:58:29] FIRING: [33x] ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:36] (PS1) Ilias Sarantopoulos: Revert "ml-services: outlink_topic_model from src dir" [deployment-charts] - https://gerrit.wikimedia.org/r/1055192 (https://phabricator.wikimedia.org/T370408) [11:58:36] PROBLEM - MariaDB Replica Lag: s2 on db2125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 659.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:36] !incidents [11:58:37] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw) [11:58:37] 4881 (ACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [11:58:37] 4882 (ACKED) db2144 (paged)/MariaDB Replica Lag: x2 (paged) [11:58:37] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [11:58:38] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [11:58:38] PROBLEM - MariaDB Replica Lag: s1 on db2173 is CRITICAL: 
CRITICAL slave_sql_lag Replication lag: 660.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:58:44] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 22.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:58:55] FIRING: [7x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:59:06] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [11:59:21] FIRING: [13x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:59:49] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:59:57] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29689 bytes in 5.136 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [11:59:58] RECOVERY - 
CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [12:00:02] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:00:05] !incidents [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1200) [12:00:05] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw) [12:00:06] 4881 (ACKED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [12:00:06] 4882 (ACKED) db2144 (paged)/MariaDB Replica Lag: x2 (paged) [12:00:06] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [12:00:06] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [12:00:34] FIRING: [11x] KubernetesRsyslogDown: rsyslog on kubernetes2017:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:00:34] RECOVERY - Docker registry health on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker [12:00:40] RECOVERY - MariaDB Replica IO: x1 on db2115 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:40] RECOVERY - MariaDB Replica IO: s1 on db2130 is OK: OK 
slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:42] RECOVERY - Etcd cluster health on wikikube-ctrl2002 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [12:00:44] RECOVERY - MariaDB Replica IO: s7 on db2200 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:46] RECOVERY - MariaDB Replica IO: s5 on db2192 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:46] RECOVERY - MariaDB Replica IO: x1 on db2191 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:46] RECOVERY - MariaDB Replica IO: s3 on db2190 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:46] RECOVERY - MariaDB Replica IO: s8 on db2195 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:47] RECOVERY - MariaDB Replica IO: es6 on es2036 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:47] RECOVERY - MariaDB Replica IO: s6 on db2197 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:48] RECOVERY - MariaDB Replica IO: s4 on db2199 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:48] RECOVERY - MariaDB Replica IO: s1 on db2212 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:49] RECOVERY - MariaDB Replica IO: s1 on db2216 is OK: OK slave_io_state 
Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:54] RECOVERY - MariaDB Replica IO: s5 on db2213 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:54] RECOVERY - Etcd cluster health on kubestagemaster2005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [12:00:54] RECOVERY - Etcd cluster health on conf2004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [12:00:54] RECOVERY - Docker registry health on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Docker [12:00:56] RECOVERY - MariaDB Replica IO: s3 on db2139 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:58] RECOVERY - MariaDB Replica IO: s4 on db2147 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:58] RECOVERY - MariaDB Replica IO: s1 on db2170 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:58] RECOVERY - MariaDB Replica IO: s4 on db2137 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:58] RECOVERY - MariaDB Replica IO: s1 on db2173 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:02] RECOVERY - MariaDB Replica IO: s8 on db2166 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:02] RECOVERY - MariaDB Replica IO: s3 on db2149 is OK: OK slave_io_state Slave_IO_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:02] RECOVERY - MariaDB Replica IO: s7 on db2159 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:02] RECOVERY - MariaDB Replica IO: s1 on db2174 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:02] RECOVERY - MariaDB Replica IO: s6 on db2158 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:03] RECOVERY - MariaDB Replica IO: s8 on db2181 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:03] RECOVERY - MariaDB Replica IO: s5 on db2171 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:04] RECOVERY - MariaDB Replica Lag: s6 on db2158 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:04] RECOVERY - MariaDB Replica Lag: x2 #page on db2144 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:05] RECOVERY - MariaDB Replica Lag: x1 on db2115 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:05] RECOVERY - MariaDB Replica IO: x1 on db2131 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:06] RECOVERY - MariaDB Replica IO: s6 on db2124 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:06] RECOVERY - MariaDB Replica IO: s8 on 
db2152 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:06] FIRING: [15x] KubernetesRsyslogDown: rsyslog on kubernetes2017:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:01:07] RECOVERY - MariaDB Replica IO: s5 on db2128 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:07] RECOVERY - MariaDB Replica IO: s7 on db2121 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:08] RECOVERY - MariaDB Replica IO: s7 on db2122 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:08] RECOVERY - MariaDB Replica IO: s1 on db2116 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:16] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [12:01:20] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - ... 
[12:01:25] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqcodfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqiad.cirrussearch.update_pipeline.update.rc0&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesI [12:01:30] Bot kicked for flood :D [12:01:52] (PS1) Volans: Revert "Emergency depool of codfw" [dns] - https://gerrit.wikimedia.org/r/1055193 [12:02:13] (CR) Arnaudb: [C:+1] Revert "Emergency depool of codfw" [dns] - https://gerrit.wikimedia.org/r/1055193 (owner: Volans) [12:02:23] FIRING: [179x] KubernetesCalicoDown: kubernetes2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:02:44] FIRING: [7x] KubernetesAPINotScrapable: k8s@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [12:02:49] FIRING: [234x] KubernetesCalicoDown: kubernetes2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:02:57] FIRING: [19x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:03:14] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:03:22] (CR) Volans: [C:+2] Revert "Emergency depool of codfw" [dns] - https://gerrit.wikimedia.org/r/1055193 (owner: Volans) [12:03:28] FIRING: [19x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:03:39] FIRING: [291x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:03:44] FIRING: [90x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:04:02] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:04:10] FIRING: [24x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:04:19] FIRING: [2x] CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [12:04:21] !incidents [12:04:21] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw) [12:04:21] 4881 
(RESOLVED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [12:04:22] 4882 (RESOLVED) db2144 (paged)/MariaDB Replica Lag: x2 (paged) [12:04:22] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [12:04:22] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [12:04:22] FIRING: [30x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:04:26] !ack 4880 [12:04:27] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw) [12:04:29] FIRING: [24x] ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:04:32] !incidents [12:04:32] 4880 (ACKED) [3x] ProbeDown sre (ip4 probes/service codfw) [12:04:32] 4881 (RESOLVED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [12:04:32] 4882 (RESOLVED) db2144 (paged)/MariaDB Replica Lag: x2 (paged) [12:04:33] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [12:04:33] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [12:04:39] FIRING: [240x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:04:53] FIRING: [33x] ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:05:00] RESOLVED: ZookeeperQuorumLost: Zookeeper cluster main-codfw has lost quorum - https://wikitech.wikimedia.org/wiki/Zookeeper - https://grafana.wikimedia.org/d/000000261/zookeeper?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=main-codfw - 
https://alerts.wikimedia.org/?q=alertname%3DZookeeperQuorumLost [12:05:04] RESOLVED: EtcdReplicationDown: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [12:05:12] (CR) Klausman: [C:+1] Revert "ml-services: outlink_topic_model from src dir" [deployment-charts] - https://gerrit.wikimedia.org/r/1055192 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos) [12:05:58] !incidents [12:05:58] 4880 (RESOLVED) [3x] ProbeDown sre (ip4 probes/service codfw) [12:05:59] 4881 (RESOLVED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [12:05:59] 4882 (RESOLVED) db2144 (paged)/MariaDB Replica Lag: x2 (paged) [12:05:59] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [12:05:59] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [12:05:59] (CR) Kevin Bazira: [C:+1] Revert "ml-services: outlink_topic_model from src dir" [deployment-charts] - https://gerrit.wikimedia.org/r/1055192 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos) [12:06:16] FIRING: [36x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:06:32] (CR) Ilias Sarantopoulos: [C:+2] Revert "ml-services: outlink_topic_model from src dir" [deployment-charts] - https://gerrit.wikimedia.org/r/1055192 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos) [12:07:06] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:07:40] 
!incidents [12:07:41] 4880 (RESOLVED) [3x] ProbeDown sre (ip4 probes/service codfw) [12:07:41] 4881 (RESOLVED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [12:07:41] 4882 (RESOLVED) db2144 (paged)/MariaDB Replica Lag: x2 (paged) [12:07:41] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [12:07:41] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [12:07:47] (Merged) jenkins-bot: Revert "ml-services: outlink_topic_model from src dir" [deployment-charts] - https://gerrit.wikimedia.org/r/1055192 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos) [12:07:51] FIRING: [31x] KubernetesRsyslogDown: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:07:57] (PS14) BCornwall: ncmonitor: Add public suffix list module [puppet] - https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) [12:08:47] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:08:55] RESOLVED: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [12:08:58] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.rc0` is too low - TODO - ... 
[12:09:06] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqcodfw.cirrussearch.update_pipeline.update.rc0&var-topic=eqiad.cirrussearch.update_pipeline.update.rc0&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesI [12:09:11] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:09:16] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:09:30] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[12:09:54] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [12:09:59] RESOLVED: [2x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:00] (Abandoned) Effie Mouzeli: depool codfw as it looks like it is in trouble [dns] - https://gerrit.wikimedia.org/r/1055190 (owner: Effie Mouzeli) [12:10:05] FIRING: [241x] KubernetesCalicoDown: kubernetes2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:10:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host mw2432.codfw.wmnet with OS buster [12:10:41] RESOLVED: [8x] KubernetesAPINotScrapable: k8s-mlstaging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [12:10:50] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:10:54] RESOLVED: [19x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:10:58] (PS15) BCornwall: ncmonitor: Add public suffix list module [puppet] - https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) [12:11:14] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki 
memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:11:27] FIRING: [3x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:32] FIRING: [90x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:11:35] RESOLVED: [303x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:11:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:12:04] RESOLVED: [3x] CalicoTyphaDown: Too few (1) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [12:12:24] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:12:32] RESOLVED: [29x] ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:46] RESOLVED: [3x] CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [12:12:58] RESOLVED: [31x] ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:13:34] !incidents [12:13:34] 4880 (RESOLVED) [3x] ProbeDown sre (ip4 probes/service codfw) [12:13:34] 4881 (RESOLVED) EtcdReplicationDown etcd sre (conf2005:8000 etcdmirror codfw) [12:13:34] 4882 (RESOLVED) db2144 (paged)/MariaDB Replica Lag: x2 (paged) [12:13:35] 4879 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet esams) [12:13:35] 4859 (RESOLVED) db1219 (paged)/MariaDB Replica Lag: s1 (paged) [12:13:50] RESOLVED: [4x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:13:59] RESOLVED: [36x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:32] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:14:47] !log restarting sync-puppet-volatile on puppetserver2001 [12:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:59] RESOLVED: [31x] KubernetesRsyslogDown: rsyslog on kubernetes2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:15:04] FIRING: KubernetesRsyslogDown: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-staging-ctrl2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:15:27] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... 
[12:15:33] Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:15:50] FIRING: [3x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [12:15:54] RESOLVED: [240x] KubernetesCalicoDown: kubernetes2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:16:10] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:16:21] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [12:16:33] RESOLVED: WdqsStreamingUpdaterFlinkJobNotRunning: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [12:16:42] FIRING: [2x] 
RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:16:56] FIRING: [3x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:17:00] RESOLVED: [3x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:17:04] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:17:52] RESOLVED: KubernetesRsyslogDown: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-staging-ctrl2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:18:09] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3307/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [12:18:15] RESOLVED: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... 
[12:18:18] (03CR) 10BCornwall: ncmonitor: Add public suffix list module (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [12:18:21] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: let wikidatawiki bypass optimization (deduplication) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054850 (https://phabricator.wikimedia.org/T365831) (owner: 10Peter Fischer) [12:18:21] Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:18:25] FIRING: [4x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [12:18:29] RESOLVED: [2x] RdfStreamingUpdaterFlinkJobUnstable: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:18:38] RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:19:15] (03Merged) 10jenkins-bot: Search update pipeline: let wikidatawiki bypass optimization (deduplication) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054850 (https://phabricator.wikimedia.org/T365831) (owner: 10Peter Fischer) [12:19:24] RESOLVED: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:32] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, as a followup review please consider reformatting expressions to be more readable (e.g. check out PrometheusLowRetention)" [alerts] - 10https://gerrit.wikimedia.org/r/1055155 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [12:21:41] (03PS16) 10BCornwall: ncmonitor: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) [12:22:31] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3308/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [12:23:22] FIRING: [4x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [12:23:44] (03CR) 10Arnaudb: [C:03+2] mariadb: reducing pt-heartbeat monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1055155 (https://phabricator.wikimedia.org/T369720) (owner: 10Arnaudb) [12:24:11] (03PS1) 10Cathal Mooney: Disable BGP peering from cr2-codfw to ssw1-d8-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1055205 (https://phabricator.wikimedia.org/T366941) [12:24:46] (03CR) 10CI reject: [V:04-1] Disable BGP peering from cr2-codfw to ssw1-d8-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1055205 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [12:25:27] !log update spicerack to 8.8.0 on cumin1002 [12:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:22] Dreamy_Jazz: You can go ahead with your deployment [12:26:29] Thanks! 
[12:26:39] Sorry for the delay [12:26:46] No problem. Site incidents come first. [12:26:56] I can also wait till the window, as I had previously added it there. [12:27:17] jouncebot: nowandnext [12:27:18] For the next 0 hour(s) and 32 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1200) [12:27:18] In 0 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1300) [12:27:35] Yeah, as you wish [12:27:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2432.codfw.wmnet with reason: host reimage [12:28:04] (03PS3) 10Dreamy Jazz: [GlobalBlocking] Enable global account blocks on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049915 (https://phabricator.wikimedia.org/T356924) [12:28:18] (03CR) 10BCornwall: [V:03+2] "Verified the script and service unit on ncmonitor1001 to behave as expected" [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [12:28:22] RESOLVED: [4x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [12:29:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049915 (https://phabricator.wikimedia.org/T356924) (owner: 10Dreamy Jazz) [12:29:24] (03PS1) 10Daimona Eaytoy: [arwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055206 (https://phabricator.wikimedia.org/T370066) [12:29:37] (03PS1) 10Brouberol: datahub-next: upgrade datahub to 0.13.3 (latest version) with upgraded jetty [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055207 (https://phabricator.wikimedia.org/T363461) [12:29:44] (03CR) 10Ayounsi: [C:03+1] "+1 once CI is happy" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1055205 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [12:30:04] (03Merged) 10jenkins-bot: [GlobalBlocking] Enable global account blocks on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1049915 (https://phabricator.wikimedia.org/T356924) (owner: 10Dreamy Jazz) [12:30:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055206 (https://phabricator.wikimedia.org/T370066) (owner: 10Daimona Eaytoy) [12:30:52] !log dreamyjazz@deploy1002 Started scap sync-world: Backport for [[gerrit:1049915|[GlobalBlocking] Enable global account blocks on all wikis (T356924)]] [12:30:59] T356924: Deploy global account blocks to WMF wikis - https://phabricator.wikimedia.org/T356924 [12:31:10] (03Abandoned) 10Cathal Mooney: Disable BGP peering from cr2-codfw to ssw1-d8-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1055205 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [12:31:41] (03CR) 10Brouberol: [C:03+2] datahub-next: upgrade datahub to 0.13.3 (latest version) with upgraded jetty [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055207 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [12:32:16] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [12:32:21] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:32:55] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:33:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2432.codfw.wmnet with reason: host reimage [12:34:01] !log dreamyjazz@deploy1002 dreamyjazz: Backport for 
[[gerrit:1049915|[GlobalBlocking] Enable global account blocks on all wikis (T356924)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:35:02] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [12:35:26] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [12:35:42] (03CR) 10Filippo Giunchedi: [C:04-1] mariadb: observability - adds shard information on recording rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283) (owner: 10Arnaudb) [12:36:45] (03CR) 10DCausse: "lgtm, left a question" [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [12:39:39] (03PS1) 10Peter Fischer: Search update pipeline: let wikidatawiki bypass optimization (deduplication) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055209 (https://phabricator.wikimedia.org/T365831) [12:39:59] (03CR) 10Filippo Giunchedi: mariadb: tweaks monitoring thresholds for replication lag (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1054893 (https://phabricator.wikimedia.org/T367279) (owner: 10Arnaudb) [12:40:03] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1049915|[GlobalBlocking] Enable global account blocks on all wikis (T356924)]] (duration: 09m 10s) [12:40:08] T356924: Deploy global account blocks to WMF wikis - https://phabricator.wikimedia.org/T356924 [12:41:03] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: let wikidatawiki bypass optimization (deduplication) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055209 (https://phabricator.wikimedia.org/T365831) (owner: 10Peter Fischer) [12:42:44] (03Merged) 10jenkins-bot: Search update pipeline: let wikidatawiki bypass optimization (deduplication) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055209 (https://phabricator.wikimedia.org/T365831) (owner: 10Peter 
Fischer) [12:43:08] (03CR) 10Klausman: "I am available as well!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [12:43:47] (03PS2) 10Arnaudb: mariadb: observability - adds shard information on recording rule [puppet] - 10https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283) [12:46:31] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:48:26] (03CR) 10Arnaudb: mariadb: observability - adds shard information on recording rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054884 (https://phabricator.wikimedia.org/T367283) (owner: 10Arnaudb) [12:49:05] (03CR) 10Elukey: [C:03+1] Netbox 4: point prod service to new servers [puppet] - 10https://gerrit.wikimedia.org/r/1055187 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [12:50:24] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove entries for IRB ints on row D spines - cmooney@cumin1002" [12:51:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove entries for IRB ints on row D spines - cmooney@cumin1002" [12:51:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:52:36] !log re-enabling BGP between spine-layer switches in codfw (problematic IP interfaces have been deleted) T366941 [12:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:40] T366941: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941 [12:53:43] (03PS2) 10Klausman: team-dcops/mgmt: Change runbook link to one with BMC info [alerts] - 10https://gerrit.wikimedia.org/r/1032406 [12:53:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in codfw - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:53:47] (03CR) 10Filippo Giunchedi: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [alerts] - 10https://gerrit.wikimedia.org/r/1032406 (owner: 10Klausman) [12:54:55] (03PS17) 10BCornwall: ncmonitor: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) [12:54:57] (03CR) 10Klausman: [C:03+2] team-dcops/mgmt: Change runbook link to one with BMC info [alerts] - 10https://gerrit.wikimedia.org/r/1032406 (owner: 10Klausman) [12:55:24] (03CR) 10Hashar: "Hi Keith, this simplifies the git::clone for grafana/grizzly and make it closer to the defaults. I wrote the rationale in the commit messa" [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [12:55:52] !log re-enabling interface et-1/0/2 on cr2-codfw which connects to ssw1-d8-codfw (problematic IP interfaces have been deleted) T366941 [12:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:57] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3309/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [12:56:11] (03Merged) 10jenkins-bot: team-dcops/mgmt: Change runbook link to one with BMC info [alerts] - 10https://gerrit.wikimedia.org/r/1032406 (owner: 10Klausman) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1300). 
[13:00:05] Dreamy_Jazz, DreamRimmer, and MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] * MichaelG_WMF is here [13:00:13] \o [13:00:19] Already done mine just before the window. [13:04:13] Dreamy_Jazz: if you have time and are around for a bit, could you do mine as well? Seems none of the usual deployers for this window happens to be here. But if that is not possible, then that's ok too. I can also try the next window [13:04:37] Sure. [13:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:49] Thank you! <3 [13:06:25] I presume you will be able to test your change? [13:07:23] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9994586 (10Papaul) No problem just give me a heads up on when y'all want to do the changes. 
[13:07:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:42] Dreamy_Jazz: yes, I will :) [13:07:49] (03PS1) 10JMeybohm: Prometheus: Add recording rules for istio ingress metrics [puppet] - 10https://gerrit.wikimedia.org/r/1055213 (https://phabricator.wikimedia.org/T369607) [13:07:59] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2432.codfw.wmnet with OS buster [13:08:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054914 (https://phabricator.wikimedia.org/T370097) (owner: 10Dreamrimmer) [13:08:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055182 (https://phabricator.wikimedia.org/T370326) (owner: 10Michael Große) [13:09:17] (03Merged) 10jenkins-bot: Allow Bureaucrats on Foundation Wiki to be able to remove Sysop rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054914 (https://phabricator.wikimedia.org/T370097) (owner: 10Dreamrimmer) [13:10:01] * Lucas_WMDE around now [13:11:06] I am deploying the two changes left in the window currently Lucas. [13:11:14] ok, thanks! [13:16:14] (03CR) 10Hashar: "In Puppet, that is the only call to `git::clone` using `mode => 0444` and it should be fine to use the default of `0755`." [puppet] - 10https://gerrit.wikimedia.org/r/1054890 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [13:16:47] (03CR) 10Brouberol: [C:03+1] "Thank you, good catch!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1054551 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [13:20:09] (03CR) 10JMeybohm: Prometheus: Add recording rules for istio ingress metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055213 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm) [13:21:29] gate-and-submit-wmf taking its time... [13:23:26] (03CR) 10Filippo Giunchedi: [C:03+2] data-platform: fix datahub availability [alerts] - 10https://gerrit.wikimedia.org/r/1054551 (https://phabricator.wikimedia.org/T354255) (owner: 10Filippo Giunchedi) [13:28:30] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1055146 (https://phabricator.wikimedia.org/T365014) (owner: 10Slavina Stefanova) [13:29:41] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-conf1004.eqiad.wmnet with OS bookworm [13:29:49] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9994678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm [13:30:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (GET leases) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:33:14] (03CR) 10David Caro: [C:03+2] envvars backend: update endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1055146 (https://phabricator.wikimedia.org/T365014) (owner: 10Slavina Stefanova) [13:33:34] Any moment now 🤞 [13:33:44] (03CR) 10David Caro: [C:03+2] "Tested in toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/1055146 (https://phabricator.wikimedia.org/T365014) (owner: 10Slavina Stefanova) [13:33:50] (03Merged) 
10jenkins-bot: fix(editor): make PageTitleControl reliably blankable [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055182 (https://phabricator.wikimedia.org/T370326) (owner: 10Michael Große) [13:33:54] yay [13:34:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9994697 (10Jclark-ctr) @Papaul they to fail start pxe I have downgraded firmware on nic and set correct ports for pxe. but still continue to fail t... [13:34:24] !log dreamyjazz@deploy1002 Started scap sync-world: Backport for [[gerrit:1054914|Allow Bureaucrats on Foundation Wiki to be able to remove Sysop rights (T370097)]], [[gerrit:1055182|fix(editor): make PageTitleControl reliably blankable (T370326)]] [13:34:29] T370097: Allow Bureaucrats on Foundation Governance Wiki (foundation.wikimedia.org) to be able to remove Sysop/Admin rights - https://phabricator.wikimedia.org/T370097 [13:34:29] T370326: Trying to blank a field appears to succeed but no edit was saved - https://phabricator.wikimedia.org/T370326 [13:35:33] FIRING: [3x] KubernetesAPILatency: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:35:37] (03PS1) 10Marostegui: db1179: Status [puppet] - 10https://gerrit.wikimedia.org/r/1055221 [13:35:54] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#9994710 (10Marostegui) [13:36:57] !log dreamyjazz@deploy1002 migr, dreamyjazz, dreamrimmer: Backport for [[gerrit:1054914|Allow Bureaucrats on Foundation Wiki to be able to remove Sysop rights (T370097)]], [[gerrit:1055182|fix(editor): make PageTitleControl 
reliably blankable (T370326)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:36:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9994708 (10elukey) After a chat with Papaul, we would like to test if the Juniper DHCP injection to implement Option 82 could cause any of this (basical... [13:37:06] (03CR) 10Marostegui: [C:03+2] db1179: Status [puppet] - 10https://gerrit.wikimedia.org/r/1055221 (owner: 10Marostegui) [13:38:39] Dreamy_Jazz: my change "[[gerrit:1055182|fix(editor): make PageTitleControl reliably blankable (T370326)]]" works on the debug server! [13:38:47] Thanks! [13:39:29] !log dreamyjazz@deploy1002 migr, dreamyjazz, dreamrimmer: Continuing with sync [13:39:45] Proceeding. I verified that the config change was applied using https://foundation.wikimedia.org/wiki/Special:ListGroupRights [13:40:33] FIRING: [4x] KubernetesAPILatency: High Kubernetes API latency (GET configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:40:39] (03PS18) 10BCornwall: ncmonitor: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) [13:41:33] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3310/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [13:43:54] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - 
https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:44:23] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1054914|Allow Bureaucrats on Foundation Wiki to be able to remove Sysop rights (T370097)]], [[gerrit:1055182|fix(editor): make PageTitleControl reliably blankable (T370326)]] (duration: 09m 59s) [13:44:28] T370097: Allow Bureaucrats on Foundation Governance Wiki (foundation.wikimedia.org) to be able to remove Sysop/Admin rights - https://phabricator.wikimedia.org/T370097 [13:44:28] T370326: Trying to blank a field appears to succeed but no edit was saved - https://phabricator.wikimedia.org/T370326 [13:45:29] MichaelG_WMF: That is the change deployed. [13:45:33] FIRING: [11x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:46:25] !log Afternoon UTC backport window done [13:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:38] (03PS4) 10Slyngshede: Permissions: Allow users to request new permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 [13:47:40] (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [13:48:03] (03CR) 10CI reject: [V:04-1] Permissions: Allow users to request new permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 (owner: 10Slyngshede) [13:48:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9994764 (10elukey) If we pick sretest2001, this would be the config: ` 
elukey@install2004:/etc/dhcp/automation/proxies$ cat ttyS1-115200.conf # Automat... [13:48:24] (03CR) 10BCornwall: [V:03+1 C:03+2] ncmonitor: Add public suffix list module [puppet] - 10https://gerrit.wikimedia.org/r/1054069 (https://phabricator.wikimedia.org/T369114) (owner: 10BCornwall) [13:48:35] jouncebot: nowandnext [13:48:35] For the next 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1300) [13:48:35] In 1 hour(s) and 11 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1500) [13:48:47] I’ll try to roll out a change for https://phabricator.wikimedia.org/T368523 then [13:48:56] (cc jayme ^^) [13:49:17] (03PS3) 10JMeybohm: termbox: Enable mesh for termbox-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055164 (https://phabricator.wikimedia.org/T368523) [13:49:22] uh...good luck :D [13:49:37] or was that just a proof of concept that’s not ready to deploy yet? 
:D [13:49:59] (my plan would be to try deploying it and just revert if it doesn’t work) [13:50:33] !log Release ncmonitor 1.1.0-1 to bookworm-wikimedia [13:50:33] FIRING: [11x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:44] I think it should work :D - but I obviously did not test [13:50:53] fair ^^ [13:51:08] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Looks good to me, let’s try it… but I’ll add T355955 to the commit message, I think we’re kind of crossing over into that task ^^" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055164 (https://phabricator.wikimedia.org/T368523) (owner: 10JMeybohm) [13:51:15] (03PS4) 10Lucas Werkmeister (WMDE): termbox: Enable mesh for termbox-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055164 (https://phabricator.wikimedia.org/T355955) (owner: 10JMeybohm) [13:51:25] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] termbox: Enable mesh for termbox-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055164 (https://phabricator.wikimedia.org/T355955) (owner: 10JMeybohm) [13:51:59] Lucas_WMDE: lmk if it does not work, I can take a look then before you roll back (assuming it's fine to be down for a minute) [13:52:09] will do, thanks [13:52:20] it’s absolutely fine for it to be down for a bit, I’d say [13:52:23] (03Merged) 10jenkins-bot: termbox: Enable mesh for termbox-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055164 (https://phabricator.wikimedia.org/T355955) (owner: 10JMeybohm) [13:52:39] at least as far as product functionality goes… and I think any potential errors shouldn’t be frequent enough to cause logspam problems [13:53:38] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply [13:53:53] * 
Lucas_WMDE looks at long diff [13:53:53] (03CR) 10Clément Goubert: [C:03+1] profile::docker::reporter: remove unnecessary filters [puppet] - 10https://gerrit.wikimedia.org/r/1055150 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [13:53:56] (03PS1) 10Joely Rooke WMDE: Add wikibase client interaction stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055228 (https://phabricator.wikimedia.org/T370045) [13:53:59] (that’s a whole mesh being included, I guess ^^) [13:54:25] (03PS5) 10Slyngshede: Permissions: Allow users to request new permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1052924 [13:54:51] yesh it's a mesh [13:54:59] :D [13:55:33] FIRING: [11x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:55:59] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply [13:56:10] (03CR) 10Clément Goubert: [C:03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1054919 (https://phabricator.wikimedia.org/T361250) (owner: 10Dzahn) [13:56:37] I see an SSR termbox \o/ [13:57:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9994794 (10Papaul) @Jclark-ctr that are some helpful informations I will take a look at it once on site. [13:57:32] (03CR) 10Brouberol: [C:03+1] Increase the heap for the mapreduce history service on an-master1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) (owner: 10Btullis) [13:57:51] jayme: as far as I can tell it’s completely working \o/ [13:57:56] thanks a lot! [13:58:03] nice! 
yw [13:58:03] I’ll see if I can figure out the follow-up steps you mentioned ^^ [13:58:11] (03CR) 10Btullis: [V:03+1 C:03+2] Increase the heap for the mapreduce history service on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) (owner: 10Btullis) [13:58:14] but let’s see if the node20 bump works this time [13:58:28] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9994796 (10elukey) It seems clear that for the foreseeable future (next 6/8 months) we will not have the DHCP hostna... [13:59:01] (03PS1) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1055230 [13:59:04] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1055231 [13:59:07] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1055232 [13:59:13] (03PS4) 10Btullis: Increase the heap for the mapreduce history service on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) [13:59:52] (03CR) 10Btullis: Increase the heap for the mapreduce history service on an-master1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) (owner: 10Btullis) [14:00:28] (03Abandoned) 10BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1050484 (owner: 10Ncmonitor) [14:00:33] RESOLVED: [7x] KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:54] (03PS1) 10Kevin Bazira: ml-services: update articletopic-outlink images 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1055233 (https://phabricator.wikimedia.org/T370408) [14:01:02] (03PS1) 10Lucas Werkmeister (WMDE): Revert "termbox: revert test deployment to 2024-03-14-121904-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055234 (https://phabricator.wikimedia.org/T368523) [14:01:08] oh and I should roll out the mesh change to the other two clusters [14:01:12] if only to confirm there’s no diff [14:01:20] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply [14:01:27] 10ops-eqiad, 06SRE, 06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9994819 (10fgiunchedi) >>! In T369825#9994586, @Papaul wrote: > No problem just give me a heads up on when y'all want to do the changes. ok thank you! how's your morning tomorrow to carry out this... [14:01:34] ah, of course there’s a diff, the version gets bumped from 0.1.13 to 0.1.14 ^^ [14:01:40] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [14:01:40] but nothing else [14:01:44] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply [14:01:53] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [14:02:22] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] Revert "termbox: revert test deployment to 2024-03-14-121904-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055234 (https://phabricator.wikimedia.org/T368523) (owner: 10Lucas Werkmeister (WMDE)) [14:03:16] (03Merged) 10jenkins-bot: Revert "termbox: revert test deployment to 2024-03-14-121904-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055234 (https://phabricator.wikimedia.org/T368523) (owner: 10Lucas Werkmeister (WMDE)) [14:03:22] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish 
module - https://phabricator.wikimedia.org/T365372#9994832 (10ayounsi) I'd suggest to abstract the device creation by a custom script or cookbook. This could run addit... [14:05:00] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply [14:05:19] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply [14:05:24] weird, why does the staging diff only show the version bump *now* and not in the previous deployment [14:06:25] (03CR) 10Ilias Sarantopoulos: "The images seem to be wrong according to the pipelines output. Please recheck! https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/i" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055233 (https://phabricator.wikimedia.org/T370408) (owner: 10Kevin Bazira) [14:06:45] works AFAICT \o/ just gonna apply eqiad and codfw again to confirm no diff [14:06:47] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply [14:06:49] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [14:06:50] that's strange indeed. It should have been part of your initial deploy, together with the new service object, all the envoy config etc [14:06:53] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply [14:06:57] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [14:07:00] [narrator] there was no diff [14:07:05] thanks for applying the version bump as well [14:07:09] np [14:07:11] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9994847 (10cmooney) There is possibly a variant of option 1: - Create a new custom script to add devices, which has... 
[14:07:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:41] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) (owner: 10Btullis) [14:12:02] (03PS2) 10Kevin Bazira: ml-services: update articletopic-outlink images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055233 (https://phabricator.wikimedia.org/T370408) [14:13:23] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Seems right!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055233 (https://phabricator.wikimedia.org/T370408) (owner: 10Kevin Bazira) [14:14:34] (03CR) 10Kevin Bazira: [C:03+2] "🤦 docker-registry.wikimedia.org delays to update" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055233 (https://phabricator.wikimedia.org/T370408) (owner: 10Kevin Bazira) [14:14:55] (03CR) 10Brouberol: [C:03+1] Increase the heap for the mapreduce history service on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) (owner: 10Btullis) [14:15:29] (03Merged) 10jenkins-bot: ml-services: update articletopic-outlink images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055233 (https://phabricator.wikimedia.org/T370408) (owner: 10Kevin Bazira) [14:16:36] (03PS1) 10Clément Goubert: kubernetes: rename 4 appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1055237 (https://phabricator.wikimedia.org/T351074) [14:17:19] !log kevinbazira@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[14:21:55] (03CR) 10CDobbins: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:22:57] jayme: if you have a moment – should the test termbox already be reachable over TLS now? [14:23:07] I tried to `curl https://termbox-test.staging.svc.eqiad.wmnet:4018/termbox` from deploy1002 and got a cert error [14:23:32] (if I try to telnet or openssl s_client to it, it doesn’t connect, so I think curl is using a proxy from the default shell environment) [14:24:33] FIRING: [3x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:25:39] (03PS1) 10Arturo Borrero Gonzalez: openstack: opentofu: git pull the repo using checkout mode [puppet] - 10https://gerrit.wikimedia.org/r/1055241 [14:26:36] (03CR) 10Kamila Součková: [C:03+1] kubernetes: rename 4 appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1055237 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:27:35] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055241 (owner: 10Arturo Borrero Gonzalez) [14:28:47] (03CR) 10Elukey: [C:03+2] profile::docker::reporter: remove unnecessary filters [puppet] - 10https://gerrit.wikimedia.org/r/1055150 (https://phabricator.wikimedia.org/T367427) (owner: 10Elukey) [14:29:33] RESOLVED: [3x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
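(editor's note on the 14:23 messages) The guess that curl is picking up a proxy from the default shell environment can be checked by looking at the proxy variables directly. A minimal sketch using Python's standard library follows; `proxy_for` is a hypothetical helper name, and the lookup shown is urllib's reading of `https_proxy` / `no_proxy`, which is similar to curl's but not guaranteed to be identical.

```python
import urllib.request
from typing import Optional


def proxy_for(url_scheme: str, host: str) -> Optional[str]:
    """Return the proxy URL urllib would use for `host`, or None for a direct connection.

    Reads the *_proxy variables from the process environment, as urllib does.
    """
    proxies = urllib.request.getproxies_environment()
    # no_proxy entries (exact host or domain suffix) disable the proxy for that host.
    if urllib.request.proxy_bypass_environment(host, proxies):
        return None
    return proxies.get(url_scheme)


if __name__ == "__main__":
    # Host name taken from the log; the result depends on the local environment.
    print(proxy_for("https", "termbox-test.staging.svc.eqiad.wmnet"))
```

If this prints a proxy URL while a raw socket connection to the same host fails, that would be consistent with the explanation given in the channel.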
[14:29:43] (03CR) 10FNegri: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1055241 (owner: 10Arturo Borrero Gonzalez) [14:30:39] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: opentofu: git pull the repo using checkout mode [puppet] - 10https://gerrit.wikimedia.org/r/1055241 (owner: 10Arturo Borrero Gonzalez) [14:31:06] (03PS5) 10Ebrahim: Enable ICU provided alphabetical order in the Kurdish wikis categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054641 (https://phabricator.wikimedia.org/T48235) [14:31:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:36:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:38:26] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2433 [14:38:32] (03CR) 10Effie Mouzeli: [C:03+1] kubernetes: rename 4 appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1055237 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:38:40] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2433.codfw.wmnet [14:39:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:20] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-conf1004.eqiad.wmnet with OS bookworm [14:40:25] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9994990 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm executed with errors: - an-co... [14:40:57] (03PS1) 10Volans: mysql_legacy: fix Instance's upgrade path [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055244 (https://phabricator.wikimedia.org/T367496) [14:41:33] FIRING: [4x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:43:45] (03CR) 10Arnaudb: [C:03+1] mysql_legacy: fix Instance's upgrade path [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055244 (https://phabricator.wikimedia.org/T367496) (owner: 10Volans) [14:44:05] `curl -k 'https://termbox-test.staging.svc.eqiad.wmnet:4018/termbox?language=de&entity=Q123&revision=1134&editLink=/edit/Q123&preferredLanguages=de|en'` (disabling TLS verification) works, FWIW ^^ [14:45:14] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9995019 (10elukey) @Papaul the proposal that would be the best compromise is to add a "mgmt mac-address" field to ht... 
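(editor's note on the 14:44 message) `curl -k` succeeding while plain `curl` fails means the service answers over TLS but its certificate is not trusted by the client's default CA bundle. A rough Python equivalent of the two modes is sketched below; `make_tls_context` is a hypothetical helper, and the optional `cafile` argument stands in for whatever internal CA bundle would apply (its path is not given in the log).

```python
import ssl
from typing import Optional


def make_tls_context(verify: bool = True, cafile: Optional[str] = None) -> ssl.SSLContext:
    """Build a client TLS context; verify=False roughly mirrors `curl -k`."""
    # With cafile=None the system default trust store is loaded.
    ctx = ssl.create_default_context(cafile=cafile)
    if not verify:
        # Order matters: hostname checking must be switched off
        # before verify_mode can be set to CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

Disabling verification is only useful for confirming that the endpoint responds at all; pointing `cafile` at the right CA bundle is the option that keeps the check meaningful.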
[14:46:33] FIRING: [6x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:47:38] (03CR) 10CDobbins: "Understood. I think the part I'm struggling with is how to dynamically return a response using statically generated templates." [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [14:47:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2433.codfw.wmnet [14:47:48] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host mw2433.codfw.wmnet [14:47:53] (03CR) 10Volans: [C:03+2] mysql_legacy: fix Instance's upgrade path [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055244 (https://phabricator.wikimedia.org/T367496) (owner: 10Volans) [14:47:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T365998 - depooling db1195 - s1 db1202 - s7 db1203 - s8', diff saved to https://phabricator.wikimedia.org/P66816 and previous config saved to /var/cache/conftool/dbconfig/20240718-144754-arnaudb.json [14:47:59] T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998 [14:49:44] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9995033 (10ABran-WMF) data-persistence hosts handled, ready whenever you are @cmooney [14:52:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 25%: maintenance rescheduled', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240718-145157-arnaudb.json [14:52:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: maintenance rescheduled', diff saved to 
https://phabricator.wikimedia.org/P66817 and previous config saved to /var/cache/conftool/dbconfig/20240718-145214-arnaudb.json [14:52:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 25%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66818 and previous config saved to /var/cache/conftool/dbconfig/20240718-145232-arnaudb.json [14:53:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9995085 (10ayounsi) That's a great idea ! Happy to help if needed. `fixed-address sretest2001.codfw.wmnet;` this needs to be `fixed-address $some-ip-add... [14:54:11] (03Merged) 10jenkins-bot: mysql_legacy: fix Instance's upgrade path [software/spicerack] - 10https://gerrit.wikimedia.org/r/1055244 (https://phabricator.wikimedia.org/T367496) (owner: 10Volans) [14:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:57:22] I'm rolling the train to group1 again now. [14:57:55] oh, someone already did.. how nice. [14:58:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424 (10Ottomata) 03NEW [14:58:28] Thanks andre! 
[14:58:57] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host mw2433.codfw.wmnet [14:58:57] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2433 [14:59:19] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:05] dancy and andre: gettimeofday() says it's time for Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1500) [15:03:48] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2433 [15:07:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 50%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66819 and previous config saved to /var/cache/conftool/dbconfig/20240718-150708-arnaudb.json [15:07:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 50%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66820 and previous config saved to /var/cache/conftool/dbconfig/20240718-150720-arnaudb.json [15:07:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 50%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66821 and previous config saved to /var/cache/conftool/dbconfig/20240718-150737-arnaudb.json [15:08:31] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@cde3c31]: (no justification provided) [15:09:02] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@cde3c31]: (no justification provided) (duration: 00m 30s) [15:09:26] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055249 [15:10:10] (03CR) 10Btullis: [C:03+2] Increase the heap for the mapreduce history 
service on an-master1003 [puppet] - 10https://gerrit.wikimedia.org/r/1055176 (https://phabricator.wikimedia.org/T369278) (owner: 10Btullis) [15:10:37] 10ops-eqiad, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426 (10RobH) 03NEW [15:12:09] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2433 [15:12:17] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on cr[1-2]-codfw,ssw1-d[1,8]-codfw with reason: Move asw-c-codfw and asw-d-codfw CR uplinks [15:12:25] 10ops-eqiad, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#9995230 (10RobH) [15:12:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cr[1-2]-codfw,ssw1-d[1,8]-codfw with reason: Move asw-c-codfw and asw-d-codfw CR uplinks [15:12:42] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995231 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8062b5f0-d6f0-401c-9dfd-590a5facd0ad) set by cmooney@cumin... 
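(editor's note) The cookbook `END (PASS)` / `END (FAIL)` entries above follow a regular shape, so failed runs such as the repeated `sre.hosts.convert-disks` attempts on mw2433 can be pulled out of the log mechanically. A small sketch, assuming only the line format visible in this log; the names `CookbookResult` and `parse_cookbook_end` are illustrative and not part of any Wikimedia tooling.

```python
import re
from typing import NamedTuple, Optional

# Matches e.g. "END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99)"
END_RE = re.compile(
    r"END \((?P<status>PASS|FAIL)\) - Cookbook (?P<name>\S+) \(exit_code=(?P<code>\d+)\)"
)


class CookbookResult(NamedTuple):
    name: str
    status: str
    exit_code: int


def parse_cookbook_end(line: str) -> Optional[CookbookResult]:
    """Return the cookbook name, status and exit code for an END line, else None."""
    match = END_RE.search(line)
    if match is None:
        return None
    return CookbookResult(match["name"], match["status"], int(match["code"]))
```

Filtering a day's log for results with `status == "FAIL"` gives a quick list of runs worth a second look.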
[15:13:03] !log elukey@cumin1002 START - Cookbook sre.hosts.dhcp for host sretest2001.codfw.wmnet [15:14:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:17:02] !log disabling interface et-1/1/0 on cr1-codfw (facing asw-c-codfw) T366941 [15:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:06] T366941: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941 [15:19:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:35] !log disabling interface et-1/1/3 on cr1-codfw (facing asw-d-codfw) T366941 [15:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 75%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66822 and previous config saved to /var/cache/conftool/dbconfig/20240718-152213-arnaudb.json [15:22:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66823 and previous config saved to /var/cache/conftool/dbconfig/20240718-152225-arnaudb.json [15:22:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 75%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66824 and 
previous config saved to /var/cache/conftool/dbconfig/20240718-152243-arnaudb.json [15:23:24] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2433.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:23:29] 10ops-codfw, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429 (10RobH) 03NEW [15:26:21] 10ops-codfw, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#9995329 (10RobH) [15:27:26] (03PS1) 10Vgutierrez: hiera: Extend bwlim experiment to upload@ulsfo|eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1055252 (https://phabricator.wikimedia.org/T317799) [15:27:45] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055252 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [15:29:47] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2433.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:35:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest2001.codfw.wmnet [15:36:59] (03PS1) 10MVernon: Prepare for more new-style ms-be nodes [puppet] - 10https://gerrit.wikimedia.org/r/1055254 (https://phabricator.wikimedia.org/T368928) [15:37:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 100%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66825 and previous config saved to /var/cache/conftool/dbconfig/20240718-153718-arnaudb.json [15:37:22] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055254 (https://phabricator.wikimedia.org/T368928) (owner: 10MVernon) [15:37:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66826 and previous config saved to /var/cache/conftool/dbconfig/20240718-153731-arnaudb.json [15:37:33] FIRING: [4x] 
KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:37:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 100%: maintenance rescheduled', diff saved to https://phabricator.wikimedia.org/P66827 and previous config saved to /var/cache/conftool/dbconfig/20240718-153748-arnaudb.json [15:42:33] FIRING: [4x] KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:45:37] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:47:48] RESOLVED: [4x] KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:48:55] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [15:49:18] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:52:33] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [15:52:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:52:42] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#9995479 (10odimitrijevic) Thanks @Ottomata. Ftr, I approve the proposal. [15:56:58] (03PS2) 10Vgutierrez: hiera: Extend bwlim experiment to upload@ulsfo|eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1055252 (https://phabricator.wikimedia.org/T317799) [15:58:49] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055252 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [15:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:00:05] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
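(editor's note) The db1195, db1202 and db1203 repooling entries above step through 25%, 50%, 75% and 100% roughly fifteen minutes apart (14:52, 15:07, 15:22, 15:37). The sketch below only reproduces that timetable; `repool_schedule` is a hypothetical helper, the step count and interval are read off this log rather than from any tool's documentation, and nothing here calls dbctl.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def repool_schedule(
    start: datetime,
    steps: int = 4,
    interval: timedelta = timedelta(minutes=15),
) -> List[Tuple[datetime, int]]:
    """Return (time, pooled percentage) pairs for an evenly stepped repool."""
    return [
        (start + i * interval, round(100 * (i + 1) / steps))
        for i in range(steps)
    ]


if __name__ == "__main__":
    for when, pct in repool_schedule(datetime(2024, 7, 18, 14, 52)):
        print(f"{when:%H:%M} pool at {pct}%")
```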
[16:01:59] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:06:56] (03PS1) 10MVernon: Thanos: use new-style swift storage layout for forthcoming backends [puppet] - 10https://gerrit.wikimedia.org/r/1055255 (https://phabricator.wikimedia.org/T368445) [16:07:09] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on cloudsw1-b1-codfw.mgmt,pfw3-codfw with reason: bouncing line card on cr1-codfw [16:07:23] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055255 (https://phabricator.wikimedia.org/T368445) (owner: 10MVernon) [16:07:23] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cloudsw1-b1-codfw.mgmt,pfw3-codfw with reason: bouncing line card on cr1-codfw [16:07:38] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995538 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fdebcc6c-adaa-42f3-809d-4ec381a4798d) set by cmooney@cumin... 
[16:09:05] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1055255 (https://phabricator.wikimedia.org/T368445) (owner: 10MVernon) [16:10:35] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2433 [16:10:38] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2433.codfw.wmnet [16:10:48] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2433.codfw.wmnet [16:10:49] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host mw2433.codfw.wmnet [16:10:55] (03PS2) 10CDobbins: purged: revert use_pki flag [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) [16:11:24] (03PS3) 10CDobbins: purged: revert use_pki flag [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) [16:11:24] (03CR) 10CI reject: [V:04-1] purged: revert use_pki flag [puppet] - 10https://gerrit.wikimedia.org/r/1054918 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [16:11:54] (03PS1) 10Scott French: wmnet: direct appservers-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1055256 [16:11:55] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransw200[1-3].frack.codfw.wmnet - https://phabricator.wikimedia.org/T367800#9995545 (10Jhancock.wm) a:03Jhancock.wm [16:12:06] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#9995553 (10Jhancock.wm) [16:12:36] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995556 (10cmooney) [16:13:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2433.codfw.wmnet [16:13:57] 10ops-eqiad, 06SRE, 
06DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9995558 (10VRiley-WMF) That works for me. I'm fully available for this tomorrow morning! [16:20:57] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#9995584 (10Jhancock.wm) [16:20:58] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#9995582 (10Jhancock.wm) a:03Jhancock.wm @Dzahn heads up, rack D8 is currently a 1G rack, but we are in the process of upgrading that whole row to 10G. So it would be tempor... [16:21:02] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on cloudsw1-b1-codfw.mgmt,pfw3-codfw with reason: bouncing line card on cr1-codfw [16:21:06] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cloudsw1-b1-codfw.mgmt,pfw3-codfw with reason: bouncing line card on cr1-codfw [16:21:15] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995596 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1b177f94-1995-41ab-90b9-673cef9dbf94) set by cmooney@cumin... 
[16:23:48] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2433 [16:23:53] !log disable option 82 on lsw1-b7-codfw to test pxe boot issue [16:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [16:25:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [16:25:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2438 [16:26:00] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2438.codfw.wmnet [16:27:17] (03PS1) 10Cparle: Reduce weight of 'main subject' as it's used inconsistently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055258 (https://phabricator.wikimedia.org/T367774) [16:29:58] (03CR) 10Scott French: "Brandon, if you have cycles to review, that would be greatly appreciated." 
[dns] - 10https://gerrit.wikimedia.org/r/1055256 (owner: 10Scott French) [16:30:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [16:32:45] !log re-enable option 82 on lsw1-b7-codfw [16:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:23] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on ssw1-a1-codfw.mgmt with reason: bouncing line card on cr1-codfw [16:34:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on ssw1-a1-codfw.mgmt with reason: bouncing line card on cr1-codfw [16:34:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [16:34:48] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995659 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f32e4714-9c03-456e-bc05-238c01bacbca) set by cmooney@cumin... 
[16:35:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2438.codfw.wmnet [16:35:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host mw2438.codfw.wmnet [16:37:51] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2438.codfw.wmnet [16:39:30] !log resetting line card 1/1 on cr1-codfw (T366941) [16:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:47] T366941: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941 [16:41:40] (03PS3) 10JHathaway: expose_agent_certs: use ssldir exclusively [puppet] - 10https://gerrit.wikimedia.org/r/1054951 (https://phabricator.wikimedia.org/T367547) [16:42:26] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1054951 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [16:44:19] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:46:40] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9995727 (10cmooney) [16:52:50] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2438 [16:59:44] (03PS17) 10CDobbins: varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) [17:00:05] bd808: It is that lovely time of the day again! You are hereby commanded to deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1700) [17:01:40] nothing special for me to do today so my window is closed [17:04:07] (03CR) 10BCornwall: [C:03+1] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1055232 (owner: 10Ncmonitor) [17:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:43] Hi [17:09:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2438 [17:10:07] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2438.codfw.wmnet [17:10:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2438.codfw.wmnet [17:10:15] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2438 [17:11:04] Is it possible to cancel translations made through translatewiki.net, without waiting a week, because it will cause problems? [17:11:11] (03CR) 10BBlack: [C:03+1] "This is all kinda untested, but seems sane!" [dns] - 10https://gerrit.wikimedia.org/r/1055256 (owner: 10Scott French) [17:12:24] Gerges: what does "cancel" mean here? Is the translation already on this week's train? [17:12:48] (03CR) 10BCornwall: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1055231 (owner: 10Ncmonitor) [17:13:29] Translations were published on the train this week.
We would like to change them without waiting for next week [17:14:04] (03PS2) 10Scott French: wmnet: direct appservers-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1055256 (https://phabricator.wikimedia.org/T367949) [17:15:04] technically you could propose patches to the main branches that are affected and then ask for them to be backported to the released version. This is not a common practice to avoid edit wars with the TWN export bot, but if you have already gotten things fixed at TWN it seems possible. [17:15:10] Gerges: ^ [17:15:15] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2438 [17:15:18] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2438.codfw.wmnet [17:15:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2438.codfw.wmnet [17:15:25] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.convert-disks (exit_code=99) for host mw2438 [17:16:11] (03PS1) 10Cathal Mooney: Add new cr1-codfw sub-ints for rows c/d to DHCP relay and RA gen [homer/public] - 10https://gerrit.wikimedia.org/r/1055262 (https://phabricator.wikimedia.org/T366941) [17:16:44] Gerges: The train is blocked right now so this would be a great time to prepare the backport [17:16:54] Yes, I changed the translation from http://translatewiki.net, because this translation may cause some tools to break [17:17:32] (03CR) 10Cathal Mooney: [C:03+2] Add new cr1-codfw sub-ints for rows c/d to DHCP relay and RA gen [homer/public] - 10https://gerrit.wikimedia.org/r/1055262 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [17:17:52] Do I make a patch in the master branch now ? 
[17:18:01] (03Merged) 10jenkins-bot: Add new cr1-codfw sub-ints for rows c/d to DHCP relay and RA gen [homer/public] - 10https://gerrit.wikimedia.org/r/1055262 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [17:18:08] Gerges: Yes [17:19:17] Gerges: meet dancy who is this week's deployment train conductor. :) [17:19:19] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:44] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2438 [17:20:47] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2438.codfw.wmnet [17:20:49] Hi dancy [17:20:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2438.codfw.wmnet [17:24:42] Gerges: I'm stepping out for a break. Lemme know when the fixes are merged into master (and point me to them) and we can figure out where to go from there. [17:24:49] !log making cr1-codfw interfaces connecting ssw1-d1-codfw VRRP master for row c & d vlans T366941 [17:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:52] T366941: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941 [17:26:17] dancy: I changed the translation messages for the names of the Arabic months, and this may cause problems with some tools and bots [17:26:57] Not I, I mean someone [17:27:36] Nod. I saw your message about that on the train blockers task. 
[17:28:49] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.convert-disks (exit_code=97) for host mw2438 [17:29:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.convert-disks for host mw2438 [17:29:06] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw2438.codfw.wmnet [17:29:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts mw2438.codfw.wmnet [17:35:51] 10SRE-swift-storage, 10ConfirmEdit (CAPTCHA extension), 06Editing-team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by (output started at /srv/mediawiki/php-1.43.0-wmf.11/includes/libs/http/MultiHttpCli... - https://phabricator.wikimedia.org/T369186#9995940 [17:37:14] (03CR) 10Dzahn: [C:03+2] cache::text: remove git.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006979 (https://phabricator.wikimedia.org/T323073) (owner: 10Dzahn) [17:38:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host mw2438.mgmt.codfw.wmnet with reboot policy GRACEFUL [17:38:33] dancy: this patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1055264 [17:41:37] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by (output started at /srv/mediawiki/php-1.43.0-wmf.11/includes/libs/http/MultiHttpClient.ph... 
- https://phabricator.wikimedia.org/T369186#9995966 [17:43:49] !log disabling cr2-codfw port et-1/1/0 connecting to asw-c-codfw T366941 [17:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:53] T366941: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941 [17:43:54] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:45:37] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2136', diff saved to https://phabricator.wikimedia.org/P66829 and previous config saved to /var/cache/conftool/dbconfig/20240718-174547-root.json [17:53:08] (03PS1) 10Cathal Mooney: Move RA generation and dhcp relay to ssw1-d facing ports [homer/public] - 10https://gerrit.wikimedia.org/r/1055266 (https://phabricator.wikimedia.org/T366941) [17:54:39] (03CR) 10Scott French: "Thank you for the review!" 
[dns] - 10https://gerrit.wikimedia.org/r/1055256 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [17:55:05] (03CR) 10Scott French: [C:03+2] wmnet: direct appservers-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1055256 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [17:56:20] (03CR) 10Cathal Mooney: [C:03+2] Move RA generation and dhcp relay to ssw1-d facing ports [homer/public] - 10https://gerrit.wikimedia.org/r/1055266 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [17:56:49] (03Merged) 10jenkins-bot: Move RA generation and dhcp relay to ssw1-d facing ports [homer/public] - 10https://gerrit.wikimedia.org/r/1055266 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [18:00:05] dancy and andre: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T1800). [18:01:17] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1001.eqiad.wmnet [18:01:24] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.vrts.upgrade (exit_code=99) on VRTS host vrts1001.eqiad.wmnet [18:03:11] !log appservers-ro.discovery.wmnet now resolves to failoid - T367949 [18:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:15] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [18:06:11] Why is it failing here https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-php74/7047/console? [18:07:05] https://www.irccloud.com/pastebin/hhJZalIo/ [18:07:42] (03PS1) 10Scott French: wmnet: direct api-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1055268 (https://phabricator.wikimedia.org/T367949) [18:08:29] (03CR) 10Dzahn: "Hi, I was wondering what I should do next. Is it reasonable to add those to puppet deploy window? 
Should I ask you to deploy? Is the expec" [puppet] - 10https://gerrit.wikimedia.org/r/1053400 (https://phabricator.wikimedia.org/T367014) (owner: 10Dzahn) [18:09:53] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9996123 (10Papaul) [18:10:26] (03PS2) 10Scott French: wmnet: direct api-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1055268 (https://phabricator.wikimedia.org/T367949) [18:10:48] (03CR) 10Bearloga: "Just want to note that currently the requested values are:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055228 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [18:11:39] (03CR) 10BBlack: [C:03+1] wmnet: direct api-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1055268 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [18:12:09] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9996125 (10Papaul) 05Open→03Resolved @Marostegui this is ready for you. All the server are on 10G.I hope we fix this pxe boot issue on the 10G before we think... [18:13:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9996129 (10Papaul) ok +1 for /25 so we all okay thanks [18:13:23] (03CR) 10Scott French: "Thank you!" [dns] - 10https://gerrit.wikimedia.org/r/1055268 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [18:13:27] (03CR) 10Scott French: [C:03+2] wmnet: direct api-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1055268 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [18:14:30] dancy: I don't understand what this error is, can you explain?
[18:15:41] (03PS5) 10Dzahn: mediawiki/geoip: make loading geoip data from puppetserver optional [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) [18:16:06] (03PS6) 10Dzahn: mediawiki/geoip: make loading geoip data from puppetserver optional [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) [18:16:31] (03CR) 10Dzahn: "Done. Removed puppet 7 reference from the message." [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [18:16:51] (03PS1) 10AOkoth: vrts: fix runuser errors [cookbooks] - 10https://gerrit.wikimedia.org/r/1055270 (https://phabricator.wikimedia.org/T366078) [18:17:10] !log api-ro.discovery.wmnet now resolves to failoid - T367949 [18:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:14] T367949: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949 [18:18:23] Gerges: In short, the changes made in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1054298 need to be reverted. [18:20:49] If possible, this patch is reverted, and then we merge my patch [18:21:05] You should include the fix in your patch. [18:21:20] Reverting the other one by itself won't pass CI either. [18:21:37] (03CR) 10AOkoth: "Merging and retrying before maintenance window closes." [cookbooks] - 10https://gerrit.wikimedia.org/r/1055270 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:21:42] (unless forced through, but I don't want to do that). We want a single, working commit. [18:21:47] (03CR) 10AOkoth: [C:03+2] vrts: fix runuser errors [cookbooks] - 10https://gerrit.wikimedia.org/r/1055270 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:22:14] I can attempt to update your change for you if that will help. 
[18:24:29] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Spin down api_appserver and appserver clusters - https://phabricator.wikimedia.org/T367949#9996177 (10Scott_French) `appservers-ro.discovery.wmnet` and `api-ro.discovery.wmnet` now resolve to failoid, by way of manually updating their `DYNA` record... [18:25:45] (03Merged) 10jenkins-bot: vrts: fix runuser errors [cookbooks] - 10https://gerrit.wikimedia.org/r/1055270 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:26:03] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452 (10RobH) 03NEW [18:26:27] dancy: I updated the patch [18:26:34] Taking a look [18:26:36] 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#9996203 (10RobH) [18:27:06] I haven't opened a ticket in phabricator, can I open a ticket now? [18:27:12] Absolutely! 
[18:27:22] (03PS1) 10Dwisehaupt: crm: decrease mariadb innodb_buffer_pool_size [puppet] - 10https://gerrit.wikimedia.org/r/1055272 (https://phabricator.wikimedia.org/T343486) [18:27:54] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1001.eqiad.wmnet [18:28:05] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453 (10RobH) 03NEW [18:28:22] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#9996221 (10RobH) [18:28:57] (03CR) 10Brennen Bearnes: [C:03+1] mediawiki/geoip: make loading geoip data from puppetserver optional [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [18:30:13] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.vrts.upgrade (exit_code=99) on VRTS host vrts1001.eqiad.wmnet [18:30:18] dancy: ^ [18:32:10] Hmm? [18:33:11] dancy: yw (was bored this morning so I deployed to group1 :P ) [18:33:22] Very much appreciated [18:33:30] After the patch update is published, what should I do, should I open a ticket in phabricator or explain it in a patch [18:34:27] Please create a phabricator ticket right now, add the explanatory notes, then edit the commit message of your change to reference it and re-push. [18:34:56] cf https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines and https://www.mediawiki.org/wiki/Gerrit/Tutorial#Amending_a_change_.28your_own_or_someone_else.27s.29 [18:35:21] Gerges ^ [18:35:31] I'll merge after CI passes, and I'll cherry-pick down to the wmf/1.43.0-wmf.14 branch and deploy it for you. 
[18:36:41] (03CR) 10JHathaway: admin: add dcops to the system adm POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [18:38:00] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#9996286 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member ge-0/0/15; - member ge-1/0/15; [edit interfaces interface-ran... [18:41:12] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9996287 (10cmooney) 05Open→03Resolved Work completed, traffic is currently bridged through the two spine switches over the AEs... [18:41:30] (03PS1) 10Scardenasmolinar: Fix guard clause in Revision Hook Handler and Precheck [extensions/AutoModerator] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055274 (https://phabricator.wikimedia.org/T370161) [18:44:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#9996322 (10Jhancock.wm) ++ for /25 from me as well [18:45:39] Gerges: Feel free to complain in the ticket about the fact that test-breaking changes were allowed to be merged by l10n-bot. [18:45:52] which I think is a root cause. [18:46:07] (03PS1) 10Ebernhardson: Produce a limited set of event streams on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) [18:47:29] (03CR) 10Ebernhardson: [C:04-2] "Do not merge until the dependent patches are deployed and unlikely to be rolled back." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [18:47:52] dancy: I created a ticket T370456, is there something missing? [18:47:53] T370456: Recovering the names of the Arabic months - https://phabricator.wikimedia.org/T370456 [18:48:06] Nope, that's just right. I'm about to +2 your change [18:49:12] Gerges: The full process will take a while. I'll let you know when it is deployed. [18:49:49] OK, take your time [18:50:03] It's the computers taking their time. :-) [18:50:11] (03PS2) 10Ebernhardson: Produce a limited set of event streams on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) [18:52:49] (03PS1) 10Papaul: Add fransc2001 to dns file [dns] - 10https://gerrit.wikimedia.org/r/1055278 [18:53:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9996358 (10cmooney) [18:56:37] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9996362 (10cmooney) GNMI stats proved very helpful to keep an eye on the bandwidth shifting around {F56509244 width=600} {F56509... 
[18:57:40] (03PS1) 10Jdlrobson: Fixes client preferences error [extensions/MobileFrontend] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055279 (https://phabricator.wikimedia.org/T370441) [18:57:45] (03CR) 10JHathaway: [C:03+2] expose_agent_certs: use ssldir exclusively [puppet] - 10https://gerrit.wikimedia.org/r/1054951 (https://phabricator.wikimedia.org/T367547) (owner: 10JHathaway) [18:59:19] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:04:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/AutoModerator] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055274 (https://phabricator.wikimedia.org/T370161) (owner: 10Scardenasmolinar) [19:05:24] (03PS1) 10Ahmon Dancy: [i18n] Change the names of the Arabic months [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055280 (https://phabricator.wikimedia.org/T370456) [19:06:10] (03PS7) 10Dzahn: mediawiki/geoip: make loading geoip data from puppetserver optional [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) [19:07:54] !log dancy@deploy1002 Installing scap version "4.93.0" for 232 hosts [19:09:27] (03PS1) 10Cathal Mooney: Enable BGP between cr1-codfw and ssw1-d1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1055283 (https://phabricator.wikimedia.org/T369274) [19:10:32] (03CR) 10Cathal Mooney: [C:03+2] Enable BGP between cr1-codfw and ssw1-d1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1055283 (https://phabricator.wikimedia.org/T369274) (owner: 10Cathal Mooney) [19:11:14] (03Merged) 10jenkins-bot: Enable BGP between cr1-codfw and ssw1-d1-codfw 
[homer/public] - 10https://gerrit.wikimedia.org/r/1055283 (https://phabricator.wikimedia.org/T369274) (owner: 10Cathal Mooney) [19:12:25] !log enabling BGP session from cr1-codfw to ssw1-d1-codfw [19:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:26] (03CR) 10Dzahn: "compiling on entire class geoip - https://puppet-compiler.wmflabs.org/output/1026193/3313/" [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [19:18:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055280 (https://phabricator.wikimedia.org/T370456) (owner: 10Ahmon Dancy) [19:18:38] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#9996563 (10VRiley-WMF) Attempted bare minimum setup (CPU1, A1 RAM, no additional cards) no change. Attempted swapping out the power button module still no change. 
Will attempt swapping out MB [19:23:11] (03CR) 10Cathal Mooney: Add monitoring checks for codfw row D spines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055169 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [19:23:12] (03CR) 10Cathal Mooney: [C:03+2] Add monitoring checks for codfw row D spines [puppet] - 10https://gerrit.wikimedia.org/r/1055169 (https://phabricator.wikimedia.org/T366941) (owner: 10Cathal Mooney) [19:25:36] (03CR) 10Dzahn: [V:03+1] "compile on profile::mediawiki::common: https://puppet-compiler.wmflabs.org/output/1026193/3314/" [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [19:27:25] (03CR) 10Dzahn: [V:03+1] "the error on deploy1002 is unrelated - already fails before this patch" [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [19:31:37] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1026193/3313/" [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [19:31:59] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on ssw1-a[1,8]-codfw.mgmt,ssw1-d[1,8]-codfw.mgmt with reason: Migrate codfw row c and d IP GWs from CRs to Spines [19:32:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ssw1-a[1,8]-codfw.mgmt,ssw1-d[1,8]-codfw.mgmt with reason: Migrate codfw row c and d IP GWs from CRs to Spines [19:32:20] 06SRE, 06Infrastructure-Foundations, 10netops: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274#9996630 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d6a640fd-d19e-4aa8-930d-6c260b51a4c3) set by cmooney@cumin1002 for 3:00:00 on 4 ho... 
[19:34:26] !log disable BGP between spine switches in rows A and row D prior to enabling IP GW (T369274)
[19:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:30] T369274: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274
[19:37:08] !log Send SIGQUIT signal to the benthos service after a goroutine was waiting forever in webrequest_live.yaml - T369256
[19:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:12] T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic - https://phabricator.wikimedia.org/T369256
[19:37:14] !log add IRB int on public1-c-codfw vlan to ssw1-d1-codfw and ssw1-d8-codfw T369274
[19:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:28] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[19:39:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367856)', diff saved to https://phabricator.wikimedia.org/P66831 and previous config saved to /var/cache/conftool/dbconfig/20240718-193927-marostegui.json
[19:39:32] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[19:41:14] (CR) Andrew Bogott: [C:+2] Beta: update deployment-deploy04 IP [puppet] - https://gerrit.wikimedia.org/r/1053995 (owner: Thcipriani)
[19:42:00] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new IRB interfaces codfw - cmooney@cumin1002"
[19:43:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for new IRB interfaces codfw - cmooney@cumin1002"
[19:43:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:43:24] (Merged) jenkins-bot: [i18n] Change
the names of the Arabic months [core] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055280 (https://phabricator.wikimedia.org/T370456) (owner: Ahmon Dancy)
[19:43:54] !log dancy@deploy1002 Started scap sync-world: Backport for [[gerrit:1055280|[i18n] Change the names of the Arabic months (T370456)]]
[19:43:58] T370456: Recovering the names of the Arabic months - https://phabricator.wikimedia.org/T370456
[19:46:43] !log dancy@deploy1002 dancy: Backport for [[gerrit:1055280|[i18n] Change the names of the Arabic months (T370456)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[19:47:01] !log dancy@deploy1002 dancy: Continuing with sync
[19:47:20] dancy: Is it deployed on mwdebug?
[19:47:30] Gerges: Yep!
[19:49:55] dancy: Everything is fine on mwdebug
[19:50:05] Excellent!
[19:50:59] Close the ticket now?
[19:51:17] Full deployment is still happening. We'll get a notification when it's done
[19:51:28] ~5-10 minutes
[19:51:31] Then close.
[19:51:55] Ok
[19:54:17] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1055280|[i18n] Change the names of the Arabic months (T370456)]] (duration: 10m 23s)
[19:54:22] T370456: Recovering the names of the Arabic months - https://phabricator.wikimedia.org/T370456
[19:54:39] Success!
[19:54:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P66832 and previous config saved to /var/cache/conftool/dbconfig/20240718-195434-marostegui.json
[19:55:49] Thanks
[19:56:25] (PS1) Ahmon Dancy: .gitignore: Claim all of /php-*/ for "scap prep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055286
[19:58:29] (CR) TrainBranchBot: [C:+2] "Approved by dancy@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055279 (https://phabricator.wikimedia.org/T370441) (owner: Jdlrobson)
[19:58:33] (PS1) JHathaway: pcc-db1002: add hiera data [puppet] - https://gerrit.wikimedia.org/r/1055287 (https://phabricator.wikimedia.org/T367547)
[19:59:04] (CR) JHathaway: [C:+2] pcc-db1002: add hiera data [puppet] - https://gerrit.wikimedia.org/r/1055287 (https://phabricator.wikimedia.org/T367547) (owner: JHathaway)
[19:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240718T2000). Please do the needful.
[20:00:04] katherine_g: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:19] here!
[20:00:25] i can deploy!
[20:00:33] thanks
[20:01:58] (CR) Andrew Bogott: [C:+1] "yes please! I have already worked around this a few times."
[puppet] - https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: Dzahn)
[20:01:59] oh - maybe i can't - there seems to be a scap lock
[20:02:14] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[20:03:06] katherine_g: looks like there might be something underway - might have to wait a few minutes
[20:03:23] ok, no worries, I'll be here
[20:03:50] cool
[20:04:49] !log enabling IPv6 RA generation for public1-c-codfw on ssw1-d1-codfw and ssw1-d8-codfw T369274
[20:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:53] dancy: looks like your backport is still in progress - is it ok to start the backport window once you're done?
[20:05:03] T369274: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274
[20:05:18] Hello. I'm handling a train-blocking backport at the moment. Then I intend to roll the train forward, then I'll hand over to you.
[20:05:38] sounds good thanks!
katherine_g ^^
[20:06:10] thanks for update
[20:07:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:09:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P66833 and previous config saved to /var/cache/conftool/dbconfig/20240718-200941-marostegui.json
[20:18:12] !log disable IPv6 RA generation for public1-c-codfw on cr1-codfw and cr2-codfw T369274
[20:19:21] SRE, collaboration-services, serviceops: add bullseye support to deployment server puppet role - upgrade deployment server in devtools - https://phabricator.wikimedia.org/T363415#9996755 (Dzahn) This latest merge today finally fixed the puppet runs on `deploy-1006` (T370436)
[20:20:07] (CR) Dzahn: [V:+1 C:+2] "@Andrew :) I confirm it works.
I have set Hiera key "profile::mediawiki::common::load_geoip_data_from_puppetserver" to false in Horizon an" [puppet] - https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: Dzahn)
[20:22:41] (CR) Dzahn: [C:+2] crm: decrease mariadb innodb_buffer_pool_size [puppet] - https://gerrit.wikimedia.org/r/1055272 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[20:24:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367856)', diff saved to https://phabricator.wikimedia.org/P66835 and previous config saved to /var/cache/conftool/dbconfig/20240718-202449-marostegui.json
[20:24:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[20:24:57] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[20:25:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[20:25:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T367856)', diff saved to https://phabricator.wikimedia.org/P66836 and previous config saved to /var/cache/conftool/dbconfig/20240718-202511-marostegui.json
[20:25:22] (CR) CI reject: [V:-1] Fixes client preferences error [extensions/MobileFrontend] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055279 (https://phabricator.wikimedia.org/T370441) (owner: Jdlrobson)
[20:28:33] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475 (cmooney) NEW p:Triage→Medium
[20:29:47] (CR) Ahmon Dancy: [C:+2] Fixes client preferences error [extensions/MobileFrontend] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055279 (https://phabricator.wikimedia.org/T370441) (owner: Jdlrobson)
[20:29:48] (CR) Dwisehaupt: "Thanks. The config was updated with a puppet run and I have restarted mariadb. Looks good." [puppet] - https://gerrit.wikimedia.org/r/1055272 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[20:35:36] (PS18) CDobbins: varnish: show better error for 429s [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718)
[20:36:50] (CR) Krinkle: [C:+1] .gitignore: Claim all of /php-*/ for "scap prep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055286 (owner: Ahmon Dancy)
[20:37:23] (CR) Ahmon Dancy: [C:+2] .gitignore: Claim all of /php-*/ for "scap prep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055286 (owner: Ahmon Dancy)
[20:38:11] (Merged) jenkins-bot: .gitignore: Claim all of /php-*/ for "scap prep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055286 (owner: Ahmon Dancy)
[20:40:12] (CR) Ahmon Dancy: [V:+2 C:+2] Fixes client preferences error [extensions/MobileFrontend] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055279 (https://phabricator.wikimedia.org/T370441) (owner: Jdlrobson)
[20:41:01] !log dancy@deploy1002 Started scap sync-world: Backport for [[gerrit:1055279|Fixes client preferences error (T370441)]]
[20:41:06] T370441: Mobile Dark Mode always on in Beta - https://phabricator.wikimedia.org/T370441
[20:43:54] !log dancy@deploy1002 dancy, jdlrobson: Backport for [[gerrit:1055279|Fixes client preferences error (T370441)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:44:05] Jdlrobson: ^^
[20:47:30] !log dancy@deploy1002 dancy, jdlrobson: Continuing with sync
[20:47:52] (CR) Ahmon Dancy: [C:+2] Fix guard clause in Revision Hook Handler and Precheck [extensions/AutoModerator] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055274 (https://phabricator.wikimedia.org/T370161) (owner: Scardenasmolinar)
[20:48:18] cjming: I just +2'd
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AutoModerator/+/1055274
[20:48:31] dancy: thanks!
[20:48:49] Thanks!
[20:49:57] !log remove VRRP for public1-c-codfw vlan from cr1-codfw and cr2-codfw T369274
[20:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:02] T369274: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274
[20:52:00] dancy: do you want to lmk when it's ok to scap? not sure if you're done
[20:52:24] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1055279|Fixes client preferences error (T370441)]] (duration: 11m 22s)
[20:52:28] T370441: Mobile Dark Mode always on in Beta - https://phabricator.wikimedia.org/T370441
[20:52:56] cjming: I'm going to roll the train now and then hand over to you
[20:53:09] (PS1) TrainBranchBot: group2 to 1.43.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/1055293 (https://phabricator.wikimedia.org/T366959)
[20:53:14] (CR) TrainBranchBot: [C:+2] group2 to 1.43.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/1055293 (https://phabricator.wikimedia.org/T366959) (owner: TrainBranchBot)
[20:53:23] great thanks
[20:53:33] cjming: Or I can handle the backport if you want to be released
[20:53:51] (Merged) jenkins-bot: group2 to 1.43.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/1055293 (https://phabricator.wikimedia.org/T366959) (owner: TrainBranchBot)
[20:55:12] dancy: that would be great if you don't mind - i have to run soon
[20:55:24] OK.
Be free
[21:01:05] (Merged) jenkins-bot: Fix guard clause in Revision Hook Handler and Precheck [extensions/AutoModerator] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055274 (https://phabricator.wikimedia.org/T370161) (owner: Scardenasmolinar)
[21:01:31] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.14 refs T366959
[21:01:35] T366959: 1.43.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T366959
[21:02:18] Perfect timing
[21:02:19] !log dancy@deploy1002 Started scap sync-world: Backport for [[gerrit:1055274|Fix guard clause in Revision Hook Handler and Precheck (T370161)]]
[21:02:23] T370161: "Call to a member function equals() on null" when rolling back a change with a suppressed username - https://phabricator.wikimedia.org/T370161
[21:04:38] !log dancy@deploy1002 suecarmol, dancy: Backport for [[gerrit:1055274|Fix guard clause in Revision Hook Handler and Precheck (T370161)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:04:42] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:05:44] katherine_g: Ready for testing
[21:05:52] ok thanks
[21:07:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:08:37] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad
[21:08:40] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad
[21:09:02] I'm good to sync
[21:09:16] ok, proceeding
[21:09:22] !log dancy@deploy1002
suecarmol, dancy: Continuing with sync
[21:12:55] (CR) Volans: "FYI as mentioned in an earlier CR, you can test a cookbook CR with https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Test_before_mer" [cookbooks] - https://gerrit.wikimedia.org/r/1055270 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[21:13:22] (PS1) AOkoth: vrts: use curl with -x flag for proxy [cookbooks] - https://gerrit.wikimedia.org/r/1055297 (https://phabricator.wikimedia.org/T366078)
[21:14:21] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1055274|Fix guard clause in Revision Hook Handler and Precheck (T370161)]] (duration: 12m 02s)
[21:14:25] T370161: "Call to a member function equals() on null" when rolling back a change with a suppressed username - https://phabricator.wikimedia.org/T370161
[21:14:37] katherine_g: Deployed
[21:14:43] thanks!
[21:15:54] (CR) Ottomata: Produce a limited set of event streams on private wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson)
[21:17:23] (CR) AOkoth: [C:+2] "I think I missed this. Thank you." [cookbooks] - https://gerrit.wikimedia.org/r/1055270 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[21:21:16] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9996920 (Papaul) @Jclark-ctr I checked 1004 PXE boot was set on both the 1G and 10G I disable it on the 1G you should be good now. You can check the oth...
[21:21:58] !log enable IPv6 RA generation on ssw1-d1-codfw and ssw1-d8-codfw for public1-d-codfw vlan T369274
[21:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:02] T369274: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274
[21:23:17] (CR) Papaul: [C:+2] Add fransc2001 to dns file [dns] - https://gerrit.wikimedia.org/r/1055278 (owner: Papaul)
[21:25:07] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, Patch-For-Review: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#9996928 (Papaul)
[21:26:31] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, Patch-For-Review: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#9996930 (Papaul) a:Papaul→Dwisehaupt @Dwisehaupt switch port setup and DNS entries done. All yours
[21:34:15] ops-eqiad, SRE, DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9996951 (Papaul) @VRiley-WMF can you provide me with the rack and switch port the server is in and connected to right now and the rack and switch port where the server will be moved to on the task...
[21:36:12] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install fransc2001 - https://phabricator.wikimedia.org/T367816#9996966 (Dwisehaupt) Thanks.
[21:39:39] !log disable IPv6 RA generation on cr1-codfw and cr2-codfw for public1-d-codfw vlan T369274
[21:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:44] T369274: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274
[21:41:37] SRE, Data-Engineering, Data-Platform-SRE, serviceops-radar, Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#9996973 (bking) >>!
In T276088#8443651, @Ottomata wrote: > To do the ACLs right we also need some authentication for Kafka. We ca...
[21:43:54] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[21:57:00] !log bking@cumin2002 START - Cookbook sre.elasticsearch.force-shard-allocation
[21:57:06] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[21:58:40] !log remove VRRP group on cr1-codfw and cr2-codfw for public1-d-codfw vlan T369274
[21:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:44] T369274: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274
[22:03:33] !log move GW IPs for public1-d-codfw vlan to ssw1-d1-codfw and ssw1-d8-codfw T369274
[22:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:52] (PS3) Ebernhardson: Produce a limited set of event streams on private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046)
[22:12:52] (CR) Ebernhardson: Produce a limited set of event streams on private wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson)
[22:13:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:15:16] !log add IP interfaces for private1-c-codfw vlan to ssw1-d1-codfw and ssw1-d8-codfw
[22:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:19:33] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on elastic1100.eqiad.wmnet with reason: catch up on indexing
[22:19:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic1100.eqiad.wmnet with reason: catch up on indexing
[22:23:39] RESOLVED: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1101-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:33:42] !log Disable IPv6 RA generation for private1-c-codfw vlan on cr1-codfw and cr2-codfw T369274
[22:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:46] T369274: Move IP gateways for codfw row C/D vlans to EVPN Anycast GW - https://phabricator.wikimedia.org/T369274
[22:38:54] ops-eqiad, SRE, DC-Ops: 10gbit nic option for centrallog1002 - https://phabricator.wikimedia.org/T369825#9997128 (VRiley-WMF) @Papaul Current location is Rack: B 1 U 36 Proposed Location Rack: B4 U 20
[22:41:59] ops-eqiad, SRE, DC-Ops: 10gbit nic option for
centrallog1002 - https://phabricator.wikimedia.org/T369825#9997137 (VRiley-WMF) Forgot to add the switch ports. Port: Current switch port 36 Proposed switchport 32
[22:43:14] ops-eqiad, SRE, DBA, DC-Ops: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855#9997141 (VRiley-WMF) After swapping the Mainboard, it was finally able to boot. It currently is having issues with RAM which I will continue to troubleshoot tomorrow.
[22:45:37] FIRING: [6x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable