[00:06:38] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:08:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P75247 and previous config saved to /var/cache/conftool/dbconfig/20250418-000838-fceratto.json [00:10:46] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1137375 [00:10:46] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1137375 (owner: TrainBranchBot) [00:13:30] RECOVERY - Disk space on idp-test2005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=idp-test2005&var-datasource=codfw+prometheus/ops [00:17:40] SRE, Wikimedia-Mailing-lists, Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10754187 (thcipriani) Thanks @Dzahn ! I can see the list https://lists.wikimedia.org/postorius/lists/patchdemo.lists.wikimedia.org/ but I don't see it under https://list... [00:22:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:22:54] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out.
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:22:54] PROBLEM - OpenSearch health check for shards on 9400 on cirrussearch2097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:23:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T391056)', diff saved to https://phabricator.wikimedia.org/P75248 and previous config saved to /var/cache/conftool/dbconfig/20250418-002344-fceratto.json [00:23:49] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [00:24:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2172.codfw.wmnet with reason: Maintenance [00:24:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T391056)', diff saved to https://phabricator.wikimedia.org/P75249 and previous config saved to /var/cache/conftool/dbconfig/20250418-002408-fceratto.json [00:24:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [00:26:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1137349 (https://phabricator.wikimedia.org/T392239) (owner: Robertsky) [00:29:18] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1137375 (owner: TrainBranchBot) [00:30:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T391056)', diff saved to https://phabricator.wikimedia.org/P75250 and previous config saved to /var/cache/conftool/dbconfig/20250418-003016-fceratto.json [00:30:21] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [00:45:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P75251 and previous config saved to /var/cache/conftool/dbconfig/20250418-004524-fceratto.json [00:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:00:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P75252 and previous config saved to /var/cache/conftool/dbconfig/20250418-010030-fceratto.json [01:01:36] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/e9d907110525893c9ea5dd22b9bbafb2d7eb39057804478f32b27dd8a46ea1e3/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:03:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T391056)', diff saved to https://phabricator.wikimedia.org/P75253 and previous config saved to /var/cache/conftool/dbconfig/20250418-011536-fceratto.json [01:15:41] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [01:15:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2179.codfw.wmnet with reason: Maintenance [01:15:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T391056)', diff saved to https://phabricator.wikimedia.org/P75254 and previous config saved to /var/cache/conftool/dbconfig/20250418-011558-fceratto.json [01:21:36] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:22:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T391056)', diff saved to https://phabricator.wikimedia.org/P75255 and previous config saved to /var/cache/conftool/dbconfig/20250418-012207-fceratto.json [01:22:12] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [01:37:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P75256 and previous config saved to /var/cache/conftool/dbconfig/20250418-013714-fceratto.json [01:45:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:52:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P75257 and previous config saved to /var/cache/conftool/dbconfig/20250418-015221-fceratto.json [01:53:40] FIRING: [2x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:05:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:07:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T391056)', diff saved to https://phabricator.wikimedia.org/P75258 and previous config saved to /var/cache/conftool/dbconfig/20250418-020728-fceratto.json [02:07:32] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [02:07:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2199.codfw.wmnet with reason: Maintenance [02:11:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2206.codfw.wmnet with reason: Maintenance [02:11:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T391056)', diff saved to https://phabricator.wikimedia.org/P75259 and previous config saved to /var/cache/conftool/dbconfig/20250418-021122-fceratto.json [02:11:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over 
threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:16:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:16:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T391056)', diff saved to https://phabricator.wikimedia.org/P75260 and previous config saved to /var/cache/conftool/dbconfig/20250418-021655-fceratto.json [02:16:59] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [02:32:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P75261 and previous config saved to /var/cache/conftool/dbconfig/20250418-023202-fceratto.json [02:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:47:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P75262 and previous config saved to /var/cache/conftool/dbconfig/20250418-024709-fceratto.json [02:56:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:02:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T391056)', diff saved to https://phabricator.wikimedia.org/P75263 and previous config saved to /var/cache/conftool/dbconfig/20250418-030216-fceratto.json [03:02:21] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [03:02:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2210.codfw.wmnet with reason: Maintenance [03:02:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T391056)', diff saved to https://phabricator.wikimedia.org/P75264 and previous config saved to /var/cache/conftool/dbconfig/20250418-030239-fceratto.json [03:07:18] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:08:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:08:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T391056)', diff saved to https://phabricator.wikimedia.org/P75265 and previous config saved to /var/cache/conftool/dbconfig/20250418-030820-fceratto.json [03:08:25] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [03:15:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:23:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P75266 and 
previous config saved to /var/cache/conftool/dbconfig/20250418-032327-fceratto.json [03:38:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P75267 and previous config saved to /var/cache/conftool/dbconfig/20250418-033834-fceratto.json [03:41:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:46:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:51:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T391056)', diff saved to https://phabricator.wikimedia.org/P75268 and previous config saved to /var/cache/conftool/dbconfig/20250418-035342-fceratto.json [03:53:47] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 
[03:53:59] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2219.codfw.wmnet with reason: Maintenance [03:54:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T391056)', diff saved to https://phabricator.wikimedia.org/P75269 and previous config saved to /var/cache/conftool/dbconfig/20250418-035406-fceratto.json [04:00:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T391056)', diff saved to https://phabricator.wikimedia.org/P75270 and previous config saved to /var/cache/conftool/dbconfig/20250418-040001-fceratto.json [04:00:05] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [04:15:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P75271 and previous config saved to /var/cache/conftool/dbconfig/20250418-041508-fceratto.json [04:22:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [04:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:24:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [04:30:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P75272 and previous config saved to /var/cache/conftool/dbconfig/20250418-043015-fceratto.json [04:45:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T391056)', diff saved to https://phabricator.wikimedia.org/P75273 and previous config saved to /var/cache/conftool/dbconfig/20250418-044523-fceratto.json [04:45:27] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [04:45:39] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2236.codfw.wmnet with reason: Maintenance [04:45:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2236 (T391056)', diff saved to https://phabricator.wikimedia.org/P75274 and previous config saved to /var/cache/conftool/dbconfig/20250418-044545-fceratto.json [04:51:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T391056)', diff saved to https://phabricator.wikimedia.org/P75275 and previous config saved to /var/cache/conftool/dbconfig/20250418-045127-fceratto.json [04:51:32] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [05:03:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P75276 and previous config saved to /var/cache/conftool/dbconfig/20250418-050635-fceratto.json [05:06:42] FIRING: 
JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:21:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P75277 and previous config saved to /var/cache/conftool/dbconfig/20250418-052141-fceratto.json [05:30:18] (CR) Arnaudb: "thanks for your feedback!" [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb) [05:36:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T391056)', diff saved to https://phabricator.wikimedia.org/P75278 and previous config saved to /var/cache/conftool/dbconfig/20250418-053648-fceratto.json [05:36:54] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [05:37:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2237.codfw.wmnet with reason: Maintenance [05:37:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2237 (T391056)', diff saved to https://phabricator.wikimedia.org/P75279 and previous config saved to /var/cache/conftool/dbconfig/20250418-053713-fceratto.json [05:43:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T391056)', diff saved to https://phabricator.wikimedia.org/P75280 and previous config saved to /var/cache/conftool/dbconfig/20250418-054309-fceratto.json [05:43:13] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [05:53:40] FIRING: [2x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:55:48] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2124 MB (3% inode=95%): /tmp 2124 MB (3% inode=95%): /var/tmp 2124 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [05:58:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P75281 and previous config saved to /var/cache/conftool/dbconfig/20250418-055816-fceratto.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250418T0600) [06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:08:10] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:13:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P75282 and previous config saved to /var/cache/conftool/dbconfig/20250418-061324-fceratto.json [06:14:54] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:15:10] 
RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:19:14] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10754475 (ayounsi) a:ayounsi→None [06:28:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T391056)', diff saved to https://phabricator.wikimedia.org/P75283 and previous config saved to /var/cache/conftool/dbconfig/20250418-062830-fceratto.json [06:28:35] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [06:28:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2239.codfw.wmnet with reason: Maintenance [06:31:44] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10754484 (ayounsi) @robh, thanks for that task, well summed up ! One more point is that we need to account for expected overall growth of the number of servers to... 
[06:32:54] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:48] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2121 MB (3% inode=95%): /tmp 2121 MB (3% inode=95%): /var/tmp 2121 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [06:45:23] (PS3) Federico Ceratto: values.yaml: Update deployment for zarcillo in aux-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) [06:45:45] (CR) Federico Ceratto: [C:+1] "Copied votes on follow-up patch sets have been updated:" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto) [06:45:50] (PS4) Federico Ceratto: values.yaml: Update deployment for zarcillo in aux-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) [06:46:34] (PS5) Federico Ceratto: values.yaml: Update deployment for zarcillo in aux-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) [06:46:57] (CR) Federico Ceratto: "Rebased" [deployment-charts] - https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto) [06:52:13] ops-eqiad, SRE, DC-Ops, 
fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10754495 (wiki_willy) Hey @ayounsi - after some feedback from my staff meeting earlier today, I reached out to Equinix to see if there's any way we'd be able to ad... [06:53:20] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:53:20] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:54:14] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:54:14] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:58:00] (CR) Federico Ceratto: [C:+2] values.yaml: Update deployment for zarcillo in aux-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto) [06:59:19] (Merged) jenkins-bot: values.yaml: Update deployment for zarcillo in aux-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1137314 (https://phabricator.wikimedia.org/T384212) (owner: Federico Ceratto) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250418T0700) [07:02:18] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:02:18] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:03:14] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:03:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:15:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:22:07] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [07:22:08] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1184.eqiad.wmnet with OS bullseye [07:22:18] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10754508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1184.eqiad.wmnet with OS bulls... [07:27:13] SRE, Infrastructure-Foundations, netops: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769#10754513 (ayounsi) @BBlack ping ? 
:) [07:30:17] (CR) Ayounsi: [C:+1] Netbox: simplify query [alerts] - https://gerrit.wikimedia.org/r/1136989 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede) [07:45:36] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:52:32] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:56:56] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1182.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:57:20] (CR) Clément Goubert: [C:+2] dumps enterprise copy update [puppet] - https://gerrit.wikimedia.org/r/1135522 (owner: Creynolds) [07:57:55] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1182.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [07:58:58] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1182.eqiad.wmnet with OS bullseye [07:59:06] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10754514 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1182.eqiad.wmnet with OS b... 
[08:14:36] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1182.eqiad.wmnet with reason: host reimage [08:18:07] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1182.eqiad.wmnet with reason: host reimage [08:22:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [08:23:28] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:23:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:24:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:30:06] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [08:30:11] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: Use LAG interface MAC address field to store LACP system-id for MC-LAG - https://phabricator.wikimedia.org/T392056#10754547 (10ayounsi) Some other thoughts : * There is also a netbox plugin, but that doesn't seen like a great way : https://github.com/pv... 
[08:35:04] (03PS1) 10Superpes15: [u4cwiki] Add signature button to edit toolbar in Case namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137428 (https://phabricator.wikimedia.org/T392286) [08:36:17] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [ms-fe1015] - vriley@cumin1002" [08:36:22] 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10754568 (10ayounsi) For the record, as mentioned on IRC, if the description parsing doesn't work, we can also set it in the UI. For example in https://librenms.wi... [08:36:23] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [ms-fe1015] - vriley@cumin1002" [08:36:23] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:37:01] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe1015 [08:37:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1015 [08:39:48] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:40:40] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [08:43:46] vriley@cumin1002 reimage (PID 750383) is awaiting input [08:45:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [08:45:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
an-worker1182.eqiad.wmnet with OS bullseye [08:45:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10754571 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1182.eqiad.wmnet with OS bulls... [08:46:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10754572 (10VRiley-WMF) [08:51:46] (03CR) 10Superpes15: [C:03+1] wikimaniawiki: update logo to 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137349 (https://phabricator.wikimedia.org/T392239) (owner: 10Robertsky) [08:53:40] FIRING: [4x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:00:34] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:02:30] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1015.eqiad.wmnet with OS bullseye [09:02:41] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10754592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-fe1015.eqiad.wmnet with OS bullseye [09:03:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:10:25] FIRING: [2x] SystemdUnitFailed: 
update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:19] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1015'] [09:38:49] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-fe1015'] [09:39:03] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe1015'] [09:39:42] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe1015'] [09:40:40] 06SRE, 10ChangeProp, 06cloud-services-team, 06collaboration-services, and 10 others: Figure out a plan to move forward with regarding Redis License changes - https://phabricator.wikimedia.org/T360596#10754632 (10taavi) Untagging #Quarry which has an active subtask. 
[09:40:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10754636 (10VRiley-WMF) [09:43:55] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [09:47:59] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [ms-fe1016] - vriley@cumin1002" [09:48:05] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [ms-fe1016] - vriley@cumin1002" [09:48:05] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:49:15] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:50:12] FIRING: ProbeDown: Service aux-k8s-ctrl1002:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:51:29] it's up, it just restarted [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:35] bet it's cert renewal agai [09:51:37] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:51:40] again [09:52:27] !log vriley@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host ms-fe1016 [09:52:35] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1016 [09:53:17] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:53:55] !incidents [09:53:56] 6044 (UNACKED) ProbeDown sre (2620:0:861:101:10:64:0:107 ip6 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip6 eqiad) [09:54:01] !ack 6044 [09:54:01] 6044 (ACKED) ProbeDown sre (2620:0:861:101:10:64:0:107 ip6 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip6 eqiad) [09:55:12] RESOLVED: ProbeDown: Service aux-k8s-ctrl1002:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:56:04] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:01] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ms-fe1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:57:36] claime: o/ I saw it happening on other clusters as well, I'd try to bump the cpus to all the control plane vms [09:57:58] I'll open a task 
[09:58:03] <3 [09:58:11] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10754650 (VRiley-WMF) [10:04:38] https://phabricator.wikimedia.org/T392289 [10:04:40] SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289 (elukey) NEW [10:07:22] yeah aux-ctrl has nproc=1 [10:07:30] very easy to end up in pages [10:08:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [10:10:26] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10754671 (VRiley-WMF) [10:14:17] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10754678 (VRiley-WMF) Hey @MatthewVernon I have been trying to install these. However, would you be able to check the preseed? I could be wrong, but I wasn't able to find that... 
[10:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:42:06] (03PS1) 10Filippo Giunchedi: pontoon: add acme-chief to o11y-filippo [puppet] - 10https://gerrit.wikimedia.org/r/1137436 [10:44:01] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add acme-chief to o11y-filippo [puppet] - 10https://gerrit.wikimedia.org/r/1137436 (owner: 10Filippo Giunchedi) [10:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250418T0700) [11:00:05] jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250418T1100). 
nyaa~ [11:07:49] ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10754717 (ayounsi) [11:09:48] ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10754719 (ayounsi) [11:17:30] !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 200132 [11:18:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 200132 [11:19:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:50:19] SRE-tools, homer, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#10754744 (ayounsi) Indeed ! Very useful ! One risk is that the "all-and-skip" doesn't incite us to maintain a "no outstanding changes" policy. Otherwise maybe the "all-and-sk... 
[11:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:24:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [12:28:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:30:10] Hmm [12:30:50] is now returning to normal rate [12:31:09] yeah but [12:31:10] looks similar to what happened at 10:15UTC [12:31:19] it's coinciding with the mediawiki_job_growthexperiments-updateMenteeData-s1.timer run [12:32:20] Active: activating (start) since Fri 2025-04-18 12:15:00 UTC; 16min ago [12:33:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:37:08] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10754751 (10Jclark-ctr) IDRAC hardware inventory SerialNumber KN09N7919I0709R2U Slot 6 ` sdf KN09N7919I0709R2U ├─sdf1 └─sdf2 ` [12:46:19] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:46:21] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:48:17] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:48:19] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:48:42] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10754754 (10Jclark-ctr) @Eevans Removed a failed drive and inserted a replacement drive from a decommissioned server. It appears that md126 and md127 were automatically assembled from existing data on the SSD.... 
[12:52:43] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10754757 (10Jclark-ctr) updated IDRAC Firmware Version 4.40.00.00 <-> 7.00.00.181 [12:53:40] FIRING: [4x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:39] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:00:29] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:00:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10754764 (10Jclark-ctr) @fnegri. 
I installed several blanking panels on the hot aisle to help prevent hot air bleeding into cold aisle be... [13:22:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10754765 (10Jclark-ctr) Also updated idrac firmware from 5.00.20.00 to 7.00.00.181 [13:29:33] (03PS1) 10Giuseppe Lavagetto: growthexperiments: tempoarily disable listTaskCount [puppet] - 10https://gerrit.wikimedia.org/r/1137442 [13:30:25] FIRING: SystemdUnitFailed: mediawiki_job_growthexperiments-listTaskCounts.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:30:52] (03PS2) 10Giuseppe Lavagetto: growthexperiments: tempoarily disable listTaskCount [puppet] - 10https://gerrit.wikimedia.org/r/1137442 [13:31:10] (03CR) 10Clément Goubert: [C:03+1] growthexperiments: tempoarily disable listTaskCount [puppet] - 10https://gerrit.wikimedia.org/r/1137442 (owner: 10Giuseppe Lavagetto) [13:33:04] (03CR) 10Giuseppe Lavagetto: [C:03+2] growthexperiments: tempoarily disable listTaskCount [puppet] - 10https://gerrit.wikimedia.org/r/1137442 (owner: 10Giuseppe Lavagetto) [13:45:25] RESOLVED: SystemdUnitFailed: mediawiki_job_growthexperiments-listTaskCounts.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:57:29] (03PS1) 10C. Scott Ananian: Remove ParserMigration configuration that matches defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137443 [13:58:29] (03CR) 10C. Scott Ananian: [C:04-2] "C-2 until the dependent patch rides the train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137443 (owner: 10C. 
Scott Ananian) [14:25:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission cloudelastic100[5-6] - https://phabricator.wikimedia.org/T380937#10754866 (10Jhancock.wm) this task was marked as complete but the servers still have a status of decommissioining instead of offline. @Jclark-ctr or @VRiley-WMF no rush... [14:30:10] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10754869 (10tappof) @wiki_willy I've just finished updating the dashboard to include the information scraped from Magru's PDU.... [14:32:11] (03PS1) 10Clément Goubert: Revert "growthexperiments: tempoarily disable listTaskCount" [puppet] - 10https://gerrit.wikimedia.org/r/1137449 [14:34:19] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:35:15] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:21] (03PS15) 10Tiziano Fogli: prometheus/alerts: define alert rules directly in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1101066 (https://phabricator.wikimedia.org/T381665) [14:40:14] <_joe_> !log enabled slow query log on db1218, investigating T390510 [14:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:18] T390510: Fatal DBUnexpectedError: "Database servers in extension1 are overloaded" - https://phabricator.wikimedia.org/T390510 [14:45:17] (03CR) 10Clément Goubert: [C:03+2] Revert 
"growthexperiments: tempoarily disable listTaskCount" [puppet] - 10https://gerrit.wikimedia.org/r/1137449 (owner: 10Clément Goubert) [14:45:44] !log rebooting aqs1015.eqiad.wmnet (drive detection/ordering) — T391903 [14:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:49] T391903: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903 [14:45:57] !log eevans@cumin1002 START - Cookbook sre.hosts.reboot-single for host aqs1015.eqiad.wmnet [14:46:19] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10754920 (10ops-monitoring-bot) Host rebooted by eevans@cumin1002 with reason: None [14:49:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10754936 (10RobH) a:05Jclark-ctr→03wiki_willy Assigning from John to Willy pending EQ update on pricing and build out of an additional rack in the new expansion. [14:50:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10754938 (10RobH) 05Open→03Invalid Please note that after further deliberation and investigation, it has been determined that migration out o... [14:50:59] 10ops-eqiad, 06SRE, 06DC-Ops: relocate sretest1002 out of D6 - https://phabricator.wikimedia.org/T391602#10754941 (10RobH) 05Open→03Invalid Please note that after further deliberation and investigation, it has been determined that migration out of D6 is non-ideal. Full details can be read on T390240... [14:51:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: relocate (3) service-ops hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391599#10754946 (10RobH) 05Stalled→03Invalid Please note that after further deliberation and investigation, it has been determined that migration out of D6 is non-ideal.... 
[14:51:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (1) discovery-search elastic1067 out of eqiad D6 - https://phabricator.wikimedia.org/T391542#10754951 (10RobH) 05Stalled→03Invalid Please note that after further deliberation and investigation, it has been determi... [14:51:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: relocate (10) data-persistence hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391540#10754956 (10RobH) 05Stalled→03Invalid Please note that after further deliberation and investigation, it has been determined that migration out of D6 is... [14:51:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10754961 (10RobH) 05Stalled→03Invalid Please note that after further deliberation and investigation, it has been determined t... [14:51:37] 10ops-eqiad, 06SRE, 06DC-Ops: example sub-task for relocation out of D6 - https://phabricator.wikimedia.org/T390243#10754966 (10RobH) 05Open→03Invalid Please note that after further deliberation and investigation, it has been determined that migration out of D6 is non-ideal. Full details can be read... [14:51:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#10754971 (10RobH) [14:52:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10754972 (10RobH) [14:52:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#10754975 (10RobH) >>! In T392006#10745097, @RobH wrote: > Please note I've tied original task T390240 to this for ease of tracking. If rack D6 is not sel... 
[14:53:25] FIRING: SystemdUnitFailed: mediawiki_job_growthexperiments-listTaskCounts.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:28] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1015.eqiad.wmnet [14:53:40] FIRING: [8x] ProbeDown: Service aqs1015-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:48] SystemdUnitFailed expected, just did a reset-failed, it'll run next iteration [14:58:25] RESOLVED: SystemdUnitFailed: mediawiki_job_growthexperiments-listTaskCounts.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:49] FIRING: [8x] ProbeDown: Service aqs1015-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:29] (03CR) 10C. Scott Ananian: "My recollection was that you were doing something slighly "evil" w/r/t overriding [[...]] link syntax, which Parsoid doesn't support. ..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137330 (owner: 10Jforrester) [15:02:22] (03CR) 10Jforrester: [wikifunctionswiki] Enable Parsoid in wikitext articles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137330 (owner: 10Jforrester) [15:04:35] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10755005 (10Eevans) Thanks @Jclark-ctr We're still down one drive though I'm afraid: scsi@7:0.0.0 (physical ID 2 on the second controller). I've rebooted the host via the cookbook (and there was indeed some d... [15:05:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm [15:05:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10755008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:15:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10755047 (10Jhancock.wm) @MoritzMuehlenhoff could you check the preseed file is correct for me? I'm getting an error on the partitioning section of the installer. I... 
[15:27:01] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10755098 (10Jhancock.wm) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:58:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [16:01:37] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:02:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [16:03:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [16:14:27] (03PS1) 10Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1137463 
[16:14:33] (03PS1) 10Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1137464 [16:21:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [16:22:32] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126#10755216 (10Eevans) 05Open→03Resolved a:03Eevans >>! In T390126#10753805, @jhathaway wrote: > @Eevans based on some help from ITS I was able to get the ro... [16:22:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:22:57] (03CR) 10BCornwall: "It looks like pywikipedia.org is another externally-managed domain but it directly points to the load balancers. 
So it makes sense to remo" [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [16:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:24:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [16:26:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [16:31:08] jhancock@cumin2002 reimage (PID 3142572) is awaiting input [16:40:11] (03CR) 10Pppery: "How is this different from the existing domain `pywikibot.org`, which is on the ignore list, other than the fact that the external DNS is " [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [16:41:09] (03CR) 10Pppery: "Also, what changed since the previous ncmonitor run, which doesn't appear to have tried to remove this?" [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [16:41:52] (03CR) 10Pppery: "See also T388809" [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [16:41:53] 06SRE, 10Pywikibot, 13Patch-For-Review: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10755232 (10Pppery) [16:43:06] (03CR) 10BCornwall: "The difference in configuration is the difference. 
😊" [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [16:44:37] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:45:18] 06SRE, 10Pywikibot, 13Patch-For-Review: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10755234 (10BCornwall) It appears that pywikipedia.org isn't actually pointing to ncredir but instead the load balancers. I missed that fact and just assumed it w... [16:45:44] (03CR) 10BCornwall: "*funnels, not filters" [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [16:46:02] (03CR) 10Pppery: "It seems to me that pywikipedia.org should have been handled in the same way as pywikibot.org was in T257536, and should still be handled " [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [16:49:10] (03CR) 10BCornwall: "I've no real opinion past "it's been broken for a long time without much hubbub so realistically it isn't used. I see that pywikibot used " [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [16:49:42] (03CR) 10Pppery: "To clarify my position, I don't think you should let this silently bitrot. 
If the Pywikibot team wants to abandon this domain, then so be " [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [16:51:11] 06SRE, 10Pywikibot, 13Patch-For-Review: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10755247 (10BCornwall) 05Open→03In progress p:05Triage→03Low [16:51:26] 06SRE, 10Pywikibot, 13Patch-For-Review: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10755249 (10BCornwall) I'll get in contact with the pywikibot team to see where they want to go with this [16:52:35] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [16:55:01] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10755252 (10Jhancock.wm) [17:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:10:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:35:41] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10755312 (10wiki_willy) Hi @tappof - great job and thank you so much for working on this! 
It looks like I'm able to see all th... [17:56:04] (03CR) 10Dzahn: "ACK, sounds good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [17:57:57] (03CR) 10Dzahn: "it still sets the replica to 2003 though, in this change" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [17:59:45] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10755349 (10phaultfinder) [18:02:21] (03CR) 10Dzahn: "Is there a story here why/how this domain was removed from MarkMonitor? I agree with Pppery here. Let's call the patch what it is, "remove" [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [18:13:02] (03CR) 10Dreamy Jazz: Remove wgCheckUserCentralIndexRangesToExclude definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134203 (https://phabricator.wikimedia.org/T389055) (owner: 10Dreamy Jazz) [18:13:45] (03CR) 10Dreamy Jazz: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134203 (https://phabricator.wikimedia.org/T389055) (owner: 10Dreamy Jazz) [18:24:18] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10755403 (10Dzahn) @thcipriani I noticed your work email has 2 variants, "tcipriani" and "thcipriani". I used the second one, matching your user name here. But I guess the... [18:26:31] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10755405 (10Dzahn) I removed the mailing list and then created it again, this time using the email address without the "h" in it. Let's see now? 
[18:29:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10755421 (10phaultfinder) [18:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:45] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10755425 (10thcipriani) 05Open→03Resolved a:05thcipriani→03Dzahn That did it! The `thcipriani` one is a variant since that's my nick everywhere (and has been... [18:47:21] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 10Sustainability (Incident Followup): sessionstore workload observability - https://phabricator.wikimedia.org/T392182#10755429 (10Eevans) [18:49:04] 06SRE-OnFire, 10Cassandra, 06MediaWiki-Platform-Team, 10Sustainability (Incident Followup): sessionstore workload observability - https://phabricator.wikimedia.org/T392182#10755430 (10Eevans) [18:54:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10755435 (10phaultfinder) [18:58:49] FIRING: [4x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:39:30] (03PS1) 10BCornwall: Add pywikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1137477 [19:40:16] (03PS2) 10BCornwall: Add pywikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1137477 (https://phabricator.wikimedia.org/T318804) [19:42:22] (03PS3) 10BCornwall: Add pywikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1137477 
(https://phabricator.wikimedia.org/T388809) [19:43:19] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:43:21] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:45:15] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:45:17] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:22:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2097-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:24:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2097:9100 - TODO - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10755580 (10phaultfinder) [20:47:16] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: NDA request coverage for KFrancis's PTO - https://phabricator.wikimedia.org/T391032#10755650 (10Dzahn) 05In progress→03Resolved a:03Dzahn Since there are no open access requests and Katie will be back on Monday I am closing this now. [20:49:20] (03CR) 10Dzahn: [C:03+1] Add pywikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1137477 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [20:50:19] (03CR) 10Dzahn: "I see https://gerrit.wikimedia.org/r/c/operations/dns/+/1137477 and https://phabricator.wikimedia.org/T388809 now. +1 over there." [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [21:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:49] (03PS2) 10BryanDavis: dblists: Add sul.dbexpr and generated sul.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) [21:04:49] (03PS1) 10BryanDavis: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 [21:05:41] (03CR) 10CI reject: [V:04-1] Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [21:06:30] (03CR) 10BryanDavis: "I203ebb8f433ed0af496cb0ddc9c13e438dc109cc now proposes using the new sul.dblist in InitialiseSettings.php."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137087 (https://phabricator.wikimedia.org/T392142) (owner: 10BryanDavis) [21:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:09:11] (03CR) 10BCornwall: [C:03+2] Add pywikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/1137477 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [21:09:51] !log brett@dns1005 START - running authdns-update [21:10:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:11:49] !log brett@dns1005 END - running authdns-update [21:12:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:15:55] (03PS1) 10BCornwall: acmechief: Add pywikipedia.org to the cert list [puppet] - 10https://gerrit.wikimedia.org/r/1137481 (https://phabricator.wikimedia.org/T257536) [21:16:53] (03PS1) 10BCornwall: ncmonitor: Add pywikipedia.org to ignored domains [puppet] - 10https://gerrit.wikimedia.org/r/1137482 (https://phabricator.wikimedia.org/T257536) [21:19:08] (03PS2) 10BryanDavis: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 [21:19:20] (03PS2) 10BCornwall: ncmonitor: Add pywikipedia.org to ignored domains [puppet] - 
10https://gerrit.wikimedia.org/r/1137482 (https://phabricator.wikimedia.org/T388809) [21:19:24] (03PS2) 10BCornwall: acmechief: Add pywikipedia.org to the cert list [puppet] - 10https://gerrit.wikimedia.org/r/1137481 (https://phabricator.wikimedia.org/T388809) [21:21:54] (03CR) 10Pppery: [C:03+1] ncmonitor: Add pywikipedia.org to ignored domains [puppet] - 10https://gerrit.wikimedia.org/r/1137482 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [21:26:03] (03CR) 10BCornwall: [C:03+2] ncmonitor: Add pywikipedia.org to ignored domains [puppet] - 10https://gerrit.wikimedia.org/r/1137482 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [21:26:44] (03PS3) 10BryanDavis: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 [21:31:17] (03PS11) 10Andrew Bogott: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [21:31:17] (03PS1) 10Andrew Bogott: invisible-unicorn: Delete dns entries before removing proxy records [puppet] - 10https://gerrit.wikimedia.org/r/1137483 (https://phabricator.wikimedia.org/T391718) [21:31:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:36:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:42:32] (03PS1) 10Dzahn: miscweb: remove static-rt profile from legacy miscweb role [puppet] - 
10https://gerrit.wikimedia.org/r/1137484 [21:43:17] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:44:17] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:44:31] (03PS1) 10Dzahn: cache/text: remove commented reference to static-rt from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1137485 [21:45:22] (03PS2) 10Dzahn: miscweb: remove static-rt profile from legacy miscweb role [puppet] - 10https://gerrit.wikimedia.org/r/1137484 [21:47:32] (03PS1) 10Dzahn: microsites/backup: remove rt-static backup fileset [puppet] - 10https://gerrit.wikimedia.org/r/1137486 [21:48:32] (03CR) 10Dzahn: [C:03+1] acmechief: Add pywikipedia.org to the cert list [puppet] - 10https://gerrit.wikimedia.org/r/1137481 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [21:48:33] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:49:07] (03CR) 10Dzahn: [C:03+1] ncmonitor: Add pywikipedia.org to ignored domains [puppet] - 10https://gerrit.wikimedia.org/r/1137482 (https://phabricator.wikimedia.org/T388809) (owner: 10BCornwall) [21:51:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:52:48] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1179 [21:53:44] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1179 [21:54:31] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:56:34] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:33:40] FIRING: [6x] ProbeDown: 
Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:35:25] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:55:49] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2125 MB (3% inode=95%): /tmp 2125 MB (3% inode=95%): /var/tmp 2125 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [23:04:00] (03PS3) 10Dzahn: phabricator::migration: add scap::target, remove scap bin symlink [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) [23:06:13] (03CR) 10Dzahn: "different error now, but https://puppet-compiler.wmflabs.org/output/1135841/5319/phab1005.eqiad.wmnet/change.phab1005.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [23:09:41] (03PS4) 10Dzahn: phabricator::migration: add scap::target, add deploy scripts, rm symlink [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) [23:13:08] (03CR) 10Dzahn: "this looks better now https://puppet-compiler.wmflabs.org/output/1135841/5320/phab1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [23:14:19] (03PS5) 10Dzahn: phabricator::migration: add scap::target, add deploy scripts, rm symlink [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) [23:15:17] 
(03CR) 10Dzahn: "This is cherry-picking parts of existing phabricator puppet code. Not reinventing the wheel. Just trying to get scap working without the r" [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [23:16:27] (03CR) 10CI reject: [V:04-1] phabricator::migration: add scap::target, add deploy scripts, rm symlink [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [23:16:50] (03CR) 10Dzahn: [C:03+1] "The interesting part is this change:" [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [23:18:15] (03PS6) 10Dzahn: phabricator::migration: add scap::target, add deploy scripts, rm symlink [puppet] - 10https://gerrit.wikimedia.org/r/1135841 (https://phabricator.wikimedia.org/T377889) [23:30:14] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:38:21] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:39:17] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:41:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1137494 [23:41:19] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1137494 (owner: 10TrainBranchBot) [23:43:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:52:03] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1137494 (owner: 10TrainBranchBot) [23:53:40] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:58:01] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:59:21] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED