[00:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:03:29] PROBLEM - SSH on puppetserver1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:03:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [00:03:49] PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:03:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [00:04:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T371742)', diff saved to https://phabricator.wikimedia.org/P68387 and previous config saved to /var/cache/conftool/dbconfig/20240831-000400-ladsgroup.json [00:04:05] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [00:04:11] (03PS2) 10Bartosz Dziewoński: Remove labs settings for $wmgExtraLogFile that have no effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069321 [00:04:12] (03PS1) 10Bartosz Dziewoński: logging: Replace 'blackhole' handler with no handlers at all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 [00:05:06] (03CR) 10Bartosz Dziewoński: "Again, please check that this is really correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069344 (owner: 10Bartosz Dziewoński) [00:06:10] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:15] (03CR) 10Jdlrobson: [C:03+1] "Steph, you can +2 this one I believe anytime with your deployer rights (even outside a window) as it is beta only. Please check with anoth" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069334 (https://phabricator.wikimedia.org/T372757) (owner: 10Stoyofuku-wmf) [00:06:39] RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:06:45] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1069326 (owner: 10TrainBranchBot) [00:07:58] FIRING: JobUnavailable: Reduced availability for job jmx_puppetserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:11:28] RESOLVED: JobUnavailable: Reduced availability for job jmx_puppetserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:13:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P68388 and previous config saved to /var/cache/conftool/dbconfig/20240831-001312-ladsgroup.json [00:18:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:21:30] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 4.045s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:23:06] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:23:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:25:53] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:26:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 3.368s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:28:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T370903)', diff saved to https://phabricator.wikimedia.org/P68389 and previous config saved to /var/cache/conftool/dbconfig/20240831-002819-ladsgroup.json [00:28:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1235.eqiad.wmnet with reason: Maintenance [00:28:25] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [00:28:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1235.eqiad.wmnet with reason: Maintenance [00:28:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T370903)', diff saved to https://phabricator.wikimedia.org/P68390 and previous config saved to /var/cache/conftool/dbconfig/20240831-002842-ladsgroup.json [00:32:25] RECOVERY - SSH on puppetserver1003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:47:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T370903)', diff saved to https://phabricator.wikimedia.org/P68391 and previous config saved to /var/cache/conftool/dbconfig/20240831-004732-ladsgroup.json [00:47:37] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [00:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:01:10] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:02:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P68392 and previous config saved to /var/cache/conftool/dbconfig/20240831-010239-ladsgroup.json [01:05:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T371742)', diff saved to https://phabricator.wikimedia.org/P68393 and previous config saved to /var/cache/conftool/dbconfig/20240831-010548-ladsgroup.json [01:05:53] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [01:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:17:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P68394 and previous config saved to /var/cache/conftool/dbconfig/20240831-011746-ladsgroup.json [01:20:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P68395 and previous config saved to /var/cache/conftool/dbconfig/20240831-012055-ladsgroup.json [01:32:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T370903)', diff saved to https://phabricator.wikimedia.org/P68396 and previous config saved to /var/cache/conftool/dbconfig/20240831-013254-ladsgroup.json [01:32:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1239.eqiad.wmnet with reason: Maintenance [01:32:59] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [01:33:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1239.eqiad.wmnet with reason: Maintenance [01:33:40] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:36:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P68397 and previous config saved to /var/cache/conftool/dbconfig/20240831-013603-ladsgroup.json [01:37:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:40:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:51:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T371742)', diff saved to https://phabricator.wikimedia.org/P68398 and previous config saved to /var/cache/conftool/dbconfig/20240831-015110-ladsgroup.json [01:51:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance [01:51:15] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [01:51:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance [01:51:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T371742)', diff saved to https://phabricator.wikimedia.org/P68399 and previous config saved to /var/cache/conftool/dbconfig/20240831-015132-ladsgroup.json [01:51:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [01:51:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [01:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:15:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [02:15:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [02:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:19:10] (03CR) 10Krinkle: Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069320 (owner: 10Bartosz Dziewoński) [02:28:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [02:28:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [02:28:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T370903)', diff saved to https://phabricator.wikimedia.org/P68400 and previous config saved to /var/cache/conftool/dbconfig/20240831-022822-ladsgroup.json [02:28:27] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [02:33:40] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:33:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:28] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:43:42] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T373727 (10phaultfinder) 03NEW [02:47:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T370903)', diff saved to https://phabricator.wikimedia.org/P68401 and previous config saved to /var/cache/conftool/dbconfig/20240831-024727-ladsgroup.json [02:47:32] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [02:48:54] (03PS1) 10Dzahn: network: add a list of friendly networks [puppet] - 10https://gerrit.wikimedia.org/r/1069387 (https://phabricator.wikimedia.org/T365259) [02:49:10] (03PS2) 10Dzahn: network: introduce a list of friendly networks [puppet] - 10https://gerrit.wikimedia.org/r/1069387 (https://phabricator.wikimedia.org/T365259) [02:52:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T371742)', diff saved to https://phabricator.wikimedia.org/P68402 and previous config saved to /var/cache/conftool/dbconfig/20240831-025248-ladsgroup.json [02:52:53] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [02:54:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [02:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:00] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [03:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:01:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:02:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P68403 and previous config saved to /var/cache/conftool/dbconfig/20240831-030234-ladsgroup.json [03:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:10] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:07:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P68404 and previous config saved to /var/cache/conftool/dbconfig/20240831-030755-ladsgroup.json [03:17:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P68405 and previous config saved to /var/cache/conftool/dbconfig/20240831-031741-ladsgroup.json [03:19:23] RECOVERY - Host an-worker1165 is UP: PING WARNING - Packet loss = 77%, RTA = 0.17 ms [03:19:37] PROBLEM - SSH on an-worker1165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:23:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P68406 and previous config saved to /var/cache/conftool/dbconfig/20240831-032302-ladsgroup.json [03:25:47] PROBLEM - Host an-worker1165 is DOWN: PING CRITICAL - Packet loss = 100% [03:32:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T370903)', diff saved to https://phabricator.wikimedia.org/P68407 and previous config saved to /var/cache/conftool/dbconfig/20240831-033248-ladsgroup.json [03:32:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance [03:32:53] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [03:33:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance [03:33:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T370903)', diff saved to https://phabricator.wikimedia.org/P68408 and previous config saved to /var/cache/conftool/dbconfig/20240831-033310-ladsgroup.json [03:38:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T371742)', diff saved to https://phabricator.wikimedia.org/P68409 and previous config saved to /var/cache/conftool/dbconfig/20240831-033809-ladsgroup.json [03:38:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [03:38:14] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [03:38:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [03:38:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T371742)', diff saved to https://phabricator.wikimedia.org/P68410 and previous config saved to /var/cache/conftool/dbconfig/20240831-033831-ladsgroup.json [03:41:02] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2009 - ManagementSSHDown - https://phabricator.wikimedia.org/T373727#10107978 (10Dzahn) [03:42:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: ganeti2009 - ManagementSSHDown - https://phabricator.wikimedia.org/T373727#10107979 (10Dzahn) [03:42:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: ganeti2009 - ManagementSSHDown - https://phabricator.wikimedia.org/T373727#10107980 (10Dzahn) p:05Triage→03Low [03:45:53] (03CR) 10Dzahn: [C:03+1] "checked LDAP user, uid/email. has group approval on ticket, lgtm. just might need manager approval and then good to go." [puppet] - 10https://gerrit.wikimedia.org/r/1069175 (https://phabricator.wikimedia.org/T373666) (owner: 10Ssingh) [03:47:35] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10107981 (10Dzahn) 05Open→03In progress [03:51:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T370903)', diff saved to https://phabricator.wikimedia.org/P68411 and previous config saved to /var/cache/conftool/dbconfig/20240831-035100-ladsgroup.json [03:51:01] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10107986 (10Dzahn) a:03VPuffetMichel [03:51:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10107987 (10Dzahn) p:05Triage→03Medium [03:51:07] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [03:52:45] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10107985 (10Dzahn) Hi @VPuffetMichel , looks like this is just pending your approval and then it's good to go. Cheers. [03:57:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:58:23] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:59:15] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52482 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:59:45] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:06:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P68412 and previous config saved to /var/cache/conftool/dbconfig/20240831-040607-ladsgroup.json [04:10:41] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:21:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P68413 and previous config saved to /var/cache/conftool/dbconfig/20240831-042114-ladsgroup.json [04:27:00] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [04:27:47] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:31:51] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:32:45] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:35:51] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:36:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T370903)', diff saved to https://phabricator.wikimedia.org/P68414 and previous config saved to /var/cache/conftool/dbconfig/20240831-043621-ladsgroup.json [04:36:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [04:36:26] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [04:36:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [04:39:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T371742)', diff saved to https://phabricator.wikimedia.org/P68415 and previous config saved to /var/cache/conftool/dbconfig/20240831-043954-ladsgroup.json [04:39:59] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [04:42:00] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [04:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [04:54:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance [04:54:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2145.codfw.wmnet with reason: Maintenance [04:54:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T370903)', diff saved to https://phabricator.wikimedia.org/P68416 and previous config saved to /var/cache/conftool/dbconfig/20240831-045435-ladsgroup.json [04:54:40] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [04:55:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P68417 and previous config saved to /var/cache/conftool/dbconfig/20240831-045501-ladsgroup.json [05:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:10:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P68418 and previous config saved to /var/cache/conftool/dbconfig/20240831-051009-ladsgroup.json [05:11:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T370903)', diff saved to https://phabricator.wikimedia.org/P68419 and previous config saved to /var/cache/conftool/dbconfig/20240831-051157-ladsgroup.json [05:12:02] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [05:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:47] PROBLEM - Disk space on mw1445 is CRITICAL: DISK CRITICAL - free space: / 753 MB (0% inode=99%): /tmp 753 MB (0% inode=99%): /var/tmp 753 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops [05:16:55] PROBLEM - OpenSearch health check for shards on 9200 on logstash2036 is CRITICAL: CRITICAL - elasticsearch inactive shards 561 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 408, active_shards: 1022, relocating_shards: 1, initializing_shards: 3, unassigned_shards: 558 [05:16:55] d_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 66327, active_shards_percent_as_number: 64.56096020214783 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:16:55] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd2003 is CRITICAL: CRITICAL - elasticsearch inactive shards 561 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 408, active_shards: 1022, relocating_shards: 1, initializing_shards: 3, unassigned_shards: 5 [05:16:55] yed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 66335, active_shards_percent_as_number: 64.56096020214783 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:16:57] PROBLEM - OpenSearch health check for shards on 9200 on logstash2037 is CRITICAL: CRITICAL - elasticsearch inactive shards 561 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 408, active_shards: 1022, relocating_shards: 1, initializing_shards: 20, unassigned_shards: 54 [05:16:57] ed_unassigned_shards: 0, number_of_pending_tasks: 18, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68644, active_shards_percent_as_number: 64.56096020214783 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:16:57] PROBLEM - OpenSearch health check for shards on 9200 on logstash2033 is CRITICAL: CRITICAL - elasticsearch inactive shards 561 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 408, active_shards: 1022, relocating_shards: 1, initializing_shards: 20, unassigned_shards: 54 [05:16:57] ed_unassigned_shards: 0, number_of_pending_tasks: 18, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68706, active_shards_percent_as_number: 64.56096020214783 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:16:57] PROBLEM - OpenSearch health check for shards on 9200 on logstash2034 is CRITICAL: CRITICAL - elasticsearch inactive shards 561 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 408, active_shards: 1022, relocating_shards: 1, initializing_shards: 20, unassigned_shards: 54 [05:16:58] ed_unassigned_shards: 0, number_of_pending_tasks: 20, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68900, active_shards_percent_as_number: 64.56096020214783 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:16:58] PROBLEM - OpenSearch health check for shards on 9200 on logstash2035 is CRITICAL: CRITICAL - elasticsearch inactive shards 561 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 408, active_shards: 1022, relocating_shards: 1, initializing_shards: 20, unassigned_shards: 54 [05:16:59] ed_unassigned_shards: 0, number_of_pending_tasks: 20, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68900, active_shards_percent_as_number: 64.56096020214783 https://wikitech.wikimedia.org/wiki/Search%23Administration [05:16:59] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd2001 is CRITICAL: CRITICAL - elasticsearch inactive shards 554 threshold =0.34 breach: cluster_name: production-elk7-codfw, status: red, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 414, active_shards: 1029, relocating_shards: 1, initializing_shards: 13, unassigned_shards: [05:25:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T371742)', diff saved to https://phabricator.wikimedia.org/P68420 and previous config saved to /var/cache/conftool/dbconfig/20240831-052516-ladsgroup.json [05:25:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [05:25:22] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [05:25:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [05:25:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [05:25:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [05:25:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T371742)', diff saved to https://phabricator.wikimedia.org/P68421 and previous config saved to /var/cache/conftool/dbconfig/20240831-052543-ladsgroup.json [05:27:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P68422 and previous config saved to /var/cache/conftool/dbconfig/20240831-052704-ladsgroup.json [05:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:40:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:42:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P68423 and previous config saved to /var/cache/conftool/dbconfig/20240831-054211-ladsgroup.json [05:57:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T370903)', diff saved to https://phabricator.wikimedia.org/P68424 and previous config saved to /var/cache/conftool/dbconfig/20240831-055719-ladsgroup.json [05:57:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [05:57:24] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [05:57:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [05:57:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T370903)', diff saved to https://phabricator.wikimedia.org/P68425 and previous config saved to /var/cache/conftool/dbconfig/20240831-055741-ladsgroup.json [06:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:04:36] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:09:36] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:17:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T370903)', diff saved to https://phabricator.wikimedia.org/P68426 and previous config saved to /var/cache/conftool/dbconfig/20240831-061749-ladsgroup.json [06:17:54] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [06:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:27:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T371742)', diff saved to https://phabricator.wikimedia.org/P68427 and previous config saved to /var/cache/conftool/dbconfig/20240831-062659-ladsgroup.json [06:27:04] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [06:32:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P68428 and previous config saved to /var/cache/conftool/dbconfig/20240831-063256-ladsgroup.json [06:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:42:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P68429 and previous config saved to /var/cache/conftool/dbconfig/20240831-064207-ladsgroup.json [06:44:12] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T373731 (10phaultfinder) 03NEW [06:48:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P68430 and previous config saved to /var/cache/conftool/dbconfig/20240831-064803-ladsgroup.json [06:57:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P68431 and previous config saved to /var/cache/conftool/dbconfig/20240831-065714-ladsgroup.json [07:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:03:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T370903)', diff saved to https://phabricator.wikimedia.org/P68432 and previous config saved to /var/cache/conftool/dbconfig/20240831-070311-ladsgroup.json [07:03:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [07:03:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [07:03:28] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [07:03:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T370903)', diff saved to https://phabricator.wikimedia.org/P68433 and previous config saved to /var/cache/conftool/dbconfig/20240831-070333-ladsgroup.json [07:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:12:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T371742)', diff saved to https://phabricator.wikimedia.org/P68434 and previous config saved to /var/cache/conftool/dbconfig/20240831-071221-ladsgroup.json [07:12:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [07:12:26] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [07:12:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [07:12:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T371742)', diff saved to https://phabricator.wikimedia.org/P68435 and previous config saved to /var/cache/conftool/dbconfig/20240831-071243-ladsgroup.json [07:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T370903)', diff saved to https://phabricator.wikimedia.org/P68436 and previous config saved to /var/cache/conftool/dbconfig/20240831-072138-ladsgroup.json [07:21:44] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [07:36:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P68437 and previous config saved to /var/cache/conftool/dbconfig/20240831-073645-ladsgroup.json [07:51:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P68438 and previous config saved to /var/cache/conftool/dbconfig/20240831-075152-ladsgroup.json [08:01:40] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T370903)', diff saved to https://phabricator.wikimedia.org/P68439 and previous config saved to /var/cache/conftool/dbconfig/20240831-080700-ladsgroup.json [08:07:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [08:07:05] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [08:07:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [08:07:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T370903)', diff saved to https://phabricator.wikimedia.org/P68440 and previous config saved to /var/cache/conftool/dbconfig/20240831-080733-ladsgroup.json [08:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:11:40] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:16:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T371742)', diff saved to https://phabricator.wikimedia.org/P68441 and previous config saved to /var/cache/conftool/dbconfig/20240831-081611-ladsgroup.json [08:16:17] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:21:40] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T370903)', diff saved to https://phabricator.wikimedia.org/P68442 and previous config saved to /var/cache/conftool/dbconfig/20240831-082256-ladsgroup.json [08:23:01] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [08:26:40] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:31:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P68443 and previous config saved to /var/cache/conftool/dbconfig/20240831-083118-ladsgroup.json [08:31:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:36:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P68444 and previous config saved to /var/cache/conftool/dbconfig/20240831-083803-ladsgroup.json [08:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:46:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P68445 and previous config saved to /var/cache/conftool/dbconfig/20240831-084626-ladsgroup.json [08:46:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:51:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:53:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P68446 and previous config saved to /var/cache/conftool/dbconfig/20240831-085310-ladsgroup.json [08:56:40] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:01:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T371742)', diff saved to https://phabricator.wikimedia.org/P68447 and previous config saved to /var/cache/conftool/dbconfig/20240831-090133-ladsgroup.json [09:01:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [09:01:38] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [09:01:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [09:01:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T371742)', diff saved to https://phabricator.wikimedia.org/P68448 and previous config saved to /var/cache/conftool/dbconfig/20240831-090155-ladsgroup.json [09:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T370903)', diff saved to https://phabricator.wikimedia.org/P68449 and previous config saved to /var/cache/conftool/dbconfig/20240831-090817-ladsgroup.json [09:08:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [09:08:22] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [09:08:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [09:08:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:08:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T370903)', diff saved to https://phabricator.wikimedia.org/P68450 and previous config saved to /var/cache/conftool/dbconfig/20240831-090843-ladsgroup.json [09:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:11:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T370903)', diff saved to https://phabricator.wikimedia.org/P68451 and previous config saved to /var/cache/conftool/dbconfig/20240831-092548-ladsgroup.json [09:25:53] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [09:26:40] RESOLVED: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:40:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P68452 and previous config saved to /var/cache/conftool/dbconfig/20240831-094055-ladsgroup.json [09:43:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P68453 and previous config saved to /var/cache/conftool/dbconfig/20240831-095602-ladsgroup.json [10:03:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T371742)', diff saved to https://phabricator.wikimedia.org/P68454 and previous config saved to /var/cache/conftool/dbconfig/20240831-100308-ladsgroup.json [10:03:13] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T370903)', diff saved to https://phabricator.wikimedia.org/P68455 and previous config saved to /var/cache/conftool/dbconfig/20240831-101109-ladsgroup.json [10:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [10:11:15] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [10:11:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [10:11:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T370903)', diff saved to https://phabricator.wikimedia.org/P68456 and previous config saved to /var/cache/conftool/dbconfig/20240831-101131-ladsgroup.json [10:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:18:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P68457 and previous config saved to /var/cache/conftool/dbconfig/20240831-101815-ladsgroup.json [10:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:29:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T370903)', diff saved to https://phabricator.wikimedia.org/P68458 and previous config saved to /var/cache/conftool/dbconfig/20240831-102944-ladsgroup.json [10:29:50] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [10:33:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P68459 and previous config saved to /var/cache/conftool/dbconfig/20240831-103322-ladsgroup.json [10:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:43:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:44:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P68460 and previous config saved to /var/cache/conftool/dbconfig/20240831-104452-ladsgroup.json [10:48:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T371742)', diff saved to https://phabricator.wikimedia.org/P68461 and previous config saved to /var/cache/conftool/dbconfig/20240831-104829-ladsgroup.json [10:48:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance [10:48:34] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:48:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance [10:55:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:58:07] FIRING: [3x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:58:30] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 1.215s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:59:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P68462 and previous config saved to /var/cache/conftool/dbconfig/20240831-105959-ladsgroup.json [11:00:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:00:53] RESOLVED: [3x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:03:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web (k8s) 1.176s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T370903)', diff saved to https://phabricator.wikimedia.org/P68463 and previous config saved to /var/cache/conftool/dbconfig/20240831-111506-ladsgroup.json [11:15:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [11:15:11] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:15:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [11:15:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T370903)', diff saved to https://phabricator.wikimedia.org/P68464 and previous config saved to /var/cache/conftool/dbconfig/20240831-111528-ladsgroup.json [11:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:33:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T370903)', diff saved to https://phabricator.wikimedia.org/P68465 and previous config saved to /var/cache/conftool/dbconfig/20240831-113355-ladsgroup.json [11:34:01] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:39:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2200.codfw.wmnet with reason: Maintenance [11:39:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2200.codfw.wmnet with reason: Maintenance [11:49:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P68466 and previous config saved to /var/cache/conftool/dbconfig/20240831-114902-ladsgroup.json [12:04:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P68467 and previous config saved to /var/cache/conftool/dbconfig/20240831-120409-ladsgroup.json [12:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:19:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T370903)', diff saved to https://phabricator.wikimedia.org/P68468 and previous config saved to /var/cache/conftool/dbconfig/20240831-121915-ladsgroup.json [12:19:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2188.codfw.wmnet with reason: Maintenance [12:19:21] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:19:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2188.codfw.wmnet with reason: Maintenance [12:19:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T370903)', diff saved to https://phabricator.wikimedia.org/P68469 and previous config saved to /var/cache/conftool/dbconfig/20240831-121937-ladsgroup.json [12:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:28:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance [12:28:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance [12:29:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T371742)', diff saved to https://phabricator.wikimedia.org/P68470 and previous config saved to /var/cache/conftool/dbconfig/20240831-122900-ladsgroup.json [12:29:05] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:33:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T370903)', diff saved to https://phabricator.wikimedia.org/P68471 and previous config saved to /var/cache/conftool/dbconfig/20240831-123346-ladsgroup.json [12:33:51] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:48:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P68472 and previous config saved to /var/cache/conftool/dbconfig/20240831-124853-ladsgroup.json [12:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [13:04:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P68473 and previous config saved to /var/cache/conftool/dbconfig/20240831-130400-ladsgroup.json [13:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T370903)', diff saved to https://phabricator.wikimedia.org/P68474 and previous config saved to /var/cache/conftool/dbconfig/20240831-131907-ladsgroup.json [13:19:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2202.codfw.wmnet with reason: Maintenance [13:19:13] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:19:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2202.codfw.wmnet with reason: Maintenance [13:23:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T371742)', diff saved to https://phabricator.wikimedia.org/P68475 and previous config saved to /var/cache/conftool/dbconfig/20240831-132428-ladsgroup.json [13:24:33] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:33:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2212.codfw.wmnet with reason: Maintenance [13:33:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2212.codfw.wmnet with reason: Maintenance [13:33:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T370903)', diff saved to https://phabricator.wikimedia.org/P68476 and previous config saved to /var/cache/conftool/dbconfig/20240831-133349-ladsgroup.json [13:33:55] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:39:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P68477 and previous config saved to /var/cache/conftool/dbconfig/20240831-133935-ladsgroup.json [13:40:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:48:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T370903)', diff saved to https://phabricator.wikimedia.org/P68478 and previous config saved to /var/cache/conftool/dbconfig/20240831-134804-ladsgroup.json [13:48:09] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:54:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P68479 and previous config saved to /var/cache/conftool/dbconfig/20240831-135442-ladsgroup.json [14:03:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P68480 and previous config saved to /var/cache/conftool/dbconfig/20240831-140311-ladsgroup.json [14:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T371742)', diff saved to https://phabricator.wikimedia.org/P68481 and previous config saved to /var/cache/conftool/dbconfig/20240831-140949-ladsgroup.json [14:09:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [14:09:54] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:10:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [14:10:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T371742)', diff saved to https://phabricator.wikimedia.org/P68482 and previous config saved to /var/cache/conftool/dbconfig/20240831-141011-ladsgroup.json [14:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T367856)', diff saved to https://phabricator.wikimedia.org/P68483 and previous config saved to /var/cache/conftool/dbconfig/20240831-141150-marostegui.json [14:11:55] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:18:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P68484 and previous config saved to /var/cache/conftool/dbconfig/20240831-141819-ladsgroup.json [14:26:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P68485 and previous config saved to /var/cache/conftool/dbconfig/20240831-142657-marostegui.json [14:33:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T370903)', diff saved to https://phabricator.wikimedia.org/P68486 and previous config saved to /var/cache/conftool/dbconfig/20240831-143326-ladsgroup.json [14:33:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2216.codfw.wmnet with reason: Maintenance [14:33:31] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:33:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2216.codfw.wmnet with reason: Maintenance [14:33:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T370903)', diff saved to https://phabricator.wikimedia.org/P68487 and previous config saved to /var/cache/conftool/dbconfig/20240831-143348-ladsgroup.json [14:36:28] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P68488 and previous config saved to /var/cache/conftool/dbconfig/20240831-144204-marostegui.json [14:43:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:47:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T370903)', diff saved to https://phabricator.wikimedia.org/P68489 and previous config saved to /var/cache/conftool/dbconfig/20240831-144748-ladsgroup.json [14:47:53] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:57:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T367856)', diff saved to https://phabricator.wikimedia.org/P68490 and previous config saved to /var/cache/conftool/dbconfig/20240831-145712-marostegui.json [14:57:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2181.codfw.wmnet with reason: Maintenance [14:57:17] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:57:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 7:00:00 on db2181.codfw.wmnet with reason: Maintenance [14:57:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T367856)', diff saved to https://phabricator.wikimedia.org/P68491 and previous config saved to /var/cache/conftool/dbconfig/20240831-145733-marostegui.json [15:01:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P68492 and previous config saved to /var/cache/conftool/dbconfig/20240831-150254-ladsgroup.json [15:08:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T371742)', diff saved to https://phabricator.wikimedia.org/P68493 and previous config saved to /var/cache/conftool/dbconfig/20240831-150810-ladsgroup.json [15:08:15] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P68494 and previous config saved to /var/cache/conftool/dbconfig/20240831-151802-ladsgroup.json [15:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:23:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P68495 and previous config saved to /var/cache/conftool/dbconfig/20240831-152317-ladsgroup.json [15:23:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T370903)', diff saved to https://phabricator.wikimedia.org/P68496 and previous config saved to /var/cache/conftool/dbconfig/20240831-153309-ladsgroup.json [15:33:14] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:38:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P68497 and previous config saved to /var/cache/conftool/dbconfig/20240831-153824-ladsgroup.json [15:43:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:53:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T371742)', diff saved to https://phabricator.wikimedia.org/P68498 and previous config saved to /var/cache/conftool/dbconfig/20240831-155331-ladsgroup.json [15:53:37] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:59:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:02:30] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 7.528s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:05:12] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:05:53] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:07:22] I do wish more of our alerts had runbooks that actually existed :( [17:08:38] here [17:09:08] I've found https://wikitech.wikimedia.org/wiki/Wikifunctions/Runbook [17:09:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:09:32] https://grafana.wikimedia.org/d/FEkiKFqVk/wikifunctions?orgId=1&var-dc=thanos&var-service=wikifunctions&var-site=eqiad&var-prometheus=k8s&var-container_name=All&var-pod_prefix=All looks unremarkable [17:10:12] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:33] claime: ^-- looks to have self-resolved? [17:10:38] yeah [17:10:44] i'm fighting my password manager, sorry [17:13:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:06] o/ [17:19:56] Emperor, claime: hey ... I think this might just be the fact that we're running just 2 IIRC replicas of mw-wikifunctions for now [17:20:15] and that does not seem enough to sustain 25rp/s [17:20:19] jayme: yes [17:20:37] I can prepare a patch if you're not already on it [17:20:53] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:23:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:27:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 18.55s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:40:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:23:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:05:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:09:30] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 7.684s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:10:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:13:07] FIRING: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:13:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:18:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:19:31] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 8.841s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:20:53] RESOLVED: ProbeDown: Service mw-wikifunctions:4451 has failed probes (http_mw-wikifunctions_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-wikifunctions:4451 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:23:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:34] (03CR) 10Krinkle: [C:03+1] Remove labs settings for $wmgExtraLogFile that have no effect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069321 (owner: 10Bartosz Dziewoński) [19:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:54:54] (03CR) 10Krinkle: Replace confusing uses of $wgDebugLogFile with $wmgExtraLogFile (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069320 (owner: 10Bartosz Dziewoński) [20:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:28:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:40] 06SRE, 06Commons, 07Wikimedia-production-error: Cannot delete file on Commons: DBQueryError (File:Logo-headlinejabar.jpg) - https://phabricator.wikimedia.org/T373748#10108586 (10Peachey88) [21:40:24] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:28:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:28:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:31:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1069640 [23:38:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1069640 (owner: 10TrainBranchBot) [23:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed