[00:00:04] (PS1) Stoyofuku-wmf: Release donate link to pilot wikis (French Wikipedia and Wikifunctions) [mediawiki-config] - https://gerrit.wikimedia.org/r/1071281 (https://phabricator.wikimedia.org/T373585) [00:00:37] (CR) Stoyofuku-wmf: "for you to review at your leisure" [mediawiki-config] - https://gerrit.wikimedia.org/r/1071281 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf) [00:02:06] FIRING: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:02:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:03:40] PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [00:07:06] RESOLVED: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:07:40] RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [00:09:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1071279 (owner: TrainBranchBot) [00:17:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1071281 (https://phabricator.wikimedia.org/T373585) (owner: Stoyofuku-wmf) [00:20:29] (PS1) Stoyofuku-wmf: Remove donate link from sidebar menu when it is added to the user menu [extensions/WikimediaMessages] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1071282 (https://phabricator.wikimedia.org/T373566) [00:20:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/WikimediaMessages] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1071282 (https://phabricator.wikimedia.org/T373566) (owner: Stoyofuku-wmf) [00:21:06] FIRING: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:23:48] PROBLEM - people.wikimedia.org requires authentication on people1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [00:24:38] RECOVERY - people.wikimedia.org requires authentication on people1004 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [00:24:45] (Abandoned) Stoyofuku-wmf: Remove donate link from sidebar menu when it is added to the user menu [extensions/WikimediaMessages] (wmf/1.43.0-wmf.21) - https://gerrit.wikimedia.org/r/1071282 (https://phabricator.wikimedia.org/T373566) (owner: Stoyofuku-wmf) [00:26:06] RESOLVED: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:31:14] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2396.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2338.codfw.wmnet, mw2368.codfw.wmnet, parse2003.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2022.codfw.wmnet, mw [00:31:14] fw.wmnet, parse2020.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2449.codfw.wmnet, wikikube-worker2014.codfw.wmnet, mw2304.codfw.wmnet, kubernetes2036.codfw.wmnet, wikikube-worker2048.codfw.wmnet, wikikube-worker2073.codfw.wmnet, wikikube-worker2056.codfw.wmnet, mw2376.codf [00:31:14] wikikube-worker2066.codfw.wmnet, wikikube-worker2007.codfw.wmnet, wikikube-worker2094.codfw.wmnet, wikikube-worker2034.codfw.wmnet, mw2437.codfw.wmnet, kubernetes2047.codfw.wmnet, mw23 https://wikitech.wikimedia.org/wiki/PyBal [00:31:26] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers parse2001.codfw.wmnet, mw2338.codfw.wmnet, parse2003.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2040.codfw.wmnet, parse2018.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2091.codfw.wmnet, mw2431.codfw.wmnet, mw2352.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2060.codfw.wmnet, mw23 [00:31:26] .wmnet, wikikube-worker2041.codfw.wmnet, mw2302.codfw.wmnet, wikikube-worker2055.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2016.codfw.wmnet, wikikube-worker2050.codfw.wmnet, wikikube-worker2045.codfw.wmnet, 
mw2413.codfw.wmnet, mw2356.codfw.wmnet, mw2429.codfw.wmnet, kubernetes2022.codfw.wmnet, mw2451.codfw.wmnet, mw2304.codfw.wmnet, mw2444.codfw.wmnet, wikikube-worker2028.codfw.wmnet, wikikube-worker2049.codfw.wmnet, mw2372.codfw.wmne [00:31:26] 2008.codfw.wmnet, mw2395.codfw.wmnet, wikikube-worker2031.codfw.wmnet, parse2002.codfw.wmnet, wikikube-worker2100.codfw.wmnet, parse2007.codfw.wmnet, mw2374.codfw.wmnet, mw2445.codfw.wm https://wikitech.wikimedia.org/wiki/PyBal [00:32:14] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:32:26] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:41:38] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [01:45:20] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2099.codfw.wmnet, mw2443.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2077.codfw.wmnet, wikikube-worker2010.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2027.co [01:45:20] t, wikikube-worker2030.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2088.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2055.codfw.wmnet, mw2353.codfw.wmnet, mw2397.codfw.wmnet, mw2413.codfw.wmnet, mw2368.codfw.wmnet, wikikube-worker2059.codfw.wmnet, mw2451.codfw.wmnet, 
kubernetes2044.codfw.wmnet, parse2015.codfw.wmnet, wikikube-worker2049.codfw.wmnet, parse2008.codfw.wmnet, wikikube-worker2024.codfw.wmnet, wikiku [01:45:20] r2066.codfw.wmnet, wikikube-worker2034.codfw.wmnet, mw2437.codfw.wmnet, wikikube-worker2046.codfw.wmnet, mw2366.codfw.wmnet, mw2337.codfw.wmnet, mw2354.codfw.wmnet, wikikube-worker2053. https://wikitech.wikimedia.org/wiki/PyBal [01:46:20] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:49:24] (CR) Krinkle: [C:+1] Remove unused $wgAllowRequiringEmailForResets [mediawiki-config] - https://gerrit.wikimedia.org/r/1065299 (https://phabricator.wikimedia.org/T242406) (owner: Bartosz Dziewoński) [02:16:55] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:21:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:22:06] FIRING: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:26:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:26:32] (PS1) Krinkle: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1071287 [02:27:06] RESOLVED: ProbeDown: Service people1004:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:28:44] (PS2) Krinkle: Remove unused wgStatsMethod, wgResourceLoaderClientPreferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1071287 [02:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 2m 17s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:34:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:36:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:36:55] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:37:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [03:38:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle 
PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:40:30] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2021.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2056.codfw.wmnet, parse2006.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2447.codfw.wmnet, wikikube-worker2077.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, wikikube-worker2091.codfw.w [03:40:30] kikube-worker2040.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2052.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2065.codfw.wmnet, mw2302.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2394.codfw.wmnet, mw2314.codfw.wmnet, wikikube-worker2059.codfw.wmnet, parse2012.codfw.wmnet, mw2444.codfw.wmnet, wikikube-worker2101.c [03:40:30] et, wikikube-worker2018.codfw.wmnet, wikikube-worker2048.codfw.wmnet, kubernetes2051.codfw.wmnet, wikikube-worker2049.codfw.wmnet, wikikube-worker2003.codfw.wmnet, parse2007.codfw.wmnet https://wikitech.wikimedia.org/wiki/PyBal [03:41:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:41:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:42:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr4-ulsfo.wikimedia.org 
recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [03:51:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 39.64s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:52:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:52:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:57:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:57:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:59:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:59:14] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:01:00] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52630 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:01:04] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:07:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:13:38] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, wikikube-worker2033.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2026.codfw.wmnet, 
parse2009.codfw.wmnet, wikikube-worker2084.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmn [04:13:38] rnetes2059.codfw.wmnet, parse2018.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2071.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2027.codfw.wmnet, wikikube-worker2097.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2090.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2397.codfw.wmnet, wikikube-worker2059.codfw.wm [04:13:38] ernetes2042.codfw.wmnet, wikikube-worker2098.codfw.wmnet, kubernetes2013.codfw.wmnet, wikikube-worker2103.codfw.wmnet, parse2012.codfw.wmnet, mw2399.codfw.wmnet, kubernetes2036.codfw.wm https://wikitech.wikimedia.org/wiki/PyBal [04:13:42] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2021.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2102.codfw.wmnet, mw2375.codfw.wmnet, mw2427.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2338.codfw.wmnet, parse2009.co [04:13:42] t, wikikube-worker2084.codfw.wmnet, kubernetes2052.codfw.wmnet, mw2443.codfw.wmnet, kubernetes2048.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2059.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2083.codfw.wmnet, mw2315.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2022.codfw.wmnet, kubernetes2056.codfw.wmnet, parse2020.codfw.wmnet, [04:13:42] -worker2030.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worker2023.codfw.wmnet, 
https://wikitech.wikimedia.org/wiki/PyBal [04:14:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:14:42] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:31:55] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:36:55] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:38:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:10:06] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:30:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2.625s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:35:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 1.875s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:49:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 3.523s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:54:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 3.523s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for 
poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:10:06] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:31:50] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2083.codfw.wmnet, kubernetes2050.codfw.wmnet, mw2431.codfw.wmnet, mw2397.codfw.wmnet, kubernetes2044.codfw.wmnet, mw2395.codfw.wmnet, mw2426.codfw.wmnet, mw2371.codfw.wmnet, wikikube-worker2012.codfw.wmnet, wikikube-worker2074.codfw.wmnet, mw2412.codfw.wmnet, mw2436.codfw.wmnet, kubernetes2038.codfw.wmnet [06:31:50] etes2011.codfw.wmnet, kubernetes2043.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:32:50] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:36:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 9.583s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:41:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 9.583s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:00:54] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, wikikube-worker2021.codfw.wmnet, mw2396.codfw.wmnet, parse2001.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2056.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2046.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2064.codfw.wmnet, [07:00:54] e-worker2036.codfw.wmnet, mw2338.codfw.wmnet, wikikube-worker2084.codfw.wmnet, mw2368.codfw.wmnet, kubernetes2014.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2077.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2071.codfw.wmnet, parse2004.codfw.wmnet, wikikube-worker2044.codfw.wmnet, wikikube-worker2010.codfw.wmnet, wikikube-worker2086.codfw.wmnet, parse2020.codfw.wmnet, mw2352.codfw.wmnet, wikikube-worker2043.codfw.wmnet, [07:00:54] -worker2096.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2065.codfw.wmnet, mw2359.codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2090.codfw.wmnet, mw2302.codfw.wmnet, w https://wikitech.wikimedia.org/wiki/PyBal [07:00:54] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2063.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2036.codfw.wmnet, mw2447.codfw.wmnet, m [07:00:54] dfw.wmnet, wikikube-worker2099.codfw.wmnet, mw2443.codfw.wmnet, 
kubernetes2048.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2071.codfw.wmnet, mw2351.codfw.wmnet, mw2440.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2039.codfw.wmnet, wikikube-worker2027.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikik [07:00:54] er2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2088.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2313.codfw. https://wikitech.wikimedia.org/wiki/PyBal [07:02:54] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:02:54] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:30:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:35:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:28:56] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2396.codfw.wmnet, wikikube-worker2033.codfw.wmnet, wikikube-worker2102.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, 
kubernetes2052.codfw.wmnet, kubernetes2014.codfw.wmnet, mw2443.codfw.wmnet, wikikube-worker2077.codfw.wmnet, kubernetes2059.codfw.wmnet, wikik [08:28:56] er2040.codfw.wmnet, parse2018.codfw.wmnet, mw2431.codfw.wmnet, mw2351.codfw.wmnet, wikikube-worker2022.codfw.wmnet, mw2427.codfw.wmnet, mw2440.codfw.wmnet, kubernetes2042.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2002.codfw.wmnet, kubernetes2016.codfw.wmnet, mw2449.codfw.wmnet, mw2356.codfw.wmnet, parse [08:28:56] fw.wmnet, wikikube-worker2101.codfw.wmnet, wikikube-worker2018.codfw.wmnet, wikikube-worker2028.codfw.wmnet, kubernetes2044.codfw.wmnet, mw2301.codfw.wmnet, parse2015.codfw.wmnet, mw241 https://wikitech.wikimedia.org/wiki/PyBal [08:28:56] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2021.codfw.wmnet, mw2396.codfw.wmnet, wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2017.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2076.codfw.wmnet, kubernetes2059.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2071.codf [08:28:56] parse2004.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2431.codfw.wmnet, mw2427.codfw.wmnet, mw2440.codfw.wmnet, parse2020.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2096.codfw.wmnet, mw2398.codfw.wmnet, wikikube-worker2002.codfw.wmnet, mw2313.codfw.wmnet, wikikube-worker2090.codfw.wmnet, wikikube-worker2089.codfw.wmnet, kubernetes2039.codfw.wmnet, mw2353.codfw.wmnet, wikikube-worker2045.c [08:28:57] et, mw2394.codfw.wmnet, mw2356.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2399.codfw.wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2087.codfw.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [08:29:56] 
RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:29:56] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:36:55] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:40:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:45:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [09:12:06] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:13:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 36s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:18:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2m 36s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 40s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:07:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 40s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:12:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 43.85s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:17:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 
43.85s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:20:22] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 145239144 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:21:24] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6521800 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:28:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db1246.eqiad.wmnet with reason: https://phabricator.wikimedia.org/T374215 → server depooled has hardware issues [10:28:42] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db1246.eqiad.wmnet with reason: https://phabricator.wikimedia.org/T374215 → server depooled has hardware issues [10:30:24] !incidents [10:30:25] 5147 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr4-ulsfo.wikimedia.org) [10:30:25] 5146 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-codfw.wikimedia.org) [10:30:25] 5142 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [10:31:18] !ack 5138 [10:31:18] Attempt to ack incident 5138 failed. 
[10:31:59] Ah already resolved :) my vops App showed the incident as open [10:43:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:47:08] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2424.codfw.wmnet, kubernetes2052.codfw.wmnet, wikikube-worker2099.codfw.wmnet, kubernetes2014.codfw.wmnet, parse2003.codfw.wmnet, wikikube-worker2077.codfw.wmnet, parse2018.codfw.wmnet, mw2315.codfw.wmnet, wikikube-worker2039.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2002.codfw.wmnet, wikikube-worke [10:47:08] dfw.wmnet, wikikube-worker2089.codfw.wmnet, wikikube-worker2062.codfw.wmnet, parse2012.codfw.wmnet, mw2353.codfw.wmnet, mw2314.codfw.wmnet, mw2440.codfw.wmnet, mw2304.codfw.wmnet, wikikube-worker2018.codfw.wmnet, wikikube-worker2028.codfw.wmnet, wikikube-worker2013.codfw.wmnet, kubernetes2044.codfw.wmnet, kubernetes2051.codfw.wmnet, mw2372.codfw.wmnet, wikikube-worker2035.codfw.wmnet, wikikube-worker2024.codfw.wmnet, mw2447.codfw.wmnet, m [10:47:08] dfw.wmnet, wikikube-worker2088.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2021.codfw.wmnet, parse2002.codfw.wmnet, wikikube-worker2034.codfw.wmnet, mw2437.codfw.wmnet, kubernete https://wikitech.wikimedia.org/wiki/PyBal [10:47:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers kubernetes2046.codfw.wmnet, mw2424.codfw.wmnet, parse2017.codfw.wmnet, wikikube-worker2086.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2099.codfw.wmnet, 
kubernetes2014.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worke [10:47:08] dfw.wmnet, mw2431.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2043.codfw.wmnet, wikikube-worker2090.codfw.wmnet, mw2302.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-worker2050.codfw.wmnet, mw2397.codfw.wmnet, mw2394.codfw.wmnet, mw2356.codfw.wmnet, mw2314.codfw.wmnet, mw2429.codfw.wmnet, mw2419.codfw.wmnet, mw2304.codfw.wmnet, wikikube-worker2056.codfw.wmnet, mw2301.codfw.wmnet, mw2390.codfw.wm [10:47:08] se2008.codfw.wmnet, wikikube-worker2035.codfw.wmnet, wikikube-worker2031.codfw.wmnet, wikikube-worker2003.codfw.wmnet, mw2414.codfw.wmnet, parse2002.codfw.wmnet, wikikube-worker2080.cod https://wikitech.wikimedia.org/wiki/PyBal [10:48:08] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:48:08] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:01:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: asw2-d2-eqid <-> asw2-d4-eqiad vcp link flapping - https://phabricator.wikimedia.org/T374272#10127785 (10cmooney) Thanks @cdanis and @Southparkfan for the task! Logs relate to [[ https://netbox.wikimedia.org/dcim/interfaces/18059/trac... 
[11:07:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:15:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:20:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:30:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: 
eqiad mw-wikifunctions (k8s) 21.15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:35:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:35:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 17.59s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:36:55] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:49:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:07:15] FIRING: 
PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:08:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 6.25s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:13:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-wikifunctions (k8s) 2.031s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:16:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:26:15] 
RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:35:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:36:55] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:39:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:56:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:01:55] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:06:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:20:20] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [13:21:10] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29685 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [14:12:08] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:25:09] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071305 (https://phabricator.wikimedia.org/T219903) [14:31:53] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071305 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [14:33:29] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071305 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [14:34:00] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [14:34:15] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:34:16] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:34:35] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:34:36] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:34:57] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:36:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:12] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [14:38:14] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:38:15] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:38:17] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:38:18] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply 
[14:38:20] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:40:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:44:52] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:44:56] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:05:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:22] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:41:18] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:43:32] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:44:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52631 bytes in 2.942 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:44:14] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.235 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:44:22] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 12 Oct 2024 
12:50:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:36:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:36:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [17:14:08] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:36:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:36:25] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:40:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: 
Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:48:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.09% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.58% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:18:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.58% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:46:36] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: 
POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 189766864 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:47:36] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 68856 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:51:57] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [21:32:34] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers mw2396.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, mw2338.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, wikikube-worker2091.codfw.wmnet, wikikube-worker2040.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2010.codfw.wmnet, mw2351.codfw.wmnet, par [21:32:34] odfw.wmnet, kubernetes2006.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2062.codfw.wmnet, mw2429.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2013.codfw.wmnet, mw2399.codfw.wmnet, kubernetes2036.codfw.wmnet, wikikube-worker2075.codfw.wmnet, wikikube-worker2013.codfw.wmnet, kubernetes2051.codfw.wmnet, mw2416.codfw.wmnet, mw2376.codfw.wmnet, wikikube-worker2024.codfw.wmnet, mw2426.codfw [21:32:34] wikikube-worker2003.codfw.wmnet, wikikube-worker2094.codfw.wmnet, mw2414.codfw.wmnet, mw2369.codfw.wmnet, mw2437.codfw.wmnet, wikikube-worker2042.codfw.wmnet, mw2366.codfw.wmnet, mw2425 https://wikitech.wikimedia.org/wiki/PyBal [21:32:34] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-wikifunctions_4451: Servers wikikube-worker2033.codfw.wmnet, parse2017.codfw.wmnet, 
wikikube-worker2086.codfw.wmnet, parse2006.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2036.codfw.wmnet, parse2009.codfw.wmnet, wikikube-worker2091.codfw.wmnet, parse2018.codfw.wmnet, wikikube-worker2071.codfw.wmnet, wikikube-worker2044.codfw.wmnet [21:32:34] be-worker2022.codfw.wmnet, wikikube-worker2030.codfw.wmnet, mw2352.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2060.codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2055.codfw.wmnet, kubernetes2016.codfw.wmnet, wikikube-worker2045.codfw.wmnet, wikikube-worker2059.codfw.wmnet, mw2440.codfw.wmnet, mw2419.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2036.codfw.wmnet, wikikube-worker2101.codfw.wmnet, wikikub [21:32:34] 2075.codfw.wmnet, wikikube-worker2087.codfw.wmnet, wikikube-worker2028.codfw.wmnet, kubernetes2044.codfw.wmnet, parse2008.codfw.wmnet, mw2395.codfw.wmnet, mw2426.codfw.wmnet, wikikube-w https://wikitech.wikimedia.org/wiki/PyBal [21:33:34] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:33:34] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:20:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071334 [23:38:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1071334 (owner: 10TrainBranchBot)