[00:03:11] PROBLEM - Disk space on an-worker1131 is CRITICAL: DISK CRITICAL - free space: / 2102 MB (3% inode=95%): /tmp 2102 MB (3% inode=95%): /var/tmp 2102 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1131&var-datasource=eqiad+prometheus/ops [00:09:18] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1154145 [00:09:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1154145 (owner: TrainBranchBot) [00:30:06] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1154145 (owner: TrainBranchBot) [00:45:19] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:48:19] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:52:37] FIRING: SystemdUnitFailed: wmf_auto_restart_sfacctd.service on netflow7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [01:01:53] PROBLEM - Disk space on an-worker1154 is CRITICAL: DISK CRITICAL - free space: / 2047 MB (3% inode=95%): /tmp 2047 MB (3% inode=95%): /var/tmp 2047 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1154&var-datasource=eqiad+prometheus/ops [01:07:37] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [01:20:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [01:25:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [01:25:24] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10889742 (Stevemunene) [01:25:45] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard 
drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10889743 (Stevemunene) [01:28:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [01:38:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [01:43:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [01:48:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [02:03:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [02:06:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [02:09:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [02:11:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [02:14:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [02:16:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [02:36:05] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [02:36:11] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [02:36:21] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [02:36:21] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [02:36:31] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [02:36:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:36:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:37:03] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [02:37:09] RECOVERY - Swift https frontend on ms-fe2013 
is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift [02:37:19] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift [02:37:19] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift [02:37:29] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [02:37:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:37:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:37:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:41:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:42:43] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:42:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:44:05] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [02:44:13] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:44:13] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:44:27] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:44:39] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [02:44:39] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:44:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:44:57] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time 
https://wikitech.wikimedia.org/wiki/Swift [02:44:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:45:05] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [02:45:05] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [02:45:07] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:45:17] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [02:46:09] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.755 second response time https://wikitech.wikimedia.org/wiki/Swift [02:46:11] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [02:46:11] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.683 second response time https://wikitech.wikimedia.org/wiki/Swift [02:46:25] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.473 second 
response time https://wikitech.wikimedia.org/wiki/Swift [02:47:03] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [02:47:11] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.429 second response time https://wikitech.wikimedia.org/wiki/Swift [02:47:27] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:47:37] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:47:53] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:48:13] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:48:14] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:48:39] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.290 second response time https://wikitech.wikimedia.org/wiki/Swift [02:48:49] RECOVERY - Swift https 
backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.321 second response time https://wikitech.wikimedia.org/wiki/Swift [02:49:03] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [02:49:07] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.140 second response time https://wikitech.wikimedia.org/wiki/Swift [02:49:17] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [02:49:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:49] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [02:50:11] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [02:50:13] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.521 second response time https://wikitech.wikimedia.org/wiki/Swift [02:50:21] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [02:51:05] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.284 second response time https://wikitech.wikimedia.org/wiki/Swift [02:51:21] PROBLEM - Swift https frontend on ms-fe2014 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [02:51:41] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.535 second response time https://wikitech.wikimedia.org/wiki/Swift [02:52:05] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [02:52:17] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.486 second response time https://wikitech.wikimedia.org/wiki/Swift [02:52:23] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.034 second response time https://wikitech.wikimedia.org/wiki/Swift [02:53:03] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.719 second response time https://wikitech.wikimedia.org/wiki/Swift [02:53:05] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [02:53:05] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [02:53:23] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.224 second response time https://wikitech.wikimedia.org/wiki/Swift [02:54:03] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [02:54:05] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Swift [02:54:05] RECOVERY - Swift https backend on ms-fe2015 is 
OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.558 second response time https://wikitech.wikimedia.org/wiki/Swift [02:54:11] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.851 second response time https://wikitech.wikimedia.org/wiki/Swift [02:54:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:54:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:54:49] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.268 second response time https://wikitech.wikimedia.org/wiki/Swift [02:54:53] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:55:03] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift [02:55:05] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:55:15] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift [02:55:43] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [02:56:29] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:57:05] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second 
response time https://wikitech.wikimedia.org/wiki/Swift [02:57:11] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [02:57:11] !log eevans@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [02:57:25] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.559 second response time https://wikitech.wikimedia.org/wiki/Swift [02:57:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [02:58:05] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [02:58:13] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:58:14] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:58:25] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:58:39] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:58:45] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [02:59:05] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:59:15] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift [02:59:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:59:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:59:49] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:59:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:00:03] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift [03:00:03] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 
506 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [03:00:29] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [03:00:33] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [03:00:59] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:01:13] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [03:01:47] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:01:55] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [03:02:09] !log eevans@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [03:03:03] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:03:35] ok, that doesn't look promising [03:04:15] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [03:04:53] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [03:05:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 
bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [03:05:31] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [03:05:33] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [03:05:47] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.883 second response time https://wikitech.wikimedia.org/wiki/Swift [03:05:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:06:05] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:06:09] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.839 second response time https://wikitech.wikimedia.org/wiki/Swift [03:06:21] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [03:06:35] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [03:06:35] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.745 second response time https://wikitech.wikimedia.org/wiki/Swift [03:06:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [03:07:21] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.355 second response time https://wikitech.wikimedia.org/wiki/Swift [03:08:13] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [03:08:17] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [03:08:39] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:08:53] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [03:09:03] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [03:09:05] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [03:09:31] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [03:09:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:09:57] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 
200 OK - 297 bytes in 2.065 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:03] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.508 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:05] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.855 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:12] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:10:17] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.069 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:23] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.999 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:37] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:41] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.908 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:51] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.328 second response time https://wikitech.wikimedia.org/wiki/Swift [03:10:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:11:03] RECOVERY - Swift 
https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [03:11:41] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.434 second response time https://wikitech.wikimedia.org/wiki/Swift [03:11:53] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.625 second response time https://wikitech.wikimedia.org/wiki/Swift [03:12:55] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [03:13:03] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [03:13:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [03:13:11] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [03:13:33] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.524 second response time https://wikitech.wikimedia.org/wiki/Swift [03:13:39] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.500 second response time https://wikitech.wikimedia.org/wiki/Swift [03:14:03] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:14:53] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [03:14:55] RECOVERY - 
Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [03:15:12] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:15:35] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.642 second response time https://wikitech.wikimedia.org/wiki/Swift [03:15:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:16:01] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:16:03] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [03:16:15] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [03:16:21] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [03:16:31] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:16:37] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service 
Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [03:17:13] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [03:17:15] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.558 second response time https://wikitech.wikimedia.org/wiki/Swift [03:17:19] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Swift [03:17:33] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.332 second response time https://wikitech.wikimedia.org/wiki/Swift [03:17:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:17:55] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift [03:17:57] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.550 second response time https://wikitech.wikimedia.org/wiki/Swift [03:17:57] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [03:18:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:18:25] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:18:47] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [03:18:55] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.331 second response time https://wikitech.wikimedia.org/wiki/Swift [03:19:03] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.705 second response time https://wikitech.wikimedia.org/wiki/Swift [03:19:11] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift [03:19:13] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:19:15] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:15] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:21] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:31] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:20:43] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:20:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:21:55] 
PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [03:22:13] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [03:22:41] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.197 second response time https://wikitech.wikimedia.org/wiki/Swift [03:22:51] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [03:23:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [03:23:13] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:23:15] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.078 second response time https://wikitech.wikimedia.org/wiki/Swift [03:23:29] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.101 second response time https://wikitech.wikimedia.org/wiki/Swift [03:24:01] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.151 second response time https://wikitech.wikimedia.org/wiki/Swift [03:24:19] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.868 second response time https://wikitech.wikimedia.org/wiki/Swift [03:24:37] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:25:03] RECOVERY - Swift https backend on ms-fe2013 is 
OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [03:25:12] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:25:37] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.568 second response time https://wikitech.wikimedia.org/wiki/Swift [03:25:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:26:01] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:26:17] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [03:26:21] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:26:53] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.539 second response time https://wikitech.wikimedia.org/wiki/Swift [03:27:13] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift [03:27:29] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time 
https://wikitech.wikimedia.org/wiki/Swift [03:27:53] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:27:55] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [03:28:05] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:28:11] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Swift [03:28:13] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:28:37] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [03:28:55] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [03:29:07] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.222 second response time https://wikitech.wikimedia.org/wiki/Swift [03:29:15] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [03:29:21] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.463 second response time https://wikitech.wikimedia.org/wiki/Swift [03:29:51] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Swift [03:30:01] PROBLEM - Swift https frontend on 
ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:30:07] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.024 second response time https://wikitech.wikimedia.org/wiki/Swift [03:30:07] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.607 second response time https://wikitech.wikimedia.org/wiki/Swift [03:30:12] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:30:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:32:33] RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:32:37] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Swift [03:32:39] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:32:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:33:03] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service 
Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:33:35] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:33:53] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.519 second response time https://wikitech.wikimedia.org/wiki/Swift [03:34:15] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:34:41] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.818 second response time https://wikitech.wikimedia.org/wiki/Swift [03:34:43] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:35:12] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:35:31] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [03:35:33] PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:35:41] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.946 second response time https://wikitech.wikimedia.org/wiki/Swift [03:35:43] PROBLEM - PyBal 
backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:35:57] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:36:01] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:36:05] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.589 second response time https://wikitech.wikimedia.org/wiki/Swift [03:36:55] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:36:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:37:13] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [03:37:35] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.737 second response time https://wikitech.wikimedia.org/wiki/Swift [03:37:39] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [03:38:17] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:38:21] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:38:51] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [03:38:55] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Swift [03:39:03] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:39:05] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:39:15] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [03:39:17] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.004 second response time https://wikitech.wikimedia.org/wiki/Swift [03:39:29] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [03:39:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:39:55] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [03:40:03] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [03:40:45] RECOVERY - Swift https frontend on 
ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.971 second response time https://wikitech.wikimedia.org/wiki/Swift [03:42:31] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [03:43:05] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:43:53] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [03:44:15] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [03:44:33] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.431 second response time https://wikitech.wikimedia.org/wiki/Swift [03:44:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2009.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:45:15] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.735 second response time https://wikitech.wikimedia.org/wiki/Swift [03:45:17] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [03:45:33] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 
second response time https://wikitech.wikimedia.org/wiki/Swift [03:45:53] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:45:57] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [03:46:05] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:46:37] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [03:46:53] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.267 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:03] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:03] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:17] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.248 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:21] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:47:33] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:43] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 
7.122 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:53] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.798 second response time https://wikitech.wikimedia.org/wiki/Swift [03:47:57] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.913 second response time https://wikitech.wikimedia.org/wiki/Swift [03:48:15] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [03:48:21] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:48:39] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [03:48:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:49:05] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.893 second response time https://wikitech.wikimedia.org/wiki/Swift [03:49:23] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.612 second response time https://wikitech.wikimedia.org/wiki/Swift [03:50:37] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [03:51:17] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.937 second response time https://wikitech.wikimedia.org/wiki/Swift [03:51:17] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 
1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [03:51:39] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.392 second response time https://wikitech.wikimedia.org/wiki/Swift [03:51:43] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:51:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:51:53] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [03:52:05] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:52:13] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:52:15] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [03:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [03:52:39] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.687 second response time https://wikitech.wikimedia.org/wiki/Swift [03:52:43] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:52:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [03:52:55] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [03:53:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [03:53:13] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.853 second response time https://wikitech.wikimedia.org/wiki/Swift [03:53:35] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [03:53:57] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [03:54:03] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [03:54:07] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.855 second response time https://wikitech.wikimedia.org/wiki/Swift [03:54:33] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [03:54:43] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [03:54:53] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift 
[03:54:53] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.631 second response time https://wikitech.wikimedia.org/wiki/Swift [03:54:55] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [03:54:55] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [03:55:03] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [03:55:05] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:55:45] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [03:55:59] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.970 second response time https://wikitech.wikimedia.org/wiki/Swift [03:56:03] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [03:56:39] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 5.937 second response time https://wikitech.wikimedia.org/wiki/Swift [03:57:05] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time 
https://wikitech.wikimedia.org/wiki/Swift [03:57:21] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [03:57:23] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [03:58:53] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [03:58:53] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [03:59:07] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.234 second response time https://wikitech.wikimedia.org/wiki/Swift [03:59:17] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.164 second response time https://wikitech.wikimedia.org/wiki/Swift [03:59:19] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.762 second response time https://wikitech.wikimedia.org/wiki/Swift [04:00:05] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:00:33] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [04:01:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [04:01:05] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [04:01:13] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [04:01:21] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.984 second response time https://wikitech.wikimedia.org/wiki/Swift [04:01:31] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [04:01:43] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:01:51] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [04:01:53] PROBLEM - Disk space on an-worker1154 is CRITICAL: DISK CRITICAL - free space: / 2077 MB (3% inode=95%): /tmp 2077 MB (3% inode=95%): /var/tmp 2077 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1154&var-datasource=eqiad+prometheus/ops [04:01:55] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [04:01:55] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [04:01:57] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.807 second response time https://wikitech.wikimedia.org/wiki/Swift [04:02:13] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:02:43] 
PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:02:55] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift [04:03:13] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.969 second response time https://wikitech.wikimedia.org/wiki/Swift [04:03:33] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [04:03:35] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.879 second response time https://wikitech.wikimedia.org/wiki/Swift [04:03:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:04:07] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.787 second response time https://wikitech.wikimedia.org/wiki/Swift [04:05:03] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [04:05:21] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [04:05:57] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [04:06:07] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 
4.451 second response time https://wikitech.wikimedia.org/wiki/Swift [04:06:13] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [04:06:29] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.518 second response time https://wikitech.wikimedia.org/wiki/Swift [04:06:55] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [04:06:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:07:13] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:07:13] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.877 second response time https://wikitech.wikimedia.org/wiki/Swift [04:07:23] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.524 second response time https://wikitech.wikimedia.org/wiki/Swift [04:07:37] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:08:03] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Swift [04:08:05] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: 
HTTP/1.1 200 OK - 508 bytes in 2.120 second response time https://wikitech.wikimedia.org/wiki/Swift [04:08:51] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:09:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [04:10:45] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [04:10:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:11:47] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:12:37] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:14:37] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.206 second response time 
https://wikitech.wikimedia.org/wiki/Swift [04:14:39] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:14:53] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [04:15:31] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.241 second response time https://wikitech.wikimedia.org/wiki/Swift [04:15:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:16:01] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.806 second response time https://wikitech.wikimedia.org/wiki/Swift [04:17:05] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:17:37] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:17:43] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:17:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:18:05] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:18:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:22:23] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:23:03] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [04:23:13] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [04:23:51] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:23:53] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [04:24:53] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [04:25:03] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [04:25:13] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.263 second response time 
https://wikitech.wikimedia.org/wiki/Swift [04:25:53] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.552 second response time https://wikitech.wikimedia.org/wiki/Swift [04:25:55] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.211 second response time https://wikitech.wikimedia.org/wiki/Swift [04:26:11] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [04:30:27] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:32:33] RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:32:45] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:32:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:33:27] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:33:43] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:33:52] 
FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:35:33] PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:36:01] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [04:36:51] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [04:37:37] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:42:07] !includents [04:42:10] !incident [04:42:22] * swfrench-wmf should just stop trying to computer [04:42:25] !incidents [04:42:26] 6297 (ACKED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [04:42:26] 6307 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [04:42:26] 6306 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [04:42:27] 6299 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [04:42:27] 6305 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [04:42:27] 6298 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [04:42:27] 6304 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 
swift-https:443 probes/service http_swift-https_ip4 codfw) [04:42:28] 6303 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [04:42:28] 6302 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [04:42:29] 6301 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [04:42:29] 6300 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [04:42:30] 6294 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [04:42:30] 6296 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [04:42:31] 6295 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [04:42:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:42:58] that was the last one [04:43:13] <_joe_> urandom: don't jinx it [04:52:37] FIRING: SystemdUnitFailed: wmf_auto_restart_sfacctd.service on netflow7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [05:00:27] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:03:27] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:06:25] SRE-swift-storage, Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10889883 (Pppery) [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:07:37] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [05:08:15] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:11:15] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:32:33] RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:35:33] PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250606T0600) [06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:13:07] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-b3-magru [06:13:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru [06:14:54] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 [06:15:06] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10889915 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a3b2e616-b8d3-4368-a852-410542d355ee) set b...
[06:15:40] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 [06:15:53] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10889916 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0743c7f4-5226-4a34-8041-3f588f10811c) set b... [06:27:24] (PS1) Fabfur: hiera: x-provenance header on all DCs [puppet] - https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) [06:28:02] (CR) Fabfur: [C:-1] "Merge today if needed" [puppet] - https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) (owner: Fabfur) [06:28:26] (CR) Fabfur: [C:-1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) (owner: Fabfur) [06:48:15] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 1 AdminDown: 4 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:49:14] (PS1) Muehlenhoff: Remove netflow7001 [homer/public] - https://gerrit.wikimedia.org/r/1154161 (https://phabricator.wikimedia.org/T394263) [06:52:33] (CR) Ayounsi: [C:+1] Remove netflow7001 [homer/public] - https://gerrit.wikimedia.org/r/1154161 (https://phabricator.wikimedia.org/T394263) (owner: Muehlenhoff) [06:55:21] (CR) Muehlenhoff: [C:+2] Remove netflow7001 [homer/public] - https://gerrit.wikimedia.org/r/1154161 (https://phabricator.wikimedia.org/T394263) (owner: Muehlenhoff) [06:57:37] RESOLVED: SystemdUnitFailed: wmf_auto_restart_sfacctd.service on netflow7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250606T0700) [07:01:54] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T396190 (Rosalie_WMDE) NEW [07:02:26] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10889993 (Rosalie_WMDE) [07:02:38] (PS1) Majavah: team-wmcs: Adapt HAProxy alerts for x3 on the replicas [alerts] - https://gerrit.wikimedia.org/r/1154163 (https://phabricator.wikimedia.org/T390954) [07:08:52] SRE, SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10889997 (SLyngshede-WMF) [07:13:39] PROBLEM - MariaDB disk space on db2151 is CRITICAL: DISK CRITICAL - free space: / 2104MiB (5% inode=97%): /tmp 2104MiB (5% inode=97%): /var/tmp 2104MiB (5% inode=97%): https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:17:45] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10890001 (SLyngshede-WMF) @KFrancis I believe this requires a signed NDA. @WMDECyn we also n...
[07:22:59] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:25:20] (PS1) Slyngshede: data.yaml: Add user deerbee to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129) [07:26:07] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for "Dena WMDE" - https://phabricator.wikimedia.org/T396129#10890004 (SLyngshede-WMF) p:Triage→Medium a:SLyngshede-WMF [07:26:13] (CR) CI reject: [V:-1] data.yaml: Add user deerbee to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129) (owner: Slyngshede) [07:26:18] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890006 (SLyngshede-WMF) p:Triage→Medium a:SLyngshede-WMF [07:27:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:29:09] SRE-swift-storage, Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10890009 (MatthewVernon) This should be resolved now.
[07:33:14] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890010 (SLyngshede-WMF) User already have NDA and WMDE LDAP groups. [07:33:17] (PS1) Slyngshede: data.yaml: Add user rosalie-wmde to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190) [07:36:53] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3606 MB (3% inode=98%): /tmp 3606 MB (3% inode=98%): /var/tmp 3606 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [07:37:06] (PS2) Slyngshede: data.yaml: Add user deerbee to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129) [07:38:12] (CR) Muehlenhoff: [C:+1] "Looks good (once manager approval happened)" [puppet] - https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190) (owner: Slyngshede) [07:39:22] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129) (owner: Slyngshede) [07:40:35] PROBLEM - Disk space on db2151 is CRITICAL: DISK CRITICAL - free space: / 1210MiB (3% inode=97%): /tmp 1210MiB (3% inode=97%): /var/tmp 1210MiB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=db2151&var-datasource=codfw+prometheus/ops [07:41:15] (CR) Slyngshede: [C:+2] data.yaml: Add user deerbee to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129) (owner: Slyngshede) [07:42:37] FIRING: ProbeDown: Service
centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:42:42] (03CR) 10Muehlenhoff: [C:03+2] "Merged; iptables -L remained the same" [puppet] - 10https://gerrit.wikimedia.org/r/1153970 (owner: 10Muehlenhoff) [07:43:52] RESOLVED: ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:44:19] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:44:47] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890032 (10SLyngshede-WMF) @WMDE-leszek or @WMDECyn we just need an approval from a WMDE manager. [07:52:18] !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: T394543 [07:52:21] T394543: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543 [07:52:34] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890040 (10WMDE-leszek) Thanks @SLyngshede-WMF - approved on WMDE's behalf. 
[07:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [07:52:40] (03PS2) 10Slyngshede: data.yaml: Add user rosalie-wmde to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190) [07:53:32] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10890042 (10WMDE-leszek) Hey @SLyngshede-WMF - I have approved for WMDE above T395917#10880432 [07:53:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for "Dena WMDE" - https://phabricator.wikimedia.org/T396129#10890043 (10SLyngshede-WMF) 05Open→03Resolved [07:54:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10890044 (10SLyngshede-WMF) @WMDE-leszek Sorry, I completely missed that. Thank you. 
[07:59:47] (03PS2) 10Muehlenhoff: New structure for sshd_config starting with trixie (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) [08:01:55] (03CR) 10CI reject: [V:04-1] New structure for sshd_config starting with trixie (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [08:02:19] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190) (owner: 10Slyngshede) [08:02:33] RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:03:06] (03CR) 10Slyngshede: [C:03+2] data.yaml: Add user rosalie-wmde to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190) (owner: 10Slyngshede) [08:04:04] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890051 (10SLyngshede-WMF) 05Open→03Resolved [08:05:26] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10890052 (10SLyngshede-WMF) [08:07:22] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10890053 (10SLyngshede-WMF) p:05Triage→03Medium a:05DMburugu→03SLyngshede-WMF [08:08:41] (03PS1) 10Slyngshede: data.yaml: Add user kgraessle to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370) [08:10:38] !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet 
[08:10:45] (03CR) 10Muehlenhoff: [C:03+1] "Looks good (once approval by Tyler happened)" [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370) (owner: 10Slyngshede) [08:11:09] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10890059 (10SLyngshede-WMF) [08:11:20] !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet [08:17:37] RESOLVED: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [08:18:30] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10890066 (10SLyngshede-WMF) [08:19:39] (03PS3) 10Volans: sre.hardware.upgrade-firmware: add support for SSD [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) [08:22:06] !log volans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet [08:22:07] !log volans@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet [08:23:10] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 63199 [08:23:56] !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet [08:24:11] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 [08:24:15] !log volans@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet 
[08:25:09] (03CR) 10Volans: [C:03+2] "Tested on `cirrussearch2113`, merging to unblock upgrade, I'll be happy to apply changes post-merge." [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) (owner: 10Volans) [08:25:38] (03PS1) 10Slyngshede: data.yaml: Add user guilherme to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154235 (https://phabricator.wikimedia.org/T395428) [08:25:39] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-ctrl_6443: Servers ml-serve-ctrl1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:25:58] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10890078 (10Volans) Thanks @RKemper for the depool, I've performed the final run with the current PS in gerrit with test-cookbook for `ci... [08:25:58] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 199524 [08:26:39] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:27:08] (03CR) 10Slyngshede: "Tyler created the task, I took that is implicit approval :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370) (owner: 10Slyngshede) [08:28:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 [08:28:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10890082 (10SLyngshede-WMF) SSH key verified via Slack. 
[08:29:31] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10890085 (10SLyngshede-WMF) p:05Triage→03Medium @Arrbee Ping [08:30:05] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 13150 [08:30:27] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:30:41] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 [08:31:13] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: add support for SSD [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) (owner: 10Volans) [08:32:23] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 46562 [08:32:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 [08:33:31] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 394065 [08:33:52] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 [08:34:43] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 46997 [08:35:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 [08:36:50] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 398044 [08:37:05] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 [08:42:42] (03PS1) 10Federico Ceratto: mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) [08:42:42] 
(03CR) 10Federico Ceratto: "A little utility function that adds graceful failure modes for Phabricator tasks updates and simplifies the cookbook logic." [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) (owner: 10Federico Ceratto) [08:44:25] 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10890113 (10TheDJ) >>! In T396186#10890009, @MatthewVernon wrote: > This should be resolved now. Because you fixed it, or due to auto reconciliation ? [08:46:02] (03CR) 10CI reject: [V:04-1] mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) (owner: 10Federico Ceratto) [08:46:25] (03CR) 10Muehlenhoff: [C:03+1] "Fair enough :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370) (owner: 10Slyngshede) [08:47:27] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1154235 (https://phabricator.wikimedia.org/T395428) (owner: 10Slyngshede) [08:56:38] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10890120 (10Volans) Forgot to mention, this is what I used to upgrade just the SSD firmware: `cookbook sre.hardware.upgrade-firmware -c... [08:57:25] (03PS2) 10Federico Ceratto: mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) [08:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [09:00:22] (03CR) 10Slyngshede: [C:03+2] data.yaml: Add user guilherme to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154235 (https://phabricator.wikimedia.org/T395428) (owner: 10Slyngshede) [09:03:00] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10890144 (10SLyngshede-WMF) 05In progress→03Resolved [09:03:11] (03CR) 10Slyngshede: [C:03+2] data.yaml: Add user kgraessle to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370) (owner: 10Slyngshede) [09:03:56] (03CR) 10CI reject: [V:04-1] mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) (owner: 10Federico Ceratto) [09:04:24] 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10890148 (10MatthewVernon) My colleagues in SRE applied some filtering to problematic traffic. [09:04:56] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10890149 (10SLyngshede-WMF) 05In progress→03Resolved [09:04:56] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:05:27] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:06:02] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10890151 (10SLyngshede-WMF) a:05SLyngshede-WMF→03cmelo [09:09:48] 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10890177 (10TheDJ) >>! In T396186#10890148, @MatthewVernon wrote: > My colleagues in SRE applied some filtering to problematic traffic. My goodness. It's a shame that simply keepin... [09:14:27] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:16:53] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3438 MB (3% inode=98%): /tmp 3438 MB (3% inode=98%): /var/tmp 3438 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [09:25:03] (03PS3) 10Muehlenhoff: New structure for sshd_config starting with trixie (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) [09:28:05] (03PS3) 10Federico Ceratto: mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) [09:29:29] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10890244 (10SLyngshede-WMF) 05Open→03Resolved User can reopen if access is not working. [09:34:11] (03CR) 10FNegri: [C:03+1] "good catch! I wonder if we should just go with [a-z] instead of [sx] to capture possible future sections?
I'll let you decide what you thi" [alerts] - 10https://gerrit.wikimedia.org/r/1154163 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [09:39:33] !incidents [09:39:33] You're not allowed to perform this action. [09:40:05] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up [09:40:23] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up [09:48:14] (03CR) 10Majavah: [C:03+2] team-wmcs: Adapt HAProxy alerts for x3 on the replicas [alerts] - 10https://gerrit.wikimedia.org/r/1154163 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [09:49:29] (03Merged) 10jenkins-bot: team-wmcs: Adapt HAProxy alerts for x3 on the replicas [alerts] - 10https://gerrit.wikimedia.org/r/1154163 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [10:00:41] (03PS1) 10Tiziano Fogli: prometheus::pop: deploy instances according to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) [10:01:15] (03CR) 10CI reject: [V:04-1] prometheus::pop: deploy instances according to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [10:02:29] (03PS2) 10Tiziano Fogli: prometheus::pop: deploy instances according to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) [10:03:03] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [10:05:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:07:31] (03PS3) 10Tiziano Fogli: prometheus::pop: temporarily exclude ops instance from prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) [10:08:20] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [10:09:04] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus::pop: temporarily exclude ops instance from prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [10:10:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:17:06] (03PS1) 10Slyngshede: Docker: Add missing bs4 package [software/bitu] - 10https://gerrit.wikimedia.org/r/1154260 (https://phabricator.wikimedia.org/T396103) [10:18:10] (03PS1) 10Muehlenhoff: ssh: Stop managing /run/sshd with Trixie and later [puppet] - 10https://gerrit.wikimedia.org/r/1154261 [10:18:25] (03CR) 10Slyngshede: [C:03+2] Docker: Add missing bs4 package [software/bitu] - 10https://gerrit.wikimedia.org/r/1154260 (https://phabricator.wikimedia.org/T396103) (owner: 10Slyngshede) [10:18:27] (03CR) 10Slyngshede: [V:03+2 C:03+2] Docker: Add missing bs4 package [software/bitu] - 10https://gerrit.wikimedia.org/r/1154260 (https://phabricator.wikimedia.org/T396103) (owner: 10Slyngshede) [10:18:43] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10890398 (10MatthewVernon) @Jhancock.wm I've had a look at the web-iDRAC, and the serial number I found above 
(9120A025F1QF) does correspond to the device in slot 14, which the web... [10:21:32] (03Merged) 10jenkins-bot: Docker: Add missing bs4 package [software/bitu] - 10https://gerrit.wikimedia.org/r/1154260 (https://phabricator.wikimedia.org/T396103) (owner: 10Slyngshede) [10:21:37] (03PS1) 10Slyngshede: P:idm Enable API [puppet] - 10https://gerrit.wikimedia.org/r/1154262 (https://phabricator.wikimedia.org/T364605) [10:21:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154261 (owner: 10Muehlenhoff) [10:22:38] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5786/co" [puppet] - 10https://gerrit.wikimedia.org/r/1154262 (https://phabricator.wikimedia.org/T364605) (owner: 10Slyngshede) [10:29:02] (03CR) 10Tiziano Fogli: [C:03+2] prometheus7002: assign replica_label [puppet] - 10https://gerrit.wikimedia.org/r/1154046 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [10:29:47] (03PS4) 10Tiziano Fogli: prometheus::pop: temporarily exclude ops instance from prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) [10:29:53] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154046 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [10:31:29] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [10:36:02] (03PS5) 10Btullis: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) [10:38:52] Hey folks, I am going to run a maintenance job on the deployment server for afwiki [10:39:06] `extensions/ORES/maintenance/PopulateDatabase.php` [10:44:42] (03CR) 10Filippo Giunchedi:
[C:03+1] ssh: Stop managing /run/sshd with Trixie and later [puppet] - 10https://gerrit.wikimedia.org/r/1154261 (owner: 10Muehlenhoff) [10:45:53] (03CR) 10Tiziano Fogli: [C:03+2] prometheus::pop: temporarily exclude ops instance from prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [10:50:37] 06SRE, 10SRE-swift-storage: Consider increasing swift workers on proxy nodes to 32 - https://phabricator.wikimedia.org/T396203 (10MatthewVernon) 03NEW [10:56:53] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [10:59:12] (03Abandoned) 10Muehlenhoff: Also add replica label for the new upcoming prometheus7002 node [puppet] - 10https://gerrit.wikimedia.org/r/1153126 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250606T0700) [11:00:04] jelto, arnoldokoth, and mutante: Time to do the GitLab version upgrades deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250606T1100). 
[11:04:44] (03CR) 10JMeybohm: [C:03+2] admin_ng: Split envoyfilters installation into a separate release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153979 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [11:04:46] (03CR) 10JMeybohm: [C:03+2] admin_ng: Fix dependencies/needs of helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153982 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [11:07:48] (03PS1) 10Vgutierrez: liberica: Don't bind VIPs to lo interface with katran [puppet] - 10https://gerrit.wikimedia.org/r/1154268 (https://phabricator.wikimedia.org/T395228) [11:11:19] (03Merged) 10jenkins-bot: admin_ng: Split envoyfilters installation into a separate release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153979 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [11:11:37] (03PS1) 10Hnowlan: rest-gateway: per-route statistics option, enable for lists and wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154269 [11:11:38] (03Merged) 10jenkins-bot: admin_ng: Fix dependencies/needs of helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153982 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm) [11:12:03] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154268 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [11:12:42] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:12:57] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:16:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:17:07] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[11:17:56] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:18:33] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:18:53] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:21:08] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:21:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:22:25] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [11:25:54] !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [11:26:39] !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [11:27:17] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10890602 (10Ladsgroup) [11:27:37] !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:27:41] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:28:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:28:24] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:28:28] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:28:47] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:28:51] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:30:52] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [11:30:56] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:31:36] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:31:41] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:32:05] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:32:09] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:32:23] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:32:51] (03CR) 10Alexandros Kosiaris: [C:03+1] rest-gateway: per-route statistics option, enable for lists and wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154269 (owner: 10Hnowlan) [11:33:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:36:19] (03PS3) 10Hnowlan: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber) [11:37:09] (03PS4) 10Hnowlan: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber) [11:37:51] (03CR) 10Hnowlan: [C:03+2] rest-gateway: per-route statistics option, enable for lists and wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154269 (owner: 10Hnowlan) [11:38:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: 
CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:39:38] (03Merged) 10jenkins-bot: rest-gateway: per-route statistics option, enable for lists and wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154269 (owner: 10Hnowlan) [11:42:06] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:42:19] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:44:53] (03CR) 10CI reject: [V:04-1] imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber) [11:48:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [11:49:20] (03PS1) 10Marostegui: ms3: Migrate to parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/1154276 [11:51:01] (03CR) 10Marostegui: [C:03+2] ms3: Migrate to parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/1154276 (owner: 10Marostegui) [11:51:51] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:52:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications [11:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [11:58:41] RECOVERY - MariaDB disk space on db2151 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:00:35] RECOVERY - Disk space on db2151 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=db2151&var-datasource=codfw+prometheus/ops [12:03:35] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:03:43] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:04:57] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1154268 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [12:19:58] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet [12:20:01] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 [12:20:19] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 [12:22:46] (03PS1) 10Sergio Gimeno: [beta] GrowthExperiments: enable limiting add a link task via config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393923) [12:23:52] (03PS2) 10Sergio Gimeno: [beta] GrowthExperiments: enable limiting add a link task via config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) [12:24:40] 06SRE, 06Infrastructure-Foundations, 10netops: 
Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026#10890719 (10fgiunchedi) a:05fgiunchedi→03None [12:25:45] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10890720 (10MoritzMuehlenhoff) [12:30:53] (03PS16) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) [12:31:00] (03CR) 10CI reject: [V:04-1] BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [12:31:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) (owner: 10SD0001) [12:46:17] (03CR) 10Vgutierrez: [C:03+2] liberica: Don't bind VIPs to lo interface with katran [puppet] - 10https://gerrit.wikimedia.org/r/1154268 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [12:50:19] (03PS1) 10Tiziano Fogli: prometheus::pop: enable rsyncd on magru [puppet] - 10https://gerrit.wikimedia.org/r/1154284 (https://phabricator.wikimedia.org/T395130) [12:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [12:58:24] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns4003*} and (A:dnsbox) [12:58:24] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org [12:58:55] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns5003*} and (A:dnsbox) [12:58:55] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org [12:59:51] fceratto@cumin1002 clone (PID 2148058) is awaiting input [13:04:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:04:31] ^ expected, dns5003 [13:05:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1154284 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [13:05:34] (03CR) 10Tiziano Fogli: [C:03+2] prometheus::pop: enable rsyncd on magru [puppet] - 10https://gerrit.wikimedia.org/r/1154284 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [13:06:20] (03CR) 10Muehlenhoff: [C:03+1] "Maybe add a tick box to https://phabricator.wikimedia.org/T395130 to remove it at the end of the migration, it's quite easy to miss" [puppet] - 10https://gerrit.wikimedia.org/r/1154284 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli) [13:06:38] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:09:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:09:46] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting 
dns4003.wikimedia.org [13:09:46] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns4003*} and (A:dnsbox) [13:11:12] (03CR) 10Ayounsi: ASW Templates: modify Jinja templates step 1 (try 2) (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [13:12:10] cmooney@cumin1003 netbox (PID 450379) is awaiting input [13:13:01] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org [13:13:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns5003*} and (A:dnsbox) [13:13:40] 10SRE-SLO: Add a section to the SLO template that explains Pyrra's dashboards and alerts - https://phabricator.wikimedia.org/T395920#10890859 (10akosiaris) Thanks for this! I 've also landed a round of updates today in https://wikitech.wikimedia.org/w/index.php?title=SLO/Template_instructions/Dashboards_and_aler... 
[13:14:10] 06SRE, 06Traffic: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#10890860 (10ssingh) a:03CDobbins [13:14:39] (03PS1) 10Vgutierrez: Revert^4 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154286 [13:15:10] (03PS2) 10Vgutierrez: Revert^4 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154286 (https://phabricator.wikimedia.org/T395228) [13:15:16] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154286 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [13:16:43] (03CR) 10Ssingh: [C:03+1] Revert^4 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154286 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [13:18:34] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns3003*} and (A:dnsbox) [13:18:34] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org [13:18:47] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns6001*} and (A:dnsbox) [13:18:47] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org [13:21:40] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" [13:21:44] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" [13:21:44] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:22:49] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:22:49] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:23:04] ^ expected [13:25:49] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:27:49] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:29:31] (03CR) 10Ssingh: "Looks good. I have two questions in-line before we add others for review to see if it is something in the data itself (and our calculation" [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [13:30:56] (03PS4) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) [13:31:04] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org [13:31:04] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns6001*} and (A:dnsbox) [13:31:27] (03CR) 10Vgutierrez: [C:03+2] Revert^4 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154286 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [13:32:04] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org [13:32:04] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns3003*} and (A:dnsbox) [13:32:45] (03CR) 10Cathal Mooney: "Thanks for the review, should be ok in latest patchset and no-op on switches (tested different types - https://phabricator.wikimedia.org/P" [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [13:34:57] !log 
sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns1006*} and (A:dnsbox) [13:34:57] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org [13:35:02] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns2006*} and (A:dnsbox) [13:35:02] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org [13:36:34] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10890941 (10MoritzMuehlenhoff) [13:36:43] PROBLEM - Host ntp-c.anycast.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [13:36:51] ^ hmm [13:36:52] ok [13:37:10] we have redundancy here, so nothing to worry about, ntp-[ab] [13:37:24] should be back up soon [13:37:30] FIRING: LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [13:38:28] (03PS5) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) [13:39:10] FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:39:17] ^ expected [13:40:24] !log vgutierrez@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran [13:40:33] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and
A:liberica [13:40:38] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [13:41:36] (03PS1) 10Vgutierrez: Revert^4 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154294 [13:41:48] (03PS3) 10JMeybohm: Use Wikimedia DNS IPs as mock [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153975 (https://phabricator.wikimedia.org/T396107) [13:41:50] (03PS3) 10JMeybohm: calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107) [13:41:53] (03PS3) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107) [13:41:55] (03PS3) 10JMeybohm: cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107) [13:42:14] (03PS2) 10Vgutierrez: Revert^4 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154294 (https://phabricator.wikimedia.org/T395228) [13:42:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154294 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [13:42:30] RESOLVED: LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [13:42:31] (03PS6) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) [13:44:08] (03CR) 10Ayounsi: [C:03+1] 
"lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [13:44:10] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:46:25] (03PS1) 10Muehlenhoff: Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) [13:46:44] (03CR) 10Ssingh: [C:03+1] Revert^4 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154294 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [13:47:30] (03CR) 10Cathal Mooney: [C:03+2] ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [13:48:04] (03Merged) 10jenkins-bot: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [13:49:03] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org [13:49:03] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns1006*} and (A:dnsbox) [13:49:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:49:51] RECOVERY - Host ntp-c.anycast.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [13:50:01] (03PS2) 10Muehlenhoff: Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) [13:50:58] (03CR) 10Ssingh: [C:03+1] Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff) 
[13:51:12] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org [13:51:13] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns2006*} and (A:dnsbox) [13:52:14] (03PS1) 10Hnowlan: (api|rest)-gateway: remove envoyproxy annotation, scrape all ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154298 [13:55:03] (03CR) 10JHathaway: [C:03+1] Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff) [13:57:21] (03PS5) 10Hnowlan: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber) [13:58:56] (03CR) 10A smart kitten: "Judging by T388531#10887779, it might be okay to go ahead with this patch now?" [puppet] - 10https://gerrit.wikimedia.org/r/1150624 (https://phabricator.wikimedia.org/T388531) (owner: 10Hnowlan) [14:09:36] (03PS1) 10Ebernhardson: EventStream: Enable hive ingeestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 [14:10:07] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns2004*} and (A:dnsbox) [14:10:07] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org [14:10:15] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns1005*} and (A:dnsbox) [14:10:15] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org [14:12:47] (03CR) 10Hnowlan: imagemagick: ignore all py3exiv2 exceptions (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber) [14:14:54] (03PS1) 10Alexandros Kosiaris: registry: Minor Puppet cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1154301 
(https://phabricator.wikimedia.org/T390251) [14:14:56] (03PS1) 10Alexandros Kosiaris: docker_registry_ha: Refactor to make it docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) [14:15:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:15:38] (03CR) 10CI reject: [V:04-1] docker_registry_ha: Refactor to make it docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [14:18:04] (03PS2) 10Hnowlan: (api|rest)-gateway: remove envoyproxy annotation, scrape all ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154298 [14:20:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:20:36] (03CR) 10Alexandros Kosiaris: [C:03+2] registry: Minor Puppet cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1154301 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [14:22:57] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org [14:22:57] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns1005*} and (A:dnsbox) [14:23:36] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org [14:23:36] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns2004*} and (A:dnsbox) [14:23:44] !log sukhe@dns1004 START - running authdns-update [14:24:23] !log sukhe@dns1004 END - running authdns-update [14:25:05] (03CR) 10Vgutierrez: [C:03+2] Revert^4 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154294 
(https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [14:26:42] (03PS1) 10Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [14:27:06] (03CR) 10CI reject: [V:04-1] [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:28:20] (03PS2) 10Alexandros Kosiaris: docker_registry_ha: Refactor to make it docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) [14:28:33] (03PS2) 10Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [14:29:01] (03CR) 10CI reject: [V:04-1] [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:29:39] (03PS3) 10Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [14:30:02] (03CR) 10CI reject: [V:04-1] [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [14:31:17] PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:40] (03CR) 10Jforrester: [C:03+1] "Let's try it out!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154128 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [14:32:24] (03PS4) 10Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 [14:34:42] (03Abandoned) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [14:37:35] 06SRE, 10Wikimedia-Mailing-lists: Create plwiki Oversight list - https://phabricator.wikimedia.org/T396083#10891198 (10Ladsgroup) 05Open→03Resolved I created it and added all current oversights as admins (looking up their emails in the database, just in case). https://lists.wikimedia.org/postorius/list... [14:42:56] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:53:10] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:53:58] (03CR) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [14:56:16] (03CR) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [15:02:58] (03PS1) 10Vgutierrez: Revert^5 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154307 [15:03:50] (03PS2) 10Vgutierrez: Revert^5 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154307 (https://phabricator.wikimedia.org/T395228) [15:03:59] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154307 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [15:04:43] (03PS1) 10Tchanders: Document that IP reveal permissions can't just be reassigned [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154308 
(https://phabricator.wikimedia.org/T396217) [15:05:40] (03CR) 10Vgutierrez: [C:03+2] Revert^5 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154307 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:24] (03CR) 10JMeybohm: [C:03+1] (api|rest)-gateway: remove envoyproxy annotation, scrape all ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154298 (owner: 10Hnowlan) [15:11:42] FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:52] ^^that's me [15:16:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:40] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet [15:19:41] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet [15:19:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) (owner: 10Sergio Gimeno) [15:20:11] (03PS1) 10Vgutierrez: Revert^5 "hiera: Depool lvs1013 before switching to katran" [puppet] - 
10https://gerrit.wikimedia.org/r/1154311 [15:21:07] (03CR) 10Ssingh: [C:03+1] Revert^5 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154311 (owner: 10Vgutierrez) [15:21:21] (03CR) 10Vgutierrez: [C:03+2] Revert^5 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154311 (owner: 10Vgutierrez) [15:21:36] (03CR) 10Ssingh: "Adding cdanis for their input as well on the general methodology for their awareness and feedback if any. Otherwise my recommendation is t" [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [15:23:11] PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2117 MB (3% inode=95%): /tmp 2117 MB (3% inode=95%): /var/tmp 2117 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [15:23:55] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [15:24:11] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica [15:24:25] (03PS4) 10Ssingh: sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 [15:24:33] (03CR) 10Ssingh: "Thinking about this a bit more and based on some conversations here on this CR and on IRC: I am going to merge this one and we can come ba" [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [15:26:18] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2324:9290 - https://phabricator.wikimedia.org/T396067#10891404 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cable on both ends. 
alert cleared [15:26:39] RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [15:26:42] (03CR) 10Ssingh: "For further clarity: one of the reasons _against_ merging the two is the different purpose they serve. And at least in the Traffic realm, " [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [15:27:49] (03CR) 10Aleksandar Mastilovic: "MR that introduced the `/config` directory has been merged into `airflow-dags`. Do we need someone else to approve this CR so we can merge" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153704 (https://phabricator.wikimedia.org/T383931) (owner: 10Aleksandar Mastilovic) [15:32:14] (03CR) 10Ssingh: [V:03+2 C:03+2] sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [15:33:00] (03PS2) 10BryanDavis: shellbox-syntaxhighlight: Bump to 2025-06-05-215815 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) [15:34:42] !log eevans@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [15:36:22] (03CR) 10CDanis: "+1 to, for now, just skipping any changes to the countries with sample sizes <50" [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [15:36:37] (03CR) 10BryanDavis: "Going with Scott's suggestion in Phabricator to only bump syntaxhighlight myself and let him do the bigger work of catching the other depl" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) (owner: 10BryanDavis) [15:39:08] (03CR) 10Scott French: [C:03+1] "Thanks, Bryan!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) (owner: 10BryanDavis) [15:40:12] (03Merged) 10jenkins-bot: sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - 10https://gerrit.wikimedia.org/r/1152781 (owner: 10Ssingh) [15:40:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10891475 (10ssingh) >>! In T387145#10886841, @cmooney wrote: >> Check with @cmooney for changes required to hieradata/common/lvs/interfaces.yaml (to add lvs1016 there) and also... [15:41:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [15:45:41] (03CR) 10SBassett: [C:03+1] alertmanager: adjust phab project to security-team rather than security tag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1150624 (https://phabricator.wikimedia.org/T388531) (owner: 10Hnowlan) [15:47:17] (03PS1) 10Federico Ceratto: team-data-persistence: Add predictive disk space alerts [alerts] - 10https://gerrit.wikimedia.org/r/1154314 [15:47:17] (03CR) 10Federico Ceratto: "An initial experiment with predictions - not enabled for paging yet" [alerts] - 10https://gerrit.wikimedia.org/r/1154314 (owner: 10Federico Ceratto) [15:49:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [15:51:29] (03PS11) 10CDanis: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - 10https://gerrit.wikimedia.org/r/1154303 (owner: 10Ayounsi) [15:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - 
https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [15:54:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [15:55:30] Hey all - I know it’s Friday, but I’d like to get a minor update deployed to a private security mitigation that should fix https://phabricator.wikimedia.org/T396111. Let me know if there are any objections. [15:57:06] 10ops-codfw, 06SRE, 06DC-Ops: mc-misc2001 won't power up - https://phabricator.wikimedia.org/T395526#10891553 (10Jhancock.wm) sent them log info about the server. they're still investigating cause. [16:03:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10891582 (10Jhancock.wm) @jhathaway are you still testing the 1G link? i got an automated ticket for the port that it's connected to. 
[16:05:23] (03CR) 10SBassett: "After discussing with @emill-ctr@wikimedia.org, I think we'd prefer to keep this in place on beta for all but CU, which shouldn't matter a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153674 (https://phabricator.wikimedia.org/T396061) (owner: 10Lucas Werkmeister) [16:08:08] !log Deployed security update to fix T396111 [16:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:12] T396111: Wikimedia\NormalizedException\NormalizedException: Invalid username: {username} - https://phabricator.wikimedia.org/T396111 [16:11:19] (03PS1) 10Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) [16:11:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10891620 (10Andrew) Sorry @Jclark-ctr, I've made a bit of a mess of this. Ideally each of these hosts would have 2x25G connections, each connected to a cl... [16:16:57] (03PS2) 10Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) [16:17:11] (03CR) 10Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [16:17:36] (03CR) 10Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [16:18:35] (03CR) 10Jforrester: [C:03+1] "Generally looks good. One stronger wording suggestion (not sure if we really want to go down the IETF SHOUTY CAPITALS wording model)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154136 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [16:20:13] (03CR) 10Scott French: "Thanks for the improvements!" [puppet] - 10https://gerrit.wikimedia.org/r/1153999 (owner: 10Effie Mouzeli) [16:20:21] (03PS3) 10Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - 10https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) [16:27:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10891640 (10jhathaway) >>! In T394847#10891582, @Jhancock.wm wrote: > @jhathaway are you still testing the 1G link? i got an automated... [16:28:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10891644 (10Andrew) Looks like I'm getting ahead of things a bit. We definitely do need 2 connections per host, but it's unclear on if we're skipping to 25... [16:36:17] (03CR) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [16:38:15] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:39:11] !incidents testing access/permissions [16:39:12] You're not allowed to perform this action. 
[16:42:19] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:43:11] PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2114 MB (3% inode=95%): /tmp 2114 MB (3% inode=95%): /var/tmp 2114 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [16:44:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:50:18] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#10891695 (10Andrew) 05Open→03Stalled p:05Triage→03Medium [16:55:11] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2244 [16:55:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 [16:55:41] ChrisDobbins901_: try now please [16:55:55] !incidents testing access/permissions [16:55:56] team testing not found [16:55:56] could not find the team [16:56:03] just !incidents [16:56:04] !incidents [16:56:05] 6297 (RESOLVED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [16:56:05] 6307 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [16:56:05] 6306 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [16:56:05] 6299 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [16:56:05] 6305 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [16:56:06] 6298 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [16:56:06] 6304 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [16:56:06] 6303 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [16:56:07] 6302 (RESOLVED) 
ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [16:56:07] but yeah, it seems to work [16:56:07] 6301 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [16:56:08] 6300 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [16:56:08] great [16:56:08] 6294 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [16:56:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:57:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [16:57:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:00:14] !log forced agent run on O:alerting_host to reload vopsbot to add cdobbins [17:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:55] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002 [17:06:58] T383811: Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811 [17:08:51] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 
nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002 [17:10:27] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:10:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:16:23] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 20, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending [17:16:23] 41, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3962, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [17:17:37] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch1100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:19] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 2946 threshold =0.2 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1374, active_shards: 1374, relocating_shards: 0, initializing_shards: 0, 
unassigned_shards: 2946, delayed_unas [17:19:19] hards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 31.805555555555554 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:20:21] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002 [17:21:10] T383811: Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811 [17:21:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-magru.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:22:05] !incidents [17:22:05] 6308 (UNACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [17:22:05] 6297 (RESOLVED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [17:22:05] 6307 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:22:06] 6306 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [17:22:06] 6299 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [17:22:06] 6305 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:22:06] 6298 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [17:22:06] 6304 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:22:07] 6303 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:22:07] 6302 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:22:08] 6301 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service 
http_swift-https_ip4 codfw) [17:22:08] 6300 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:22:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:22:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:22:51] !ack 6308 [17:22:51] 6308 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [17:23:02] !incidents [17:23:02] 6308 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [17:23:02] 6309 (UNACKED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [17:23:02] 6297 (RESOLVED) [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [17:23:02] 6307 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:23:03] 6306 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [17:23:03] 6299 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [17:23:03] 6305 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:23:03] 6298 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [17:23:04] 6304 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:23:04] 6303 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:23:05] 6302 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 
swift-https:443 probes/service http_swift-https_ip4 codfw) [17:23:05] 6301 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [17:23:13] !ack 6309 [17:23:19] already acked [17:23:24] race condition detected [17:23:41] I should remember to use sirenbot, sorry [17:27:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:29:09] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] [17:29:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] [17:29:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm [17:30:02] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10891900 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2244.codfw.wmnet with OS bookworm [17:33:52] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch1100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:36:31] FIRING: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:46:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage [17:48:20] (03PS3) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 
10https://gerrit.wikimedia.org/r/1154085 [17:49:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage [17:53:38] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:53:52] FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:46] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:56:18] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1374, active_shards: 4052, relocating_shards: 0, initializing_shards: 29, unassigned_shards: 239, delayed_unassigned_shards: 0, numbe [17:56:18] ding_tasks: 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 610, active_shards_percent_as_number: 93.7962962962963 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:56:31] RESOLVED: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:56:44] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:56:44] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:57:30] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:57:37] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:44] RECOVERY - Squid on install1004 is OK: TCP OK - 7.142 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [18:01:38] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:01:46] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [18:02:34] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:02:34] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:03:12] PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2083 MB (3% inode=95%): /tmp 2083 MB (3% inode=95%): /var/tmp 2083 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [18:06:36] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:06:36] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:06:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:07:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [18:07:34] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:07:34] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:09:30] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:09:58] jhancock@cumin2002 reimage (PID 2610089) is awaiting input [18:11:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-magru.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [18:12:38] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:14:28] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 
2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:16:07] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [18:17:38] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:18:02] ^ did someone look at this? [18:18:10] sukhe: I am [18:18:23] thanks, sorry if it got lost in the noise [18:18:33] barely got shell.. just did though [18:18:36] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:18:36] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [18:22:37] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:22:58] mutante: no mgmt interface for this? [18:23:00] (03PS1) 10CDobbins: throwaway commit; will be reverted [puppet] - 10https://gerrit.wikimedia.org/r/1154336 [18:23:33] sukhe: no..but .. it's because its a VM [18:23:49] ah this is a VM indeed [18:24:01] nothing I can see in the dmesg output at least [18:24:12] it was very busy, and now it's not [18:24:19] no smoking gun yet [18:24:49] DHCP appears to be working. 
(saying that because of https://phabricator.wikimedia.org/T383069) [18:24:52] https://grafana.wikimedia.org/goto/ntcig4YNR?orgId=1 [18:25:34] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [18:25:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10892210 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... [18:28:27] grafana was also having issues, that's a VM as well [18:28:57] I wonder where both of them are on the ganeti cluster [18:29:17] no not related [18:31:38] the simple explanation with the rise in allocstall and swap-outs is that it is running out of memory [18:32:24] ah [18:32:26] it's squid [18:32:33] yea, it's squid! [18:32:37] or ..it was [18:33:52] "url.full": "http://performance-testing-graphite.wmftest.org:8080/render", [18:34:00] other installservers look fine and no recent puppet changes [18:34:03] so it is just install1004 [18:34:33] grep performance-testing /var/log/squid/access.log | wc -l [18:34:33] 84868 [18:35:02] blames wmftest.org ?
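(side note on the grep above: it counts hits for one hostname that was already suspected. A rough sketch of the same check without knowing the hostname in advance is to tally every destination host and look at the top of the list. This assumes Squid's native access.log layout, where the requested URL is the 7th whitespace-separated field; the function names and the sample entries are made up for illustration and are not taken from install1004.)

```python
from collections import Counter
from urllib.parse import urlsplit


def destination_host(url: str) -> str:
    """Return the host part of a logged URL.

    CONNECT requests are logged as a bare "host:port", so a "//" prefix is
    added to let urlsplit parse them as a network location.
    """
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname or "unknown"


def count_hosts(lines):
    """Count requests per destination host.

    Assumption: the URL is the 7th field, as in Squid's native log format.
    A custom logformat directive would need a different index.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        counts[destination_host(fields[6])] += 1
    return counts


# Invented sample entries in the assumed layout (not real log data).
SAMPLE = [
    "1749234840.123 45 10.0.0.1 TCP_MISS/200 512 GET http://graphite.example.org:8080/render - HIER_DIRECT/192.0.2.10 text/plain",
    "1749234841.456 30 10.0.0.1 TCP_MISS/200 498 GET http://graphite.example.org:8080/render - HIER_DIRECT/192.0.2.10 text/plain",
    "1749234842.789 12 10.0.0.2 TCP_TUNNEL/200 3901 CONNECT example.org:443 - HIER_DIRECT/192.0.2.20 -",
]

if __name__ == "__main__":
    # Print the busiest destinations first.
    for host, hits in count_hosts(SAMPLE).most_common(5):
        print(hits, host)
```

(to run it against a real log, pass an open file instead of SAMPLE, e.g. count_hosts(open("/var/log/squid/access.log")); a host far ahead of the rest would be the one to look at first.)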
[18:42:23] (03Abandoned) 10CDobbins: throwaway commit; will be reverted [puppet] - 10https://gerrit.wikimedia.org/r/1154336 (owner: 10CDobbins) [18:48:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:53:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:01:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:01:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm [19:01:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10892311 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2244.codfw.wmnet with OS bookworm completed: - db2244 (**PASS**) - Remov... [19:07:26] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [19:09:18] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1374, active_shards: 4084, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 236, delayed_unassigned_shards: 236, numb [19:09:18] nding_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.53703703703704 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:11:22] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002 [19:11:26] T383811: Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811 [19:14:00] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 311108888 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:17:00] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7631064 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:19:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10892387 (10KFrancis) Happy to help. I'll need the user's full name and email address. If they... 
[19:29:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [19:31:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye [19:31:55] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10892432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2006.codfw.wmnet with OS bu... [19:39:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [19:45:49] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [19:45:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10892448 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... 
[19:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [20:03:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage [20:06:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage [20:15:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye [20:15:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10892478 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2007.codfw.wmnet with OS bu... [20:16:13] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [20:16:36] 06SRE, 10Observability-Alerting, 07SecTeam-Processed, 07Security: Update MediaWikiElevatedUnknownLogins alert recipients - https://phabricator.wikimedia.org/T395117#10892479 (10sbassett) 05Open→03Resolved p:05Triage→03Medium [20:28:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10892493 (10Jhancock.wm) 05Open→03Resolved [20:29:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10892496 (10Jhancock.wm) @Marostegui install of this one is complete [20:35:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[20:38:08] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10892513 (10Eevans) a:03MatthewVernon Ok, earlier today @Jhancock.wm swapped the failed drive for us, and for some reason this caused the machine to spontaneously reboot (hardware... [20:38:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage [20:40:13] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [20:41:55] 06SRE, 10SRE-swift-storage: Consider increasing swift workers on proxy nodes to 32 - https://phabricator.wikimedia.org/T396203#10892531 (10Eevans) I responded to that incident and was quite surprised that we seemed to be "saturated", with what seemed like so much headroom in all the usual dimensions. I'd be +... [20:42:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage [20:43:12] PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2068 MB (3% inode=95%): /tmp 2068 MB (3% inode=95%): /var/tmp 2068 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops [20:45:46] 10ops-codfw, 06SRE, 06SRE-OnFire, 10Cassandra, and 3 others: additional sessionstore expansion — codfw - https://phabricator.wikimedia.org/T395954#10892534 (10Eevans) >>! In T395954#10884228, @Jhancock.wm wrote: > i have 12 x 480GB drives readily available on site >>! In T395954#10887801, @Jhancock.wm w... [20:57:37] FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. 
- https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh [21:02:18] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom [21:05:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10892561 (10Dwisehaupt) Ok. I have done a chunk of testing: * When booting the host with no drive in the path, I can see both interfaces connect and a... [21:15:12] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [21:19:55] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [21:25:35] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [21:33:32] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[21:41:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [21:46:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [22:53:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:12:28] (PS1) Cwhite: logstash: add filter_on_template_v2 [puppet] - https://gerrit.wikimedia.org/r/1154348 (https://phabricator.wikimedia.org/T234565) [23:13:27] (PS1) Cwhite: add dependencies to readme [software/ecs] - https://gerrit.wikimedia.org/r/1154349 [23:14:20] (CR) CI reject: [V:-1] logstash: add filter_on_template_v2 [puppet] - https://gerrit.wikimedia.org/r/1154348 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite) [23:38:48] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1154350 [23:38:48] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1154350 (owner: TrainBranchBot) [23:49:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core]
(wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1154350 (owner: TrainBranchBot) [23:52:37] FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity