[00:03:11] <icinga-wm>	 PROBLEM - Disk space on an-worker1131 is CRITICAL: DISK CRITICAL - free space: / 2102 MB (3% inode=95%): /tmp 2102 MB (3% inode=95%): /var/tmp 2102 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1131&var-datasource=eqiad+prometheus/ops
[00:09:18] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1154145
[00:09:18] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1154145 (owner: TrainBranchBot)
[00:30:06] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1154145 (owner: TrainBranchBot)
[00:45:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:48:19] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:52:37] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_sfacctd.service on netflow7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:57:37] <jinxer-wm>	 FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[01:01:53] <icinga-wm>	 PROBLEM - Disk space on an-worker1154 is CRITICAL: DISK CRITICAL - free space: / 2047 MB (3% inode=95%): /tmp 2047 MB (3% inode=95%): /var/tmp 2047 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1154&var-datasource=eqiad+prometheus/ops
[01:07:37] <jinxer-wm>	 FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[01:20:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[01:25:20] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[01:25:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10889742 (Stevemunene)
[01:25:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10889743 (Stevemunene)
[01:28:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[01:38:20] <jinxer-wm>	 RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[01:43:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[01:48:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[02:03:20] <jinxer-wm>	 RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[02:06:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[02:09:20] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[02:11:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[02:14:20] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[02:16:20] <jinxer-wm>	 RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[02:36:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:36:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:36:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:36:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:36:31] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:36:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:36:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:37:09] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:37:19] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:37:19] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:37:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:37:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:37:43] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:37:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:41:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:42:43] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:42:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:44:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:44:13] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:44:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:44:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:44:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:44:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:44:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:44:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:44:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:45:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:45:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:45:07] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:45:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:46:09] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.755 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:46:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:46:11] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.683 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:46:25] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.473 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:47:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:47:11] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.429 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:47:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:47:37] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:47:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:48:13] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:48:14] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:48:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.290 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:48:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.321 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:49:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:49:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.140 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:49:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:49:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:49:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:50:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:50:13] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.521 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:50:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:51:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.284 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:51:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:51:41] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.535 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:52:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:52:17] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.486 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:52:23] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.034 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:53:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.719 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:53:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:53:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:53:23] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.224 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:54:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:54:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:54:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.558 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:54:11] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.851 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:54:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:54:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.268 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:54:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:55:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:55:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:55:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:55:43] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:56:29] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:57:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:57:11] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:57:11] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[02:57:25] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.559 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:57:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[02:58:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:58:13] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:58:14] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:58:25] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:58:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:58:45] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:59:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:59:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:59:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:59:43] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:59:49] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:59:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:00:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:00:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:00:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:00:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:00:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:01:13] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:01:47] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:01:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:02:09] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[03:03:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:03:35] <urandom>	 ok, that doesn't look promising
[03:04:15] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:04:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:05:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:05:31] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:05:33] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:05:47] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.883 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:05:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:06:05] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:06:09] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.839 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:06:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:06:35] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:06:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.745 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:06:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:07:21] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.355 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:08:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:08:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:08:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:08:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:09:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:09:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:09:31] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:09:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:09:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.065 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:10:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.508 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:10:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.855 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:10:12] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:10:17] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.069 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:10:23] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.999 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:10:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:10:41] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.908 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:10:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.328 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:10:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:11:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:11:41] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.434 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:11:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.625 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:12:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:13:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:13:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:13:11] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:13:33] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.524 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:13:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.500 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:14:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:14:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:14:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:15:12] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:15:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.642 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:15:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:16:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:16:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:16:15] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:16:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:16:31] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:16:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:17:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:17:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.558 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:17:19] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:17:33] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.332 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:17:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:17:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:17:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.550 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:17:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:18:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:18:25] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:18:47] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:18:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.331 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:19:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.705 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:19:11] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:19:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:19:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:20:15] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:20:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:20:31] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:20:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:20:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:21:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:22:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:22:41] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.197 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:22:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:23:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:23:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:23:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.078 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:23:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.101 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:24:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.151 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:24:19] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.868 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:24:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:25:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:25:12] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:25:37] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.568 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:25:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:26:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:26:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:26:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:26:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.539 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:27:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:27:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:27:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:27:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:28:05] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:28:11] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:28:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:28:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:28:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:29:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.222 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:29:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:29:21] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.463 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:29:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:30:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:30:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.024 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:30:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.607 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:30:12] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:30:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:32:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:32:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:32:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:32:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:33:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:33:35] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:33:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.519 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:34:15] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:34:41] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.818 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:34:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:35:12] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:35:31] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:35:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:35:41] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.946 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:35:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:35:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:36:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:36:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.589 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:36:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:36:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:37:13] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:37:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.737 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:37:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:38:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:38:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:38:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:38:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:39:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:39:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:39:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:39:17] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.004 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:39:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:39:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:39:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:40:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:40:45] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.971 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:42:31] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:43:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:43:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:44:15] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:44:33] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.431 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:44:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2009.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:45:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.735 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:45:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:45:33] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:45:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:45:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:46:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:46:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:46:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.267 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:47:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:47:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:47:17] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.248 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:47:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:47:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:47:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.122 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:47:53] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.798 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:47:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.913 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:48:15] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:48:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:48:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:48:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:49:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.893 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:49:23] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.612 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:50:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:51:17] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.937 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:51:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:51:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.392 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:51:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:51:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:51:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:52:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:52:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:52:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:52:37] <jinxer-wm>	 FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[03:52:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.687 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:52:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:52:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:52:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:53:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:53:13] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.853 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:53:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:53:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:54:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:54:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.855 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:54:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:54:43] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[03:54:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:54:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.631 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:54:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:54:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:55:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:55:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:55:45] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[03:55:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.970 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:56:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:56:39] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 5.937 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:57:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:57:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:57:23] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:58:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:58:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:59:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.234 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:59:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.164 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:59:19] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.762 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:00:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:00:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:01:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:01:05] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:01:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:01:21] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.984 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:01:31] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:01:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:01:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:01:53] <icinga-wm>	 PROBLEM - Disk space on an-worker1154 is CRITICAL: DISK CRITICAL - free space: / 2077 MB (3% inode=95%): /tmp 2077 MB (3% inode=95%): /var/tmp 2077 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1154&var-datasource=eqiad+prometheus/ops
[04:01:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:01:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:01:57] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.807 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:02:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:02:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:02:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:03:13] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.969 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:03:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:03:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.879 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:03:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:04:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.787 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:05:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:05:21] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:05:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:06:07] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.451 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:06:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:06:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.518 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:06:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:06:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:07:13] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:07:13] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.877 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:07:23] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 6.524 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:07:37] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:08:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:08:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.120 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:08:51] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:09:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[04:10:45] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[04:10:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:11:47] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:12:37] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:14:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:14:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:14:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:15:31] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.241 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:15:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:16:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.806 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:17:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:17:37] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:17:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:17:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[04:18:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:18:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:22:23] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:23:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:23:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:23:51] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:23:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:24:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:25:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:25:13] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:25:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.552 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:25:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.211 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:26:11] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:30:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:32:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:32:45] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[04:32:51] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[04:33:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:33:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[04:33:52] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:35:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:36:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:36:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:37:37] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:42:07] <swfrench-wmf>	 !includents
[04:42:10] <swfrench-wmf>	 !incident
[04:42:22] * swfrench-wmf should just stop trying to computer
[04:42:25] <swfrench-wmf>	 !incidents
[04:42:26] <sirenbot>	 6297 (ACKED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[04:42:26] <sirenbot>	 6307 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[04:42:26] <sirenbot>	 6306 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[04:42:27] <sirenbot>	 6299 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[04:42:27] <sirenbot>	 6305 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[04:42:27] <sirenbot>	 6298 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[04:42:27] <sirenbot>	 6304 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[04:42:28] <sirenbot>	 6303 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[04:42:28] <sirenbot>	 6302 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[04:42:29] <sirenbot>	 6301 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[04:42:29] <sirenbot>	 6300 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[04:42:30] <sirenbot>	 6294 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[04:42:30] <sirenbot>	 6296 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[04:42:31] <sirenbot>	 6295 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[04:42:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[04:42:58] <urandom>	 that was the last one
[04:43:13] <_joe_>	 urandom: don't jinx it
[04:52:37] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_sfacctd.service on netflow7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:57:37] <jinxer-wm>	 FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[05:00:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:03:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:06:25] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10889883 (10Pppery)
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:37] <jinxer-wm>	 FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[05:08:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:11:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:32:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:35:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250606T0600)
[06:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:13:07] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device asw1-b3-magru
[06:13:08] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
[06:14:54] <logmsgbot>	 !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
[06:15:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10889915 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a3b2e616-b8d3-4368-a852-410542d355ee) set b...
[06:15:40] <logmsgbot>	 !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
[06:15:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10889916 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0743c7f4-5226-4a34-8041-3f588f10811c) set b...
[06:27:24] <wikibugs>	 (03PS1) 10Fabfur: hiera: x-provenance header on all DCs [puppet] - 10https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217)
[06:28:02] <wikibugs>	 (03CR) 10Fabfur: [C:04-1] "Merge today if needed" [puppet] - 10https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur)
[06:28:26] <wikibugs>	 (03CR) 10Fabfur: [C:04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154157 (https://phabricator.wikimedia.org/T392217) (owner: 10Fabfur)
[06:48:15] <icinga-wm>	 RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 1 AdminDown: 4 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:49:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove netflow7001 [homer/public] - 10https://gerrit.wikimedia.org/r/1154161 (https://phabricator.wikimedia.org/T394263)
[06:52:33] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Remove netflow7001 [homer/public] - 10https://gerrit.wikimedia.org/r/1154161 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff)
[06:55:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove netflow7001 [homer/public] - 10https://gerrit.wikimedia.org/r/1154161 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff)
[06:57:37] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_sfacctd.service on netflow7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250606T0700)
[07:01:54] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T396190 (10Rosalie_WMDE) 03NEW
[07:02:26] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10889993 (10Rosalie_WMDE)
[07:02:38] <wikibugs>	 (03PS1) 10Majavah: team-wmcs: Adapt HAProxy alerts for x3 on the replicas [alerts] - 10https://gerrit.wikimedia.org/r/1154163 (https://phabricator.wikimedia.org/T390954)
[07:08:52] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10889997 (10SLyngshede-WMF)
[07:13:39] <icinga-wm>	 PROBLEM - MariaDB disk space on db2151 is CRITICAL: DISK CRITICAL - free space: / 2104MiB (5% inode=97%): /tmp 2104MiB (5% inode=97%): /var/tmp 2104MiB (5% inode=97%): https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:17:45] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10890001 (10SLyngshede-WMF) @KFrancis I believe this requires a signed NDA.   @WMDECyn we also n...
[07:22:59] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:25:20] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Add user deerbee to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129)
[07:26:07] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for "Dena WMDE" - https://phabricator.wikimedia.org/T396129#10890004 (10SLyngshede-WMF) p:05Triage→03Medium a:03SLyngshede-WMF
[07:26:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data.yaml: Add user deerbee to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129) (owner: 10Slyngshede)
[07:26:18] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890006 (10SLyngshede-WMF) p:05Triage→03Medium a:03SLyngshede-WMF
[07:27:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:29:09] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10890009 (10MatthewVernon) This should be resolved now.
[07:33:14] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890010 (10SLyngshede-WMF) User already has NDA and WMDE LDAP groups.
[07:33:17] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Add user rosalie-wmde to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190)
[07:36:53] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3606 MB (3% inode=98%): /tmp 3606 MB (3% inode=98%): /var/tmp 3606 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[07:37:06] <wikibugs>	 (03PS2) 10Slyngshede: data.yaml: Add user deerbee to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129)
[07:38:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good (once manager approval happened)" [puppet] - 10https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190) (owner: 10Slyngshede)
[07:39:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129) (owner: 10Slyngshede)
[07:40:35] <icinga-wm>	 PROBLEM - Disk space on db2151 is CRITICAL: DISK CRITICAL - free space: / 1210MiB (3% inode=97%): /tmp 1210MiB (3% inode=97%): /var/tmp 1210MiB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=db2151&var-datasource=codfw+prometheus/ops
[07:41:15] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] data.yaml: Add user deerbee to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154166 (https://phabricator.wikimedia.org/T396129) (owner: 10Slyngshede)
[07:42:37] <jinxer-wm>	 FIRING: ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:42:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] "Merged; iptables -L remained the same" [puppet] - 10https://gerrit.wikimedia.org/r/1153970 (owner: 10Muehlenhoff)
[07:43:52] <jinxer-wm>	 RESOLVED: ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:44:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:44:47] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890032 (10SLyngshede-WMF) @WMDE-leszek or @WMDECyn we just need an approval from a WMDE manager.
[07:52:18] <logmsgbot>	 !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: T394543
[07:52:21] <stashbot>	 T394543: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543
[07:52:34] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890040 (10WMDE-leszek) Thanks @SLyngshede-WMF - approved on WMDE's behalf.
[07:52:37] <jinxer-wm>	 FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[07:52:40] <wikibugs>	 (03PS2) 10Slyngshede: data.yaml: Add user rosalie-wmde to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190)
[07:53:32] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10890042 (10WMDE-leszek) Hey @SLyngshede-WMF - I have approved for WMDE above T395917#10880432
[07:53:46] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for "Dena WMDE" - https://phabricator.wikimedia.org/T396129#10890043 (10SLyngshede-WMF) 05Open→03Resolved
[07:54:51] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10890044 (10SLyngshede-WMF) @WMDE-leszek Sorry, I completely missed that. Thank you.
[07:59:47] <wikibugs>	 (03PS2) 10Muehlenhoff: New structure for sshd_config starting with trixie (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762)
[08:01:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] New structure for sshd_config starting with trixie (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[08:02:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190) (owner: 10Slyngshede)
[08:02:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:03:06] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] data.yaml: Add user rosalie-wmde to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154167 (https://phabricator.wikimedia.org/T396190) (owner: 10Slyngshede)
[08:04:04] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, and 2 others: Grant Access to analytics-privatedata-user for Rosalie_WMDE - https://phabricator.wikimedia.org/T396190#10890051 (10SLyngshede-WMF) 05Open→03Resolved
[08:05:26] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10890052 (10SLyngshede-WMF)
[08:07:22] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10890053 (10SLyngshede-WMF) p:05Triage→03Medium a:05DMburugu→03SLyngshede-WMF
[08:08:41] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Add user kgraessle to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370)
[08:10:38] <logmsgbot>	 !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
[08:10:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good (once approval by Tyler happened)" [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370) (owner: 10Slyngshede)
[08:11:09] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10890059 (10SLyngshede-WMF)
[08:11:20] <logmsgbot>	 !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
[08:17:37] <jinxer-wm>	 RESOLVED: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[08:18:30] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10890066 (10SLyngshede-WMF)
[08:19:39] <wikibugs>	 (03PS3) 10Volans: sre.hardware.upgrade-firmware: add support for SSD [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543)
[08:22:06] <logmsgbot>	 !log volans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
[08:22:07] <logmsgbot>	 !log volans@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
[08:23:10] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 63199
[08:23:56] <logmsgbot>	 !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
[08:24:11] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199
[08:24:15] <logmsgbot>	 !log volans@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
[08:25:09] <wikibugs>	 (03CR) 10Volans: [C:03+2] "Tested on `cirrussearch2113`, merging to unblock upgrade, I'll be happy to apply changes post-merge." [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) (owner: 10Volans)
[08:25:38] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Add user guilherme to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154235 (https://phabricator.wikimedia.org/T395428)
[08:25:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-ctrl_6443: Servers ml-serve-ctrl1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:25:58] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10890078 (10Volans) Thanks @RKemper for the depool, I've performed the final run with the current PS in gerrit with test-cookbook for `ci...
[08:25:58] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 199524
[08:26:39] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:27:08] <wikibugs>	 (03CR) 10Slyngshede: "Tyler created the task, I took that as implicit approval :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370) (owner: 10Slyngshede)
[08:28:22] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524
[08:28:23] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10890082 (10SLyngshede-WMF) SSH key verified via Slack.
[08:29:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for santhosh - https://phabricator.wikimedia.org/T394740#10890085 (10SLyngshede-WMF) p:05Triage→03Medium @Arrbee Ping
[08:30:05] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 13150
[08:30:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:30:41] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150
[08:31:13] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: add support for SSD [cookbooks] - 10https://gerrit.wikimedia.org/r/1150728 (https://phabricator.wikimedia.org/T394543) (owner: 10Volans)
[08:32:23] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 46562
[08:32:24] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562
[08:33:31] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 394065
[08:33:52] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065
[08:34:43] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 46997
[08:35:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997
[08:36:50] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 398044
[08:37:05] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044
[08:42:42] <wikibugs>	 (03PS1) 10Federico Ceratto: mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427)
[08:42:42] <wikibugs>	 (03CR) 10Federico Ceratto: "A little utility function that adds graceful failure modes for Phabricator tasks updates and simplifies the cookbook logic." [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) (owner: 10Federico Ceratto)
[08:44:25] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10890113 (10TheDJ) >>! In T396186#10890009, @MatthewVernon wrote: > This should be resolved now.  Because you fixed it, or due to auto reconciliation ?
[08:46:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) (owner: 10Federico Ceratto)
[08:46:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Fair enough :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370) (owner: 10Slyngshede)
[08:47:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1154235 (https://phabricator.wikimedia.org/T395428) (owner: 10Slyngshede)
[08:56:38] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10890120 (10Volans) Forgot to mention, this is what I used to upgrade just the SSD firmware:  `cookbook sre.hardware.upgrade-firmware -c...
[08:57:25] <wikibugs>	 (03PS2) 10Federico Ceratto: mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427)
[08:57:37] <jinxer-wm>	 FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[09:00:22] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] data.yaml: Add user guilherme to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1154235 (https://phabricator.wikimedia.org/T395428) (owner: 10Slyngshede)
[09:03:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10890144 (10SLyngshede-WMF) 05In progress→03Resolved
[09:03:11] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] data.yaml: Add user kgraessle to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1154229 (https://phabricator.wikimedia.org/T395370) (owner: 10Slyngshede)
[09:03:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) (owner: 10Federico Ceratto)
[09:04:24] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10890148 (10MatthewVernon) My colleagues in SRE applied some filtering to problematic traffic.
[09:04:56] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10890149 (10SLyngshede-WMF) 05In progress→03Resolved
[09:04:56] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:05:27] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:06:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10890151 (10SLyngshede-WMF) a:05SLyngshede-WMF→03cmelo
[09:09:48] <wikibugs>	 10SRE-swift-storage, 06Commons: Commons: fault in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T396186#10890177 (10TheDJ) >>! In T396186#10890148, @MatthewVernon wrote: > My colleagues in SRE applied some filtering to problematic traffic.  My goodness. It's a shame that simply keepin...
[09:14:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:16:53] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3438 MB (3% inode=98%): /tmp 3438 MB (3% inode=98%): /var/tmp 3438 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[09:25:03] <wikibugs>	 (03PS3) 10Muehlenhoff: New structure for sshd_config starting with trixie (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762)
[09:28:05] <wikibugs>	 (03PS3) 10Federico Ceratto: mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427)
[09:29:29] <wikibugs>	 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10890244 (10SLyngshede-WMF) 05Open→03Resolved User can reopen if access is not working.
[09:34:11] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "good catch! I wonder if we should just go with [a-z] instead of [sx] to capture possible future sections? I'll let you decide what you thi" [alerts] - 10https://gerrit.wikimedia.org/r/1154163 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah)
[09:39:33] <federico3>	 !incidents
[09:39:33] <sirenbot>	 You're not allowed to perform this action.
[09:40:05] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up
[09:40:23] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up
[09:48:14] <wikibugs>	 (03CR) 10Majavah: [C:03+2] team-wmcs: Adapt HAProxy alerts for x3 on the replicas [alerts] - 10https://gerrit.wikimedia.org/r/1154163 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah)
[09:49:29] <wikibugs>	 (03Merged) 10jenkins-bot: team-wmcs: Adapt HAProxy alerts for x3 on the replicas [alerts] - 10https://gerrit.wikimedia.org/r/1154163 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah)
[10:00:41] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus::pop: deploy instances according to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130)
[10:01:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] prometheus::pop: deploy instances according to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[10:02:29] <wikibugs>	 (03PS2) 10Tiziano Fogli: prometheus::pop: deploy instances according to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130)
[10:03:03] <wikibugs>	 (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[10:05:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:07:31] <wikibugs>	 (03PS3) 10Tiziano Fogli: prometheus::pop: temporarily exclude ops instance from prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130)
[10:08:20] <wikibugs>	 (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[10:09:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] prometheus::pop: temporarily exclude ops instance from prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[10:10:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:17:06] <wikibugs>	 (03PS1) 10Slyngshede: Docker: Add missing bs4 package [software/bitu] - 10https://gerrit.wikimedia.org/r/1154260 (https://phabricator.wikimedia.org/T396103)
[10:18:10] <wikibugs>	 (03PS1) 10Muehlenhoff: ssh: Stop managing /run/sshd with Trixie and later [puppet] - 10https://gerrit.wikimedia.org/r/1154261
[10:18:25] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Docker: Add missing bs4 package [software/bitu] - 10https://gerrit.wikimedia.org/r/1154260 (https://phabricator.wikimedia.org/T396103) (owner: 10Slyngshede)
[10:18:27] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] Docker: Add missing bs4 package [software/bitu] - 10https://gerrit.wikimedia.org/r/1154260 (https://phabricator.wikimedia.org/T396103) (owner: 10Slyngshede)
[10:18:43] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10890398 (10MatthewVernon) @Jhancock.wm I've had a look at the web-iDRAC, and the serial number I found above (9120A025F1QF) does correspond to the device in slot 14, which the web...
[10:21:32] <wikibugs>	 (03Merged) 10jenkins-bot: Docker: Add missing bs4 package [software/bitu] - 10https://gerrit.wikimedia.org/r/1154260 (https://phabricator.wikimedia.org/T396103) (owner: 10Slyngshede)
[10:21:37] <wikibugs>	 (03PS1) 10Slyngshede: P:idm Enable API [puppet] - 10https://gerrit.wikimedia.org/r/1154262 (https://phabricator.wikimedia.org/T364605)
[10:21:58] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154261 (owner: 10Muehlenhoff)
[10:22:38] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5786/co" [puppet] - 10https://gerrit.wikimedia.org/r/1154262 (https://phabricator.wikimedia.org/T364605) (owner: 10Slyngshede)
[10:29:02] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus7002: assign replica_label [puppet] - 10https://gerrit.wikimedia.org/r/1154046 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[10:29:47] <wikibugs>	 (03PS4) 10Tiziano Fogli: prometheus::pop: temporarily exclude ops instance from prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130)
[10:29:53] <wikibugs>	 (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154046 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[10:31:29] <wikibugs>	 (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[10:36:02] <wikibugs>	 (03PS5) 10Btullis: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378)
[10:38:52] <georgekyz>	 Hey folks, I am going to run a maintenance job on the deployment server for afwiki
[10:39:06] <georgekyz>	 `extensions/ORES/maintenance/PopulateDatabase.php`
[10:44:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] ssh: Stop managing /run/sshd with Trixie and later [puppet] - 10https://gerrit.wikimedia.org/r/1154261 (owner: 10Muehlenhoff)
[10:45:53] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus::pop: temporarily exclude ops instance from prometheus7002 [puppet] - 10https://gerrit.wikimedia.org/r/1154255 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[10:50:37] <wikibugs>	 06SRE, 10SRE-swift-storage: Consider increasing swift workers on proxy nodes to 32 - https://phabricator.wikimedia.org/T396203 (10MatthewVernon) 03NEW
[10:56:53] <icinga-wm>	 RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[10:59:12] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Also add replica label for the new upcoming prometheus7002 node [puppet] - 10https://gerrit.wikimedia.org/r/1153126 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff)
[11:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250606T0700)
[11:00:04] <jouncebot>	 jelto, arnoldokoth, and mutante: Time to do the GitLab version upgrades deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250606T1100).
[11:04:44] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] admin_ng: Split envoyfilters installation into a separate release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153979 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm)
[11:04:46] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] admin_ng: Fix dependencies/needs of helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153982 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm)
[11:07:48] <wikibugs>	 (03PS1) 10Vgutierrez: liberica: Don't bind VIPs to lo interface with katran [puppet] - 10https://gerrit.wikimedia.org/r/1154268 (https://phabricator.wikimedia.org/T395228)
[11:11:19] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Split envoyfilters installation into a separate release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153979 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm)
[11:11:37] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: per-route statistics option, enable for lists and wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154269
[11:11:38] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Fix dependencies/needs of helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153982 (https://phabricator.wikimedia.org/T389080) (owner: 10JMeybohm)
[11:12:03] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154268 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez)
[11:12:42] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:12:57] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:16:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:17:07] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:17:56] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:18:33] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:18:53] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:21:08] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[11:21:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:22:25] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[11:25:54] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[11:26:39] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[11:27:17] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10890602 (10Ladsgroup)
[11:27:37] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[11:27:41] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:28:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[11:28:24] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:28:28] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:28:47] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:28:51] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[11:30:52] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[11:30:56] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[11:31:36] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[11:31:41] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:32:05] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:32:09] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:32:23] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:32:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] rest-gateway: per-route statistics option, enable for lists and wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154269 (owner: 10Hnowlan)
[11:33:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[11:36:19] <wikibugs>	 (03PS3) 10Hnowlan: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber)
[11:37:09] <wikibugs>	 (03PS4) 10Hnowlan: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber)
[11:37:51] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] rest-gateway: per-route statistics option, enable for lists and wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154269 (owner: 10Hnowlan)
[11:38:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[11:39:38] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: per-route statistics option, enable for lists and wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154269 (owner: 10Hnowlan)
[11:42:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:42:19] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:44:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber)
[11:48:20] <jinxer-wm>	 RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[11:49:20] <wikibugs>	 (03PS1) 10Marostegui: ms3: Migrate to parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/1154276
[11:51:01] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] ms3: Migrate to parsercache role [puppet] - 10https://gerrit.wikimedia.org/r/1154276 (owner: 10Marostegui)
[11:51:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:52:20] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications
[11:52:37] <jinxer-wm>	 FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[11:58:41] <icinga-wm>	 RECOVERY - MariaDB disk space on db2151 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:00:35] <icinga-wm>	 RECOVERY - Disk space on db2151 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=db2151&var-datasource=codfw+prometheus/ops
[12:03:35] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[12:03:43] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[12:04:57] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1154268 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez)
[12:19:58] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet
[12:20:01] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002
[12:20:19] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002
[12:22:46] <wikibugs>	 (03PS1) 10Sergio Gimeno: [beta] GrowthExperiments: enable limiting add a link task via config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393923)
[12:23:52] <wikibugs>	 (03PS2) 10Sergio Gimeno: [beta] GrowthExperiments: enable limiting add a link task via config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769)
[12:24:40] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026#10890719 (10fgiunchedi) a:05fgiunchedi→03None
[12:25:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10890720 (10MoritzMuehlenhoff)
[12:30:53] <wikibugs>	 (03PS16) 10Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530)
[12:31:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - 10https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[12:31:01] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144484 (https://phabricator.wikimedia.org/T393872) (owner: 10SD0001)
[12:46:17] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] liberica: Don't bind VIPs to lo interface with katran [puppet] - 10https://gerrit.wikimedia.org/r/1154268 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez)
[12:50:19] <wikibugs>	 (03PS1) 10Tiziano Fogli: prometheus::pop: enable rsyncd on magru [puppet] - 10https://gerrit.wikimedia.org/r/1154284 (https://phabricator.wikimedia.org/T395130)
[12:57:37] <jinxer-wm>	 FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[12:58:24] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns4003*} and (A:dnsbox)
[12:58:24] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org
[12:58:55] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns5003*} and (A:dnsbox)
[12:58:55] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org
[12:59:51] <logmsgbot>	 fceratto@cumin1002 clone (PID 2148058) is awaiting input
[13:04:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:04:31] <sukhe>	 ^ expected, dns5003
[13:05:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1154284 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[13:05:34] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prometheus::pop: enable rsyncd on magru [puppet] - 10https://gerrit.wikimedia.org/r/1154284 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[13:06:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Maybe add a tick box to https://phabricator.wikimedia.org/T395130 to remove it at the end of the migration, it's quite easy to miss" [puppet] - 10https://gerrit.wikimedia.org/r/1154284 (https://phabricator.wikimedia.org/T395130) (owner: 10Tiziano Fogli)
[13:06:38] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[13:09:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:09:46] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org
[13:09:46] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns4003*} and (A:dnsbox)
[13:11:12] <wikibugs>	 (03CR) 10Ayounsi: ASW Templates: modify Jinja templates step 1 (try 2) (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[13:12:10] <logmsgbot>	 cmooney@cumin1003 netbox (PID 450379) is awaiting input
[13:13:01] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org
[13:13:01] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns5003*} and (A:dnsbox)
[13:13:40] <wikibugs>	 10SRE-SLO: Add a section to the SLO template that explains Pyrra's dashboards and alerts - https://phabricator.wikimedia.org/T395920#10890859 (10akosiaris) Thanks for this! I 've also landed a round of updates today in https://wikitech.wikimedia.org/w/index.php?title=SLO/Template_instructions/Dashboards_and_aler...
[13:14:10] <wikibugs>	 06SRE, 06Traffic: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#10890860 (10ssingh) a:03CDobbins
[13:14:39] <wikibugs>	 (03PS1) 10Vgutierrez: Revert^4 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154286
[13:15:10] <wikibugs>	 (03PS2) 10Vgutierrez: Revert^4 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154286 (https://phabricator.wikimedia.org/T395228)
[13:15:16] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154286 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez)
[13:16:43] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Revert^4 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154286 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez)
[13:18:34] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns3003*} and (A:dnsbox)
[13:18:34] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org
[13:18:47] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns6001*} and (A:dnsbox)
[13:18:47] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org
[13:21:40] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003"
[13:21:44] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003"
[13:21:44] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:22:49] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:22:49] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:23:04] <sukhe>	 ^ expected
[13:25:49] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:27:49] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:29:31] <wikibugs>	 (03CR) 10Ssingh: "Looks good. I have two questions in-line before we add others for review to see if it is something in the data itself (and our calculation" [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins)
[13:30:56] <wikibugs>	 (03PS4) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530)
[13:31:04] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org
[13:31:04] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns6001*} and (A:dnsbox)
[13:31:27] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] Revert^4 "hiera: Depool lvs1013 before switching to katran" [puppet] - 10https://gerrit.wikimedia.org/r/1154286 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez)
[13:32:04] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org
[13:32:04] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns3003*} and (A:dnsbox)
[13:32:45] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks for the review, should be ok in latest patchset and no-op on switches (tested different types - https://phabricator.wikimedia.org/P" [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[13:34:57] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns1006*} and (A:dnsbox)
[13:34:57] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org
[13:35:02] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns2006*} and (A:dnsbox)
[13:35:02] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org
[13:36:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10890941 (10MoritzMuehlenhoff)
[13:36:43] <icinga-wm>	 PROBLEM - Host ntp-c.anycast.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[13:36:51] <sukhe>	 ^ hmm
[13:36:52] <sukhe>	 ok
[13:37:10] <sukhe>	 we have redundancy here, so nothing to worry about: ntp-[ab]
[13:37:24] <sukhe>	 should be back up soon
[13:37:30] <jinxer-wm>	 FIRING: LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[13:38:28] <wikibugs>	 (03PS5) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530)
[13:39:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:39:17] <sukhe>	 ^ expected
[13:40:24] <logmsgbot>	 !log vgutierrez@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
[13:40:33] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and A:liberica
[13:40:38] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica
[13:41:36] <wikibugs>	 (03PS1) 10Vgutierrez: Revert^4 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154294
[13:41:48] <wikibugs>	 (03PS3) 10JMeybohm: Use Wikimedia DNS IPs as mock [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153975 (https://phabricator.wikimedia.org/T396107)
[13:41:50] <wikibugs>	 (03PS3) 10JMeybohm: calico: Add support to manage CNI installation by daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153976 (https://phabricator.wikimedia.org/T396107)
[13:41:53] <wikibugs>	 (03PS3) 10JMeybohm: coredns: Run coredns on an unprivileged port (5353) instead of 53 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153977 (https://phabricator.wikimedia.org/T396107)
[13:41:55] <wikibugs>	 (03PS3) 10JMeybohm: cfssl-issuer: Allow to provide a custom CA certificate store [deployment-charts] - 10https://gerrit.wikimedia.org/r/1153978 (https://phabricator.wikimedia.org/T396107)
[13:42:14] <wikibugs>	 (03PS2) 10Vgutierrez: Revert^4 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154294 (https://phabricator.wikimedia.org/T395228)
[13:42:20] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154294 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez)
[13:42:30] <jinxer-wm>	 RESOLVED: LibericaStaleConfig: Liberica instance lvs1013 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=eqiad&var-instance=lvs1013 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[13:42:31] <wikibugs>	 (03PS6) 10Cathal Mooney: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530)
[13:44:08] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[13:44:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:46:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274)
[13:46:44] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Revert^4 "hiera: Use katran in lvs1013" [puppet] - 10https://gerrit.wikimedia.org/r/1154294 (https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez)
[13:47:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[13:48:04] <wikibugs>	 (03Merged) 10jenkins-bot: ASW Templates: modify Jinja templates step 1 (try 2) [homer/public] - 10https://gerrit.wikimedia.org/r/1153615 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[13:49:03] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org
[13:49:03] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns1006*} and (A:dnsbox)
[13:49:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:49:51] <icinga-wm>	 RECOVERY - Host ntp-c.anycast.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[13:50:01] <wikibugs>	 (03PS2) 10Muehlenhoff: Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274)
[13:50:58] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff)
[13:51:12] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org
[13:51:13] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns2006*} and (A:dnsbox)
[13:52:14] <wikibugs>	 (03PS1) 10Hnowlan: (api|rest)-gateway: remove envoyproxy annotation, scrape all ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154298
[13:55:03] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Add puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1154296 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff)
[13:57:21] <wikibugs>	 (03PS5) 10Hnowlan: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber)
[13:58:56] <wikibugs>	 (03CR) 10A smart kitten: "Judging by T388531#10887779, it might be okay to go ahead with this patch now?" [puppet] - 10https://gerrit.wikimedia.org/r/1150624 (https://phabricator.wikimedia.org/T388531) (owner: 10Hnowlan)
[14:09:36] <wikibugs>	 (03PS1) 10Ebernhardson: EventStream: Enable hive ingeestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300
[14:10:07] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns2004*} and (A:dnsbox)
[14:10:07] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org
[14:10:15] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on P{dns1005*} and (A:dnsbox)
[14:10:15] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org
[14:12:47] <wikibugs>	 (CR) Hnowlan: imagemagick: ignore all py3exiv2 exceptions (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: AntiCompositeNumber)
[14:14:54] <wikibugs>	 (PS1) Alexandros Kosiaris: registry: Minor Puppet cleanups [puppet] - https://gerrit.wikimedia.org/r/1154301 (https://phabricator.wikimedia.org/T390251)
[14:14:56] <wikibugs>	 (PS1) Alexandros Kosiaris: docker_registry_ha: Refactor to make it docker_registry [puppet] - https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251)
[14:15:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:15:38] <wikibugs>	 (CR) CI reject: [V:-1] docker_registry_ha: Refactor to make it docker_registry [puppet] - https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) (owner: Alexandros Kosiaris)
[14:18:04] <wikibugs>	 (PS2) Hnowlan: (api|rest)-gateway: remove envoyproxy annotation, scrape all ports [deployment-charts] - https://gerrit.wikimedia.org/r/1154298
[14:20:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:20:36] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] registry: Minor Puppet cleanups [puppet] - https://gerrit.wikimedia.org/r/1154301 (https://phabricator.wikimedia.org/T390251) (owner: Alexandros Kosiaris)
[14:22:57] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org
[14:22:57] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns1005*} and (A:dnsbox)
[14:23:36] <logmsgbot>	 !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org
[14:23:36] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P{dns2004*} and (A:dnsbox)
[14:23:44] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:24:23] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:25:05] <wikibugs>	 (CR) Vgutierrez: [C:+2] Revert^4 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154294 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez)
[14:26:42] <wikibugs>	 (PS1) Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - https://gerrit.wikimedia.org/r/1154303
[14:27:06] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - https://gerrit.wikimedia.org/r/1154303 (owner: Ayounsi)
[14:28:20] <wikibugs>	 (PS2) Alexandros Kosiaris: docker_registry_ha: Refactor to make it docker_registry [puppet] - https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251)
[14:28:33] <wikibugs>	 (PS2) Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - https://gerrit.wikimedia.org/r/1154303
[14:29:01] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - https://gerrit.wikimedia.org/r/1154303 (owner: Ayounsi)
[14:29:39] <wikibugs>	 (PS3) Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - https://gerrit.wikimedia.org/r/1154303
[14:30:02] <wikibugs>	 (CR) CI reject: [V:-1] [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - https://gerrit.wikimedia.org/r/1154303 (owner: Ayounsi)
[14:31:17] <icinga-wm>	 PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100%
[14:31:40] <wikibugs>	 (CR) Jforrester: [C:+1] "Let's try it out!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1154128 (https://phabricator.wikimedia.org/T290759) (owner: Arlolra)
[14:32:24] <wikibugs>	 (PS4) Ayounsi: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - https://gerrit.wikimedia.org/r/1154303
[14:34:42] <wikibugs>	 (Abandoned) Cathal Mooney: BGP: Adjust switch IBGP templates to support evpn and unicast ibgp [homer/public] - https://gerrit.wikimedia.org/r/1152272 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[14:37:35] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Create plwiki Oversight list - https://phabricator.wikimedia.org/T396083#10891198 (Ladsgroup) Open→Resolved I created it and added all current oversights as admins (looking up their emails in the database, just in case). https://lists.wikimedia.org/postorius/list...
[14:42:56] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:53:10] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:53:58] <wikibugs>	 (CR) CDobbins: add rest of South America (except Falkland Islands) to geo-maps (2 comments) [dns] - https://gerrit.wikimedia.org/r/1153334 (owner: CDobbins)
[14:56:16] <wikibugs>	 (CR) CDobbins: add rest of South America (except Falkland Islands) to geo-maps (1 comment) [dns] - https://gerrit.wikimedia.org/r/1153334 (owner: CDobbins)
[15:02:58] <wikibugs>	 (PS1) Vgutierrez: Revert^5 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154307
[15:03:50] <wikibugs>	 (PS2) Vgutierrez: Revert^5 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154307 (https://phabricator.wikimedia.org/T395228)
[15:03:59] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1154307 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez)
[15:04:43] <wikibugs>	 (PS1) Tchanders: Document that IP reveal permissions can't just be reassigned [mediawiki-config] - https://gerrit.wikimedia.org/r/1154308 (https://phabricator.wikimedia.org/T396217)
[15:05:40] <wikibugs>	 (CR) Vgutierrez: [C:+2] Revert^5 "hiera: Use katran in lvs1013" [puppet] - https://gerrit.wikimedia.org/r/1154307 (https://phabricator.wikimedia.org/T395228) (owner: Vgutierrez)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:24] <wikibugs>	 (CR) JMeybohm: [C:+1] (api|rest)-gateway: remove envoyproxy annotation, scrape all ports [deployment-charts] - https://gerrit.wikimedia.org/r/1154298 (owner: Hnowlan)
[15:11:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:52] <vgutierrez>	 ^^that's me
[15:16:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:19:40] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
[15:19:41] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
[15:19:51] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1154282 (https://phabricator.wikimedia.org/T393769) (owner: Sergio Gimeno)
[15:20:11] <wikibugs>	 (PS1) Vgutierrez: Revert^5 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154311
[15:21:07] <wikibugs>	 (CR) Ssingh: [C:+1] Revert^5 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154311 (owner: Vgutierrez)
[15:21:21] <wikibugs>	 (CR) Vgutierrez: [C:+2] Revert^5 "hiera: Depool lvs1013 before switching to katran" [puppet] - https://gerrit.wikimedia.org/r/1154311 (owner: Vgutierrez)
[15:21:36] <wikibugs>	 (CR) Ssingh: "Adding cdanis for their input as well on the general methodology for their awareness and feedback if any. Otherwise my recommendation is t" [dns] - https://gerrit.wikimedia.org/r/1153334 (owner: CDobbins)
[15:23:11] <icinga-wm>	 PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2117 MB (3% inode=95%): /tmp 2117 MB (3% inode=95%): /var/tmp 2117 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops
[15:23:55] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs1013.eqiad.wmnet} and A:liberica
[15:24:11] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs1013.eqiad.wmnet} and A:liberica
[15:24:25] <wikibugs>	 (PS4) Ssingh: sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - https://gerrit.wikimedia.org/r/1152781
[15:24:33] <wikibugs>	 (CR) Ssingh: "Thinking about this a bit more and based on some conversations here on this CR and on IRC: I am going to merge this one and we can come ba" [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[15:26:18] <wikibugs>	 ops-codfw, SRE, DC-Ops: Power Supply - PS1 Status - issue on wikikube-worker2324:9290 - https://phabricator.wikimedia.org/T396067#10891404 (Jhancock.wm) Open→Resolved a:Jhancock.wm reseated power cable on both ends. alert cleared
[15:26:39] <icinga-wm>	 RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms
[15:26:42] <wikibugs>	 (CR) Ssingh: "For further clarity: one of the reasons _against_ merging the two is the different purpose they serve. And at least in the Traffic realm, " [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[15:27:49] <wikibugs>	 (CR) Aleksandar Mastilovic: "MR that introduced the `/config` directory has been merged into `airflow-dags`. Do we need someone else to approve this CR so we can merge" [deployment-charts] - https://gerrit.wikimedia.org/r/1153704 (https://phabricator.wikimedia.org/T383931) (owner: Aleksandar Mastilovic)
[15:32:14] <wikibugs>	 (CR) Ssingh: [V:+2 C:+2] sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[15:33:00] <wikibugs>	 (PS2) BryanDavis: shellbox-syntaxhighlight: Bump to 2025-06-05-215815 [deployment-charts] - https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249)
[15:34:42] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet
[15:36:22] <wikibugs>	 (CR) CDanis: "+1 to, for now, just skipping any changes to the countries with sample sizes <50" [dns] - https://gerrit.wikimedia.org/r/1153334 (owner: CDobbins)
[15:36:37] <wikibugs>	 (CR) BryanDavis: "Going with Scott's suggestion in Phabricator to only bump syntaxhighlight myself and let him do the bigger work of catching the other depl" [deployment-charts] - https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) (owner: BryanDavis)
[15:39:08] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Bryan!" [deployment-charts] - https://gerrit.wikimedia.org/r/1154132 (https://phabricator.wikimedia.org/T364249) (owner: BryanDavis)
[15:40:12] <wikibugs>	 (Merged) jenkins-bot: sre.cdn.roll-restart-ats: add cookbook for restarting ATS [cookbooks] - https://gerrit.wikimedia.org/r/1152781 (owner: Ssingh)
[15:40:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10891475 (ssingh) >>! In T387145#10886841, @cmooney wrote: >> Check with @cmooney for changes required to hieradata/common/lvs/interfaces.yaml (to add lvs1016 there) and also...
[15:41:12] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet
[15:45:41] <wikibugs>	 (CR) SBassett: [C:+1] alertmanager: adjust phab project to security-team rather than security tag (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1150624 (https://phabricator.wikimedia.org/T388531) (owner: Hnowlan)
[15:47:17] <wikibugs>	 (PS1) Federico Ceratto: team-data-persistence: Add predictive disk space alerts [alerts] - https://gerrit.wikimedia.org/r/1154314
[15:47:17] <wikibugs>	 (CR) Federico Ceratto: "An initial experiment with predictions - not enabled for paging yet" [alerts] - https://gerrit.wikimedia.org/r/1154314 (owner: Federico Ceratto)
[15:49:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[15:51:29] <wikibugs>	 (PS11) CDanis: [WIP] gNMI: spread targets on multiple netflow hosts [puppet] - https://gerrit.wikimedia.org/r/1154303 (owner: Ayounsi)
[15:52:37] <jinxer-wm>	 FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[15:54:20] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[15:55:30] <sbassett>	 Hey all - I know it’s Friday, but I’d like to get a minor update deployed to a private security mitigation that should fix https://phabricator.wikimedia.org/T396111.  Let me know if there are any objections.
[15:57:06] <wikibugs>	 ops-codfw, SRE, DC-Ops: mc-misc2001 won't power up - https://phabricator.wikimedia.org/T395526#10891553 (Jhancock.wm) sent them log info about the server. they're still investigating cause.
[16:03:28] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10891582 (Jhancock.wm) @jhathaway are you still testing the 1G link? i got an automated ticket for the port that it's connected to.
[16:05:23] <wikibugs>	 (CR) SBassett: "After discussing with @emill-ctr@wikimedia.org, I think we'd prefer to keep this in place on beta for all but CU, which shouldn't matter a" [mediawiki-config] - https://gerrit.wikimedia.org/r/1153674 (https://phabricator.wikimedia.org/T396061) (owner: Lucas Werkmeister)
[16:08:08] <sbassett>	 !log Deployed security update to fix T396111
[16:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:12] <stashbot>	 T396111: Wikimedia\NormalizedException\NormalizedException: Invalid username: {username} - https://phabricator.wikimedia.org/T396111
[16:11:19] <wikibugs>	 (PS1) Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530)
[16:11:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10891620 (Andrew) Sorry @Jclark-ctr, I've made a bit of a mess of this.  Ideally each of these hosts would have 2x25G connections, each connected to a cl...
[16:16:57] <wikibugs>	 (PS2) Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530)
[16:17:11] <wikibugs>	 (CR) Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[16:17:36] <wikibugs>	 (CR) Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[16:18:35] <wikibugs>	 (CR) Jforrester: [C:+1] "Generally looks good. One stronger wording suggestion (not sure if we really want to go down the IETF SHOUTY CAPITALS wording model)." [mediawiki-config] - https://gerrit.wikimedia.org/r/1154136 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[16:20:13] <wikibugs>	 (CR) Scott French: "Thanks for the improvements!" [puppet] - https://gerrit.wikimedia.org/r/1153999 (owner: Effie Mouzeli)
[16:20:21] <wikibugs>	 (PS3) Cathal Mooney: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530)
[16:27:21] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10891640 (jhathaway) >>! In T394847#10891582, @Jhancock.wm wrote: > @jhathaway are you still testing the 1G link? i got an automated...
[16:28:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10891644 (Andrew) Looks like I'm getting ahead of things a bit. We definitely do need 2 connections per host, but it's unclear on if we're skipping to 25...
[16:36:17] <wikibugs>	 (CR) CDobbins: add rest of South America (except Falkland Islands) to geo-maps (1 comment) [dns] - https://gerrit.wikimedia.org/r/1153334 (owner: CDobbins)
[16:38:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:39:11] <ChrisDobbins901_>	 !incidents testing access/permissions
[16:39:12] <sirenbot>	 You're not allowed to perform this action.
[16:42:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[16:43:11] <icinga-wm>	 PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2114 MB (3% inode=95%): /tmp 2114 MB (3% inode=95%): /var/tmp 2114 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops
[16:44:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:50:18] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#10891695 (Andrew) Open→Stalled p:Triage→Medium
[16:55:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2244
[16:55:19] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
[16:55:41] <sukhe>	 ChrisDobbins901_: try now please
[16:55:55] <ChrisDobbins901_>	 !incidents testing access/permissions
[16:55:56] <sirenbot>	 team testing not found
[16:55:56] <sirenbot>	 could not find the team
[16:56:03] <sukhe>	 just !incidents
[16:56:04] <ChrisDobbins901_>	 !incidents
[16:56:05] <sirenbot>	 6297 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[16:56:05] <sirenbot>	 6307 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[16:56:05] <sirenbot>	 6306 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[16:56:05] <sirenbot>	 6299 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[16:56:05] <sirenbot>	 6305 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[16:56:06] <sirenbot>	 6298 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[16:56:06] <sirenbot>	 6304 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[16:56:06] <sirenbot>	 6303 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[16:56:07] <sirenbot>	 6302 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[16:56:07] <sukhe>	 but yeah, it seems to work
[16:56:07] <sirenbot>	 6301 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[16:56:08] <sirenbot>	 6300 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[16:56:08] <sukhe>	 great
[16:56:08] <sirenbot>	 6294 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[16:56:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:57:21] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:57:37] <jinxer-wm>	 FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[16:57:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:00:14] <sukhe>	 !log forced agent run on O:alerting_host to reload vopsbot to add cdobbins
[17:00:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002
[17:06:58] <stashbot>	 T383811: Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811
[17:08:51] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002
[17:10:27] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:10:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:16:23] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 20, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending
[17:16:23] <icinga-wm>	 41, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 3962, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:17:37] <jinxer-wm>	 FIRING: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch1100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:19] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 2946 threshold =0.2 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1374, active_shards: 1374, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2946, delayed_unas
[17:19:19] <icinga-wm>	 hards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 31.805555555555554 https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:20:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002
[17:21:10] <stashbot>	 T383811: Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811
[17:21:30] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr1-magru.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[17:22:05] <herron>	 !incidents
[17:22:05] <sirenbot>	 6308 (UNACKED)  Primary inbound port utilisation over 80%  (paged) network noc (cr1-magru.wikimedia.org)
[17:22:05] <sirenbot>	 6297 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[17:22:05] <sirenbot>	 6307 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:22:06] <sirenbot>	 6306 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[17:22:06] <sirenbot>	 6299 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[17:22:06] <sirenbot>	 6305 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:22:06] <sirenbot>	 6298 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[17:22:06] <sirenbot>	 6304 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:22:07] <sirenbot>	 6303 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:22:07] <sirenbot>	 6302 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:22:08] <sirenbot>	 6301 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:22:08] <sirenbot>	 6300 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:22:11] <jinxer-wm>	 FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:22:31] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[17:22:51] <herron>	 !ack 6308
[17:22:51] <sirenbot>	 6308 (ACKED)  Primary inbound port utilisation over 80%  (paged) network noc (cr1-magru.wikimedia.org)
[17:23:02] <herron>	 !incidents
[17:23:02] <sirenbot>	 6308 (ACKED)  Primary inbound port utilisation over 80%  (paged) network noc (cr1-magru.wikimedia.org)
[17:23:02] <sirenbot>	 6309 (UNACKED)  Primary outbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[17:23:02] <sirenbot>	 6297 (RESOLVED)  [3x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet)
[17:23:02] <sirenbot>	 6307 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:23:03] <sirenbot>	 6306 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[17:23:03] <sirenbot>	 6299 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[17:23:03] <sirenbot>	 6305 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:23:03] <sirenbot>	 6298 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[17:23:04] <sirenbot>	 6304 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:23:04] <sirenbot>	 6303 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:23:05] <sirenbot>	 6302 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:23:05] <sirenbot>	 6301 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[17:23:13] <herron>	 !ack 6309
[17:23:19] <urandom>	 already acked
[17:23:24] <urandom>	 race condition detected
[17:23:41] <urandom>	 I should remember to use sirenbot, sorry
[17:27:11] <jinxer-wm>	 RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:29:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244']
[17:29:21] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244']
[17:29:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm
[17:30:02] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10891900 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2244.codfw.wmnet with OS bookworm
[17:33:52] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9200.service on cirrussearch1100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:36:31] <jinxer-wm>	 FIRING: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:46:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage
[17:48:20] <wikibugs>	 (03PS3) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085
[17:49:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage
[17:53:38] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:53:52] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:54:46] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[17:56:18] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1374, active_shards: 4052, relocating_shards: 0, initializing_shards: 29, unassigned_shards: 239, delayed_unassigned_shards: 0, numbe
[17:56:18] <icinga-wm>	 ding_tasks: 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 610, active_shards_percent_as_number: 93.7962962962963 https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:56:31] <jinxer-wm>	 RESOLVED: Traffic bill over quota: Alert for device cr2-magru.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:56:44] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[17:56:44] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[17:57:30] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:57:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:58:44] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 7.142 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[18:01:38] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:01:46] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[18:02:34] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:02:34] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:03:12] <icinga-wm>	 PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2083 MB (3% inode=95%): /tmp 2083 MB (3% inode=95%): /var/tmp 2083 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops
[18:06:36] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:06:36] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:06:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[18:07:31] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:07:34] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:07:34] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[18:09:30] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:09:58] <logmsgbot>	 jhancock@cumin2002 reimage (PID 2610089) is awaiting input
[18:11:30] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr1-magru.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[18:12:38] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:14:28] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:16:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1154302 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris)
[18:17:38] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:18:02] <sukhe>	 ^ did someone look at this?
[18:18:10] <mutante>	 sukhe: I am 
[18:18:23] <sukhe>	 thanks, sorry if it got lost in the noise
[18:18:33] <mutante>	 barely got shell.. just did though
[18:18:36] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:18:36] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[18:22:37] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:22:58] <sukhe>	 mutante: no mgmt interface for this?
[18:23:00] <wikibugs>	 (03PS1) 10CDobbins: throwaway commit; will be reverted [puppet] - 10https://gerrit.wikimedia.org/r/1154336
[18:23:33] <mutante>	 sukhe: no.. but it's because it's a VM
[18:23:49] <sukhe>	 ah this is a VM indeed
[18:24:01] <sukhe>	 nothing I can see in the dmesg output at least
[18:24:12] <mutante>	 it was very busy, and now it's not
[18:24:19] <mutante>	 no smoking gun yet
[18:24:49] <mutante>	 DHCP appears to be working. (saying that because of https://phabricator.wikimedia.org/T383069)
[18:24:52] <sukhe>	 https://grafana.wikimedia.org/goto/ntcig4YNR?orgId=1
[18:25:34] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
[18:25:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10892210 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS b...
[18:28:27] <sukhe>	 grafana was also having issues, that's a VM as well
[18:28:57] <sukhe>	 I wonder where both of them are on the ganeti cluster
[18:29:17] <sukhe>	 no not related
[18:31:38] <sukhe>	 the simple explanation with the rise in allocstall and swap-outs is that it is running out of memory
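A minimal sketch of how the allocstall and swap-out observation above could be checked directly on a Linux host, assuming the counters are read from `/proc/vmstat` (the chat only references a Grafana panel, so this is an illustration, not what was actually run). The function names `parse_vmstat` and `memory_pressure_delta` are hypothetical. Recent kernels split allocation stalls into several `allocstall_*` counters, so the sketch sums every counter with that prefix.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed 'name <non-negative integer>' lines.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def memory_pressure_delta(before, after):
    """Return growth in allocation stalls and swap-outs between two samples."""
    def stalls(counters):
        # Sum allocstall, allocstall_normal, allocstall_movable, etc.
        return sum(v for k, v in counters.items() if k.startswith("allocstall"))

    return {
        "allocstall": stalls(after) - stalls(before),
        "pswpout": after.get("pswpout", 0) - before.get("pswpout", 0),
    }
```

To use it, read `/proc/vmstat` twice a few seconds apart, pass each text through `parse_vmstat`, and compare with `memory_pressure_delta`. Both values growing steadily would be consistent with the host running short of memory.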
[18:32:24] <sukhe>	 ah
[18:32:26] <sukhe>	 it's squid
[18:32:33] <mutante>	 yea, it's squid!
[18:32:37] <mutante>	 or ..it was
[18:33:52] <mutante>	 "url.full": "http://performance-testing-graphite.wmftest.org:8080/render", 
[18:34:00] <sukhe>	 other installservers look fine and no recent puppet changes
[18:34:03] <sukhe>	 so it is just install1004
[18:34:33] <mutante>	 grep performance-testing /var/log/squid/access.log | wc -l
[18:34:33] <mutante>	 84868
[18:35:02] <mutante>	 blames wmftest.org ?
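The `grep ... | wc -l` above counts hits for one destination that was already suspected. A hedged sketch of the more general question (which destinations dominate the proxy log) follows. It assumes only that each log line contains the requested URL somewhere in it, which holds for both the JSON-style `"url.full"` field quoted above and Squid's native log format. The function name `top_destinations` is hypothetical. `CONNECT` entries that log a bare `host:port` without a scheme are not counted by this sketch.

```python
import re
from collections import Counter

# Matches the host part of the first http(s) URL found in a line.
URL_HOST = re.compile(r'https?://([^/:\s"]+)')


def top_destinations(lines, n=5):
    """Count requests per destination host and return the n most frequent."""
    hosts = Counter()
    for line in lines:
        match = URL_HOST.search(line)
        if match:
            hosts[match.group(1).lower()] += 1
    return hosts.most_common(n)
```

Called as `top_destinations(open("/var/log/squid/access.log"))`, it returns a list of `(host, count)` pairs, so a single noisy client target stands out without having to guess its name first.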
[18:42:23] <wikibugs>	 (03Abandoned) 10CDobbins: throwaway commit; will be reverted [puppet] - 10https://gerrit.wikimedia.org/r/1154336 (owner: 10CDobbins)
[18:48:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:53:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:01:49] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[19:01:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm
[19:01:57] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10892311 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2244.codfw.wmnet with OS bookworm completed: - db2244 (**PASS**)   - Remov...
[19:07:26] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:09:18] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1374, active_shards: 4084, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 236, delayed_unassigned_shards: 236, numb
[19:09:18] <icinga-wm>	 nding_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.53703703703704 https://wikitech.wikimedia.org/wiki/Search%23Administration
[19:11:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002
[19:11:26] <stashbot>	 T383811: Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811
[19:14:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 311108888 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:17:00] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7631064 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:19:55] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10892387 (10KFrancis) Happy to help.  I'll need the user's full name and email address.  If they...
[19:29:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:31:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye
[19:31:55] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10892432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2006.codfw.wmnet with OS bu...
[19:39:43] <jinxer-wm>	 RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:45:49] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
[19:45:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10892448 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls...
[19:52:37] <jinxer-wm>	 FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[20:03:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage
[20:06:49] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage
[20:15:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye
[20:15:48] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10892478 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2007.codfw.wmnet with OS bu...
[20:16:13] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[20:16:36] <wikibugs>	 06SRE, 10Observability-Alerting, 07SecTeam-Processed, 07Security: Update MediaWikiElevatedUnknownLogins alert recipients - https://phabricator.wikimedia.org/T395117#10892479 (10sbassett) 05Open→03Resolved p:05Triage→03Medium
[20:28:45] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10892493 (10Jhancock.wm) 05Open→03Resolved
[20:29:32] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10892496 (10Jhancock.wm) @Marostegui install of this one is complete
[20:35:56] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[20:38:08] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10892513 (10Eevans) a:03MatthewVernon Ok, earlier today @Jhancock.wm swapped the failed drive for us, and for some reason this caused the machine to spontaneously reboot (hardware...
[20:38:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
[20:40:13] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[20:41:55] <wikibugs>	 06SRE, 10SRE-swift-storage: Consider increasing swift workers on proxy nodes to 32 - https://phabricator.wikimedia.org/T396203#10892531 (10Eevans) I responded to that incident and was quite surprised that we seemed to be "saturated", with what seemed like so much headroom in all the usual dimensions.  I'd be +...
[20:42:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
[20:43:12] <icinga-wm>	 PROBLEM - Disk space on an-worker1110 is CRITICAL: DISK CRITICAL - free space: / 2068 MB (3% inode=95%): /tmp 2068 MB (3% inode=95%): /var/tmp 2068 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1110&var-datasource=eqiad+prometheus/ops
[20:45:46] <wikibugs>	 10ops-codfw, 06SRE, 06SRE-OnFire, 10Cassandra, and 3 others: additional sessionstore expansion — codfw - https://phabricator.wikimedia.org/T395954#10892534 (10Eevans) >>! In T395954#10884228, @Jhancock.wm wrote: > i have 12 x 480GB drives readily available on site    >>! In T395954#10887801, @Jhancock.wm w...
[20:57:37] <jinxer-wm>	 FIRING: GoRoutinesTooHigh: gNMIc running on netflow1002 have more than 10000 Go routines. - https://wikitech.wikimedia.org/wiki/Network_telemetry#GoRoutinesTooHigh - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGoRoutinesTooHigh
[21:02:18] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
[21:05:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10892561 (10Dwisehaupt) Ok. I have done a chunk of testing: * When booting the host with no drive in the path, I can see both interfaces connect and a...
[21:15:12] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[21:19:55] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[21:25:35] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[21:33:32] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[21:41:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[21:46:20] <jinxer-wm>	 RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[22:53:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:12:28] <wikibugs>	 (03PS1) 10Cwhite: logstash: add filter_on_template_v2 [puppet] - 10https://gerrit.wikimedia.org/r/1154348 (https://phabricator.wikimedia.org/T234565)
[23:13:27] <wikibugs>	 (03PS1) 10Cwhite: add dependencies to readme [software/ecs] - 10https://gerrit.wikimedia.org/r/1154349
[23:14:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: add filter_on_template_v2 [puppet] - 10https://gerrit.wikimedia.org/r/1154348 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[23:38:48] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154350
[23:38:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154350 (owner: 10TrainBranchBot)
[23:49:49] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1154350 (owner: 10TrainBranchBot)
[23:52:37] <jinxer-wm>	 FIRING: SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity