[01:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[01:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[01:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[01:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[01:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[02:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[02:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[02:12:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[03:41:47] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org
[03:43:32] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 75 probes of 630 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:46:47] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org
[03:49:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 630 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:05:02] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[04:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[04:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[04:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[04:57:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[05:02:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[05:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[05:12:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[06:00:54] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=labsnfs file=node_directory_size_bytes.prom instance=labstore1004 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210620T0700)
[07:03:14] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[07:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[07:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[07:47:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[07:52:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[07:53:36] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[07:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[07:58:50] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[08:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[08:07:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[08:12:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[09:55:33] (CR) Zabe: [C: +1] ptwikinews: Remove NS ID 102,103 [mediawiki-config] - https://gerrit.wikimedia.org/r/700428 (https://phabricator.wikimedia.org/T285163) (owner: Urbanecm)
[10:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[10:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[10:43:24] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[10:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[10:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[11:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[11:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[11:12:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[11:44:08] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:29:53] SRE, Traffic, Chinese-Sites: Adding an image to a zh.wp article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (RhinosF1)
[13:30:12] SRE, Traffic, Chinese-Sites: Adding an image to a zh.wp article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (RhinosF1) Does this actually only affect Chinese Wikipedia?
[13:31:13] XioNoX: could it be an eqsin issue?
[13:32:26] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[13:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[13:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[13:45:39] SRE, MediaWiki-Uploading, Traffic, Chinese-Sites, Performance Issue: Adding an image to zh.wp 长江桥隧列表 article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (RhinosF1) Reproduced by adding an image to that article specifically. Request from via cp3062 cp3062, V...
[13:45:46] SRE, MediaWiki-Uploading, Traffic, Chinese-Sites, Performance Issue: Adding an image to zh.wp 长江桥隧列表 article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (RhinosF1)
[13:46:07] SRE, MediaWiki-Uploading, Traffic, Chinese-Sites, Performance Issue: Adding an image to zh.wp 长江桥隧列表 article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (RhinosF1) The whole page seems very slow to load though
[13:46:23] No it's just a very slow page
[13:47:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[13:50:10] SRE, MediaWiki-Uploading, Traffic, Chinese-Sites, Performance Issue: Adding an image to zh.wp 长江桥隧列表 article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (IN) Can't be seen on another article, see https://zhwp.org/w/index.php?title=%E8%A1%A8%E6%83%85%E5%8C%85&diff=6...
[13:52:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[13:54:12] SRE, MediaWiki-Uploading, Traffic, Chinese-Sites, Performance Issue: Adding an image to zh.wp 长江桥隧列表 article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (RhinosF1) Parser profiling data shows over 5 seconds to load it
[13:55:30] SRE, MediaWiki-Uploading, Traffic, Chinese-Sites, Performance Issue: Adding an image to zh.wp 长江桥隧列表 article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (IN) To avoid messing with the Chinese Wikipedia, I tried to copy the article to my test page for tes...
[13:57:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[14:02:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[14:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[14:10:41] SRE, MediaWiki-Uploading, Traffic, Chinese-Sites, Performance Issue: Adding an image to zh.wp 长江桥隧列表 article throws HTTP 503 or 504 error - https://phabricator.wikimedia.org/T285160 (IN) The test showed that if I did an API edit to my mirror page in the sandbox, it would take the server...
[14:12:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[16:37:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[16:42:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[16:47:24] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[16:47:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[16:52:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[16:56:44] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[16:57:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[17:02:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[17:05:24] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6032 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[17:07:50] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[17:08:58] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03175 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:12:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[17:32:50] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[17:42:28] PROBLEM - SSH on mw1297.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:43:10] RECOVERY - SSH on mw1297.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[19:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[19:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[19:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[19:57:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[20:02:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[20:07:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[20:12:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[21:13:20] (PS1) Zabe: Disable Education Program namespaces in enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/700455 (https://phabricator.wikimedia.org/T285193)
[21:15:06] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:16:56] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[21:17:00] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 6.512e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:07:10] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:08:55] ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project andrew bogott investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:22:08] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:30:41] (PS3) Andrew Bogott: Toolforge bastions: add a broken shell for disabled tools [puppet] - https://gerrit.wikimedia.org/r/699577 (https://phabricator.wikimedia.org/T170355)
[22:30:43] (PS1) Andrew Bogott: Cloud-vps puppetmasters: rotate enc logs [puppet] - https://gerrit.wikimedia.org/r/700456
[22:31:29] (CR) jerkins-bot: [V: -1] Cloud-vps puppetmasters: rotate enc logs [puppet] - https://gerrit.wikimedia.org/r/700456 (owner: Andrew Bogott)
[22:32:50] (CR) Andrew Bogott: [C: +2] Toolforge bastions: add a broken shell for disabled tools [puppet] - https://gerrit.wikimedia.org/r/699577 (https://phabricator.wikimedia.org/T170355) (owner: Andrew Bogott)
[22:34:01] (PS2) Andrew Bogott: Cloud-vps puppetmasters: rotate enc logs [puppet] - https://gerrit.wikimedia.org/r/700456
[22:35:02] (CR) Andrew Bogott: [C: +2] Cloud-vps puppetmasters: rotate enc logs [puppet] - https://gerrit.wikimedia.org/r/700456 (owner: Andrew Bogott)
[22:37:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[22:42:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[22:44:20] (PS1) Andrew Bogott: cloud-vps puppetmaster: fix log name for enc log rotation [puppet] - https://gerrit.wikimedia.org/r/700457
[22:45:22] (CR) Andrew Bogott: [C: +2] cloud-vps puppetmaster: fix log name for enc log rotation [puppet] - https://gerrit.wikimedia.org/r/700457 (owner: Andrew Bogott)
[22:47:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[22:52:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[22:57:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[23:02:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[23:07:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[23:12:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org