[00:39:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921311 [00:39:43] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921311 (owner: TrainBranchBot) [00:49:49] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:56:18] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921311 (owner: TrainBranchBot) [01:07:13] (CR) Superpes15: "Did you follow logos/README.md? Seems that tox wasn't run" [mediawiki-config] - https://gerrit.wikimedia.org/r/921372 (owner: Robertsky) [01:14:39] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:57] PROBLEM - PHP7 rendering on 
mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:44:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:45:21] RECOVERY - PHP7 rendering on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:49:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:54:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:55:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:59:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:00:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:04:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:06:47] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:08:15] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 3.412 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:09:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:10:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:14:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:14:31] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:15:25] PROBLEM - PHP7 rendering on mw1494 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:01] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 6.477 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:16:57] RECOVERY - PHP7 rendering on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 8.540 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:23:47] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:25:17] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 7.097 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:29:45] PROBLEM - PHP7 rendering on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:29:59] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:31:29] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 6.541 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:31:47] PROBLEM - PHP7 rendering on mw1468 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:32:49] RECOVERY - PHP7 rendering on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 6.867 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:33:11] RECOVERY - PHP7 rendering on mw1468 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:41:29] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: 
Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:42:51] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled ht [03:42:51] kitech.wikimedia.org/wiki/PyBal [03:43:55] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:43:59] PROBLEM - PHP7 jobrunner on mw1468 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:44:43] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:45:23] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 5.278 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:45:27] RECOVERY - PHP7 jobrunner on mw1468 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 5.912 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:46:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:46:09] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.196 
second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:47:05] PROBLEM - PHP7 rendering on mw1468 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:47:27] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimed [03:47:27] iki/PyBal [03:47:37] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are mar [03:47:37] but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:48:37] RECOVERY - PHP7 rendering on mw1468 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 9.406 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:50:03] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:51:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:51:35] RECOVERY - PHP7 jobrunner on mw1494 is OK: 
HTTP OK: HTTP/1.1 200 OK - 283 bytes in 8.625 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:51:45] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:53:09] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.257 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:55:33] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:56:57] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:58:39] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:59:17] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:00:03] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.209 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:00:41] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:01:21] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet are mar [04:01:21] but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [04:02:29] PROBLEM - PHP7 rendering on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:02:37] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:03:55] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:05:35] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.201 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:05:59] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:07:49] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [04:08:29] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:08:29] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 4.044 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:08:41] RECOVERY - PHP7 rendering on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 7.753 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:09:17] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [04:09:29] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:10:39] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:10:47] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:10:55] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.193 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:11:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:11:27] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.209 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:14:07] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:14:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:15:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:15:31] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.181 
second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:16:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:16:13] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:16:49] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:16:59] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:17:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:17:37] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:19:55] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:20:05] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:20:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:21:22] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:22:47] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:23:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:24:11] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.188 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:25:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:26:07] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:26:22] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:29:39] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:30:13] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:32:19] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled ht [04:32:19] kitech.wikimedia.org/wiki/PyBal [04:32:27] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:34:11] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.573 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook 
[04:36:17] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 2.308 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:37:27] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:38:59] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:40:23] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 2.364 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:45:01] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wik [04:45:09] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:45:41] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:46:35] PROBLEM - PHP7 rendering on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:46:41] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.942 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:47:05] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.200 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:48:01] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [04:48:01] RECOVERY - PHP7 rendering on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 2.903 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:49:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:50:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:50:25] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:51:21] PROBLEM - 
WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:51:51] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 1.645 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:51:53] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:52:25] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimed [04:52:25] iki/PyBal [04:52:47] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:54:21] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.279 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:54:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:55:19] PROBLEM - PHP7 rendering on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:56:29] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 5.414 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:56:43] RECOVERY - PHP7 rendering on 
mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.729 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:56:57] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:58:21] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.216 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:59:09] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:59:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:00:09] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimed [05:00:09] iki/PyBal [05:01:13] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:01:51] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked 
down but pooled https://wikitech.wikimed [05:01:51] iki/PyBal [05:03:09] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:05:43] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:06:21] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:06:29] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:06:53] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:09:13] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.738 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:09:25] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:09:33] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:10:29] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:11:23] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:11:23] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:11:57] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 4.863 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:13:59] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:15:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:15:23] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.195 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:15:43] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: 
Servers wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:16:11] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:16:11] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:16:41] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:19:11] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.666 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:19:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:23:19] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:24:22] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:25:25] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 7.289 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:25:47] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.595 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:27:03] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:27:55] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:31:41] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:32:35] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:32:43] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:33:07] RECOVERY - WDQS SPARQL 
on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 2.085 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:35:15] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:38:15] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.234 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:39:33] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:40:59] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.211 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:41:05] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:42:29] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.198 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:43:31] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:44:07] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:48:03] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.196 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:48:15] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: 
wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:48:19] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2009.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:49:21] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:52:19] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:53:35] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:55:11] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:55:57] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:56:03] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers 
wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2010.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:56:33] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.194 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:57:05] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:58:05] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.194 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:00:03] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.180 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:01:23] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:02:49] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.635 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:27] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:07:31] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:08:23] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:08:29] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2009.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:08:57] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.844 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:57] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:10:35] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.625 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:11:23] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.523 second response 
time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:15:47] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:16:11] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:17:13] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:20:51] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2011.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:21:39] PROBLEM - WDQS SPARQL on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:22:03] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:23:05] RECOVERY - WDQS SPARQL on wdqs2010 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:23:29] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.783 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:23:51] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.182 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:43:57] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230520T0700) [07:40:55] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:07:49] (03PS1) 10Jameel Kaisar: Set NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) [08:08:12] (03CR) 10CI reject: [V: 04-1] Set NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [08:14:51] (03PS2) 10Jameel Kaisar: Set NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) [08:15:51] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [08:19:02] (03PS2) 10Dreamy Jazz: Always collapse by default the CheckUserHelper on loginwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/886370 (https://phabricator.wikimedia.org/T328726) [08:58:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:59:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:00:57] !log volans@cumin1001 START - Cookbook sre.dns.netbox [09:04:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:07:18] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added records for the new private.codfw.wikimedia.cloud domain - volans@cumin1001" [09:08:21] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added records for the new private.codfw.wikimedia.cloud domain - volans@cumin1001" [09:08:22] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:12:08] (03PS12) 10Volans: templates: convert 172.20.5.0/24 to Nebox [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [09:15:21] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: 
10Arturo Borrero Gonzalez) [09:15:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] templates: convert 172.20.5.0/24 to Nebox [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759) (owner: 10Arturo Borrero Gonzalez) [09:16:12] (03CR) 10Krinkle: [C: 04-1] Remove innodb_lock_wait_timeout from the DatabaseMysqli SET statement in open() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918612 (owner: 10Aaron Schulz) [09:16:45] (03PS1) 10Zabe: manage-dblist: Add remove as alias for del [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921453 [09:23:18] (03CR) 10Krinkle: [C: 03+1] manage-dblist: Add remove as alias for del [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921453 (owner: 10Zabe) [09:36:06] (03CR) 10Jelto: [C: 03+2] miscweb/annualreport: bump image version to v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921354 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [09:36:51] (03Merged) 10jenkins-bot: miscweb/annualreport: bump image version to v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921354 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [09:54:37] 10SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Deoband Community Wikimedia - https://phabricator.wikimedia.org/T322996 (10TheAafi) @Ladsgroup, could you please make this mailing list as a private one? Over the last few months, we have had some discussions internally and we felt that keeping the li... 
[09:54:57] 10SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Deoband Community Wikimedia - https://phabricator.wikimedia.org/T322996 (10TheAafi) 05Resolved→03Open [09:59:08] (03CR) 10Zabe: [C: 03+2] "does not need deployment -> okay to merge on weekend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921453 (owner: 10Zabe) [09:59:58] (03Merged) 10jenkins-bot: manage-dblist: Add remove as alias for del [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921453 (owner: 10Zabe) [10:00:14] (03PS1) 10Zabe: Move jquery.wikibase.wbtooltip and dependencies to Lib [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921164 (https://phabricator.wikimedia.org/T337081) [10:40:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) [10:47:22] (03PS2) 10Cathal Mooney: Add disable_ra var to homer config to enable manual disabling of IPv6 RAs [homer/public] - 10https://gerrit.wikimedia.org/r/921400 (https://phabricator.wikimedia.org/T337057) [10:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:15:07] (03PS3) 10Jameel Kaisar: Set NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) [11:17:28] (03CR) 10Jameel Kaisar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/921437 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar) [11:42:52] (03PS1) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [11:43:15] (03CR) 10CI reject: [V: 04-1] gitlab: use sshkey for git-ssh 
public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [11:43:45] 10SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Deoband Community Wikimedia - https://phabricator.wikimedia.org/T322996 (10Aklapper) 05Open→03Resolved Hi, please file a new task for a new request. This was about creating a mailing list, and the list was already created. So the task is done. Tha... [11:44:41] (03PS2) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [11:46:39] (03PS3) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [11:47:41] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "Marius says he identified a better fix, so let’s not backport this just yet." [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921164 (https://phabricator.wikimedia.org/T337081) (owner: 10Zabe) [11:56:46] (03PS4) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [12:03:34] (03CR) 10Jelto: [V: 03+1] "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [12:15:27] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:45] (03CR) 10Volans: "LGTM, one question inline" [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [12:56:15] (03PS1) 10Volans: cloudcumin: fix SSH key config [puppet] - 10https://gerrit.wikimedia.org/r/921566 (https://phabricator.wikimedia.org/T323484) [12:57:10] 10SRE, 10SRE-Access-Requests: Requesting access to JS moniring for eccenux - https://phabricator.wikimedia.org/T337121 (10Nux) [12:58:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:24:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Support urllib 2.x [software/conftool] - 10https://gerrit.wikimedia.org/r/903599 (owner: 10Giuseppe Lavagetto) [13:28:18] (03Merged) 10jenkins-bot: Support urllib 2.x [software/conftool] - 10https://gerrit.wikimedia.org/r/903599 (owner: 10Giuseppe Lavagetto) [13:30:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Release 2.3.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/903600 (owner: 10Giuseppe Lavagetto) [13:34:00] (03Merged) 10jenkins-bot: Release 2.3.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/903600 (owner: 10Giuseppe Lavagetto) [13:34:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add black formatting and enforcement [software/conftool] - 10https://gerrit.wikimedia.org/r/903617 (owner: 10Giuseppe Lavagetto) [13:37:28] (03Merged) 10jenkins-bot: Add black formatting and enforcement [software/conftool] - 10https://gerrit.wikimedia.org/r/903617 (owner: 10Giuseppe Lavagetto) 
[13:57:42] 10SRE, 10LDAP-Access-Requests: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 (10Dreamy_Jazz) >>! In T337126#8867589, @taavi wrote: > This would be the `nda` LDAP group. Thanks. [14:04:25] (03Abandoned) 10Zabe: Move jquery.wikibase.wbtooltip and dependencies to Lib [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921164 (https://phabricator.wikimedia.org/T337081) (owner: 10Zabe) [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:10] (03PS1) 10Hoo man: Remove linkitem dependency on jquery.wikibase.wbtooltip [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921549 (https://phabricator.wikimedia.org/T337081) [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:50] thcipriani: hashar: ping [14:27:59] (03CR) 10Itamar Givon: [C: 03+1] Remove linkitem dependency on jquery.wikibase.wbtooltip [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921549 (https://phabricator.wikimedia.org/T337081) (owner: 10Hoo man) [14:33:55] (03PS1) 10Zabe: manage-dblist: Add init command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) [14:35:22] Amir1: 
https://gerrit.wikimedia.org/r/921549 This is the backport [14:36:10] (03PS2) 10Zabe: manage-dblist: Add init command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) [14:37:16] (03PS3) 10Zabe: manage-dblist: Add init command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) [14:40:26] hoo: o/ what's up? [14:40:49] thcipriani: https://gerrit.wikimedia.org/r/921549 This is a trivial fix for a UBN [14:41:20] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=parse1018.eqiad.wmnet [14:41:22] * thcipriani reads [14:43:16] hoo: I can backport that if you can check me on it [14:43:39] thcipriani: I can also do that, whatever works best for you [14:43:47] (03CR) 10Ladsgroup: [C: 03+1] "Approval from SRE to deploy in off-hours" [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921549 (https://phabricator.wikimedia.org/T337081) (owner: 10Hoo man) [14:43:55] (I'm at the hackathon, but not using the provided wlan, but rather roaming) [14:44:41] hoo: ah, didn't realize you could DIY deploy - if you're fine doing that given the wifi situation: go for it [14:44:54] (if you'd rather have me do it: also fine :)) [14:45:17] I'll go ahead :) [14:45:55] cool, let me know if you need anything. Also, pro tip: we have tmux on the deployment server for spotty-internet reasons :) [14:47:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hoo@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921549 (https://phabricator.wikimedia.org/T337081) (owner: 10Hoo man) [14:47:54] 10SRE, 10SRE-Access-Requests: Requesting access to JS moniring for eccenux - https://phabricator.wikimedia.org/T337121 (10sgrabarczuk) As a community-facing person, I support this request. Nux is a trusted person with a long experience in collaborating with us. 
[14:48:11] thcipriani: Using screen (because I never learned the tmux keys…)... but still with spotty wifi seems dangerous [14:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:53:18] 10SRE, 10SRE-Access-Requests: Requesting access to JS moniring for eccenux - https://phabricator.wikimedia.org/T337121 (10taavi) `analytics-privatedata-users` is the wrong group and #sre-access-requests is the wrong workflow here. What you want for (especially client) log access is Logstash access which can be... [15:03:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:33] (03Merged) 10jenkins-bot: Remove linkitem dependency on jquery.wikibase.wbtooltip [extensions/Wikibase] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921549 (https://phabricator.wikimedia.org/T337081) (owner: 10Hoo man) [15:08:57] !log hoo@deploy1002 Started scap: Backport for [[gerrit:921549|Remove linkitem dependency on jquery.wikibase.wbtooltip (T337081)]] [15:09:02] T337081: "Add interlanguage link" shows a bubble: "An unknown error occurred" - https://phabricator.wikimedia.org/T337081 [15:10:24] !log hoo@deploy1002 hoo: Backport for [[gerrit:921549|Remove linkitem dependency on 
jquery.wikibase.wbtooltip (T337081)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:13:34] (03PS6) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [15:14:01] (03CR) 10CI reject: [V: 04-1] gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [15:14:07] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:03] (03CR) 10Jelto: gitlab: use sshkey for git-ssh public keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [15:17:45] !log hoo@deploy1002 Finished scap: Backport for [[gerrit:921549|Remove linkitem dependency on jquery.wikibase.wbtooltip (T337081)]] (duration: 08m 47s) [15:17:51] T337081: "Add interlanguage link" shows a bubble: "An unknown error occurred" - https://phabricator.wikimedia.org/T337081 [15:18:30] \o/ (wfm again) thanks hoo <3 [15:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:18:50] Same for me, nice :) [15:19:07] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:15] (old version might still be cached for some, so might need a hard refresh) [15:19:26] (03PS7) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [15:19:50] (03CR) 10CI reject: [V: 04-1] gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) (owner: 10Jelto) [15:20:34] (03PS8) 10Jelto: gitlab: use sshkey for git-ssh public keys [puppet] - 10https://gerrit.wikimedia.org/r/921506 (https://phabricator.wikimedia.org/T337107) [15:23:19] (03PS4) 10Zabe: manage-dblist: Add init command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) [15:23:25] (03CR) 10Urbanecm: [C: 04-1] "-1 mainly because of the special wikis bit. Thanks for working on this!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) (owner: 10Zabe) [15:23:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:24:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:02] (03CR) 10Ladsgroup: manage-dblist: Add init command (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) (owner: 10Zabe) [15:26:10] (03PS1) 10Urbanecm: Migrate GrowthExperiments config to its own file [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/921599 (https://phabricator.wikimedia.org/T308932) [15:29:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:30:19] (03CR) 10FNegri: [C: 03+1] "Now I also want a patch for the ssh_config man file :D" [puppet] - 10https://gerrit.wikimedia.org/r/921566 (https://phabricator.wikimedia.org/T323484) (owner: 10Volans) [15:31:20] (03CR) 10Volans: [C: 03+2] cloudcumin: fix SSH key config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921566 (https://phabricator.wikimedia.org/T323484) (owner: 10Volans) [15:45:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:23] (03CR) 10Urbanecm: manage-dblist: Add init command (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) (owner: 10Zabe) [15:46:27] (03CR) 10Urbanecm: [C: 04-1] manage-dblist: Add init command (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) (owner: 10Zabe) [15:50:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificates) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:55:28] (03CR) 10Gergő Tisza: [C: 03+1] Migrate GrowthExperiments config to its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921599 (https://phabricator.wikimedia.org/T308932) (owner: 10Urbanecm) [15:58:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST certificates) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:46] (03PS13) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [16:26:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:30:22] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:35:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:39:04] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=parse1018.eqiad.wmnet [16:40:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:46:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:56:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:58:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:39:42] (03CR) 10Kosta Harlan: [C: 03+1] "Didn't review it line-by-line, but I'm in favor of the concept." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/921599 (https://phabricator.wikimedia.org/T308932) (owner: 10Urbanecm) [17:43:06] (03CR) 10Zabe: [C: 03+1] "looks good, diffDocker detects no changes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921599 (https://phabricator.wikimedia.org/T308932) (owner: 10Urbanecm) [18:08:49] 10SRE, 10Inuka-Team, 10Wikipedia-Preview, 10User-bd808, 10Wikimedia-Hackathon-2023: Add both Wikipedia Preview repos to Packagist - https://phabricator.wikimedia.org/T310938 (10bd808) [18:11:00] (03PS5) 10Zabe: manage-dblist: Add init command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) [18:11:48] (03CR) 10Zabe: manage-dblist: Add init command (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) (owner: 10Zabe) [18:12:18] (03PS6) 10Zabe: manage-dblist: Add init command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) [18:16:03] (03CR) 10Zabe: manage-dblist: Add init command (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921585 (https://phabricator.wikimedia.org/T330059) (owner: 10Zabe) [18:18:38] (03PS1) 10Robertsky: going through the tox as stated in the readme [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 [18:24:16] (03PS1) 10Simon04: Enable the Wikibase REST API on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921612 (https://phabricator.wikimedia.org/T337141) [18:24:39] (03PS1) 10Zabe: update composer dependencies to latest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921613 [18:25:27] !log restart varnish cp3061 [18:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:35] (03PS1) 10Zabe: Enable VE on new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921614 [18:29:55] (03PS2) 10Zabe: Enable VE on new wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/921614 [18:30:05] (03CR) 10Zabe: [C: 03+2] update composer dependencies to latest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921613 (owner: 10Zabe) [18:30:54] (03Merged) 10jenkins-bot: update composer dependencies to latest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921613 (owner: 10Zabe) [18:45:27] (03PS1) 10Volans: varnish: fix call to cluster_fe_ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/921617 (https://phabricator.wikimedia.org/T337142) [18:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:24:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:29:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:37:12] (03CR) 10Lucas Werkmeister: "Note: I haven’t tested this yet. 
A while ago I had a setup to use a local clone of the webservice command in a tool, maybe I can see if I " [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/921620 (owner: 10Lucas Werkmeister) [19:38:08] (03CR) 10Lucas Werkmeister: Restart Kubernetes webservices more cleanly (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/921620 (owner: 10Lucas Werkmeister) [19:53:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:58:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:04:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:11:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:16:07] (ProbeDown) resolved: Service 
videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:18:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:23:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:35:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:40:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:43:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:48:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:50:46] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10Nux) [20:52:34] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10Nux) > analytics-privatedata-users is the wrong group and SRE-Access-Requests is the wrong workflow here. What you want for (especially client) log access is Logstash access which can be received via t... [20:53:16] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10RhinosF1) @Nux: do you have a WMF sponsor for this? [20:54:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:54:29] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (10RhinosF1) >>! 
In T337121#8868286, @RhinosF1 wrote: > @Nux: do you have a WMF sponsor for this? @SGrabarczuk: Can you sponsor this formally? You'll need your manager and C-Level sign off? [20:58:00] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:59:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:01:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:06:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:14:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:40:39] PROBLEM - PHP7 rendering on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:42:13] RECOVERY - PHP7 rendering on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 9.825 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:44:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:47:31] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:49:03] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 9.088 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:52:11] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:53:41] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 7.827 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:56:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:56:47] PROBLEM - PHP7 jobrunner on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:58:19] RECOVERY - PHP7 jobrunner on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 8.309 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:01:07] (ProbeDown) resolved: 
(2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:01:39] PROBLEM - PHP7 rendering on mw1494 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:03:05] RECOVERY - PHP7 rendering on mw1494 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 3.549 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:04:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:19:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:22:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:37:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale