[00:52:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:54:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:17:15] SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (Legoktm) I am generally in favor of making our maps as broadly usable as possible, regardless of who is using it, just like how we allow and encourage reuse of article content, images, etc. If we feel the new maps...
[02:42:06] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:43:12] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:38:50] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:02:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:10] SRE-swift-storage, Commons, Internet-Archive, MediaWiki-API, and 3 others: Large PDF upload issue - https://phabricator.wikimedia.org/T254459 (Midleading) PDF files as big as 2.13 GB have been uploaded to Commons with some tweaks to API requests. This was never imaginable before. Thanks!
[08:15:04] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[08:17:16] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:01:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 27 Dec 2021 09:00:28 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:05:26] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 25 Feb 2022 08:56:29 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:42:22] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:16:38] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:24:04] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:24:16] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:39:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 27 Dec 2021 09:00:28 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:42:00] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 25 Feb 2022 08:56:29 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:42:34] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:17:44] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:25:10] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:25:22] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:43:42] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:24:55] (PS1) PipelineBot: apple-search: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/748349
[12:41:23] (PS1) PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/748350
[12:43:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 27 Dec 2021 09:00:28 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:45:07] (PS1) PipelineBot: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/748353
[12:45:52] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 25 Feb 2022 08:56:29 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:49:28] (PS1) PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/748354
[13:09:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 27 Dec 2021 09:00:28 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:11:58] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 25 Feb 2022 08:56:29 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:46:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 27 Dec 2021 09:00:28 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:50:46] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 25 Feb 2022 08:56:29 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:47:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 27 Dec 2021 09:00:28 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:48:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 27 Dec 2021 09:00:28 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:50:02] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 25 Feb 2022 08:56:29 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:55:18] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 25 Feb 2022 08:56:29 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:52:54] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:57:24] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:02:25] So is parsoid
[16:03:53] Them graphs don't look right
[16:04:28] Look at https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?viewPanel=3&orgId=1&from=now-6h&to=now
[16:17:44] Looks like a bad bot ^
[16:18:03] Traffic to api jumped 14:13
[17:06:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Mon 27 Dec 2021 09:00:28 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:07:13] Oh that's not flapping again
[17:07:39] urbanecm: that's also something that needs doing ^, I assume it's the stupid restart bug
[17:07:51] not a SRE :)
[17:08:16] Amir1: was recently online
[17:08:31] Amir1: can we bribe you into restarting Apache
[17:08:50] ugh
[17:08:51] on it
[17:08:58] RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 25 Feb 2022 08:56:29 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:09:04] Ref https://phabricator.wikimedia.org/T293826
[17:09:18] it has a week
[17:09:19] it says 7 days, probably not urgent enough to get people to fix it on a sunday if they are not around
[17:09:59] majavah: Martin was only talking to me like half an hour ago
[17:10:32] !log restart apache2 on lists1001 (T293826)
[17:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:39] T293826: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826
[17:11:37] restarted, let's see if it flaps again
[17:11:51] thanks
[17:12:04] * Amir1 goes back to eating dinner
[17:12:43] Enjoy
[17:51:17] SRE-swift-storage, Commons, affects-Kiwix-and-openZIM: JPEG image is reported with the wrong mime-type application/octet-stream - https://phabricator.wikimedia.org/T298011 (Aklapper)
[17:55:20] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:56:28] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:15:03] SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (Ed6767) The compelling argument is that they would like to use Welsh language maps - so, given there is no alternative and it would increase accessibility for Welsh users of the website, I'm not opposed given thes...
[23:28:00] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (Jclark-ctr)
[23:59:42] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook