[00:14:53] Labs, Labs-Infrastructure, Salt, Patch-For-Review: update salt key monitoring scripts for labs to new nova api version - https://phabricator.wikimedia.org/T123607#2624884 (AlexMonk-WMF) a:ArielGlenn>AlexMonk-WMF
[00:17:33] PROBLEM - Puppet staleness on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0]
[01:01:52] goddamit
[01:02:50] hi yuvipanda
[01:03:59] hi krenair
[01:04:05] what's up?
[01:04:06] I'm just going to reboot k8s-etcd-01 and then leave again
[01:04:08] I didn't touch anything this time
[01:04:16] tools-k8s-etcd-01 is in the io hung state again
[01:04:21] ugh
[01:04:37] have we asked upstream about that issue?
[01:05:22] I've no idea.
[01:06:47] !log tools migrate tools-k8s-etcd-01 to labvirt1012, is in state doing no io
[01:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[01:07:22] Labs, Labs-Infrastructure, Patch-For-Review: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2625013 (yuvipanda) Just happened to tools-k8s-etcd-01. Given it's a friday evening and I'm trying really hard to not work this weekend, I've just migrated it to a different host and...
[01:10:22] PROBLEM - Host tools-k8s-etcd-01 is DOWN: CRITICAL - Host Unreachable (10.68.21.254)
[01:17:41] RECOVERY - Host tools-k8s-etcd-01 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[01:20:33] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[01:27:34] RECOVERY - Puppet staleness on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[01:28:54] migration complete, and it's back up
[01:30:32] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:20:53] (CR) Jean-Frédéric: Replace TestFillTableMonumentsBase by CustomAssertions (2 comments) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/302887 (owner: Lokal Profil)
[08:21:50] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:29:25] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms
[08:59:26] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[09:11:22] Labs, Tool-Labs: jsub should respect .sge_request - https://phabricator.wikimedia.org/T145269#2625220 (whym)
[09:43:12] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[09:47:24] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms
[09:48:08] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[10:37:29] Labs, Tool-Labs: jsub should respect .sge_request - https://phabricator.wikimedia.org/T145269#2625220 (valhallasw) If you want to use advanced SGE functionalities (which includes .sge_request), please use qsub directly instead of jsub.
[10:46:02] Labs, Tool-Labs: puppet disabled on tools-exec-1410 - https://phabricator.wikimedia.org/T145274#2625330 (valhallasw)
[10:46:13] Labs, Tool-Labs: puppet disabled on tools-exec-1410 - https://phabricator.wikimedia.org/T145274#2625343 (valhallasw)
[10:50:24] Labs, Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#2625344 (valhallasw) Sorry for the late response -- done!
[11:04:41] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[11:08:37] (CR) Lokal Profil: Replace TestFillTableMonumentsBase by CustomAssertions (2 comments) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/302887 (owner: Lokal Profil)
[11:45:07] Labs, Tool-Labs: jsub should respect .sge_request - https://phabricator.wikimedia.org/T145269#2625375 (whym) If jsub cannot (or does not want to) support .sge_request, maybe it should entirely ignore .sge_request? The current state is that some values from .sge_request are overwritten while others (thos...
[12:04:20] Labs, Tool-Labs: jsub should respect .sge_request - https://phabricator.wikimedia.org/T145269#2625380 (valhallasw) Jsub internally calls qsub, which is why .sge_request et al are taken into account. I'm not sure if we can make qsub ignore .sge_request. Besides, ``` valhallasw@tools-bastion-02:~$ sudo ls...
[12:34:12] Labs, Wikimedia-Extension-setup, wikitech.wikimedia.org, I18n, and 2 others: Install Translate extension on wikitech - https://phabricator.wikimedia.org/T100313#2625393 (Peachey88) What level of translation services do we want on wikitech? * full sectional level translation? ** aka the full E:Tr...
[13:24:14] !log librarybase librarybase-reston-01 looks out of space
[13:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Librarybase/SAL, Master
[13:28:25] !log librarybase 'cp -R /var/lib/mysql/ /srv/mysql/' on reston-01
[13:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Librarybase/SAL, Master
[13:41:43] !log librarybase librarybase-reston-01:/var/www/html/rdf# mv *.rdf /srv/
[13:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Librarybase/SAL, Master
[13:45:07] !log librarybase reboot librarybase-reston-01
[13:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Librarybase/SAL, Master
[13:47:11] !log librarybase back up :)
[13:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Librarybase/SAL, Master
[14:38:56] !log librarybase install npm on sparql instance
[14:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Librarybase/SAL, Master
[19:10:58] !log librarybase everything is back up and running..! :D
[19:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Librarybase/SAL, Master
[20:10:27] Labs, Labs-Team-Backlog, Tool-Labs, Mail: Set up A-based SPF for tools.wmflabs.org - https://phabricator.wikimedia.org/T104733#1425752 (AlexMonk-WMF) Note LDAP is now irrelevant here as we're using Designate instead. Using Designate, projectadmins should be able to edit DNS records like this whe...
[20:22:00] Labs, Mail: failed exim service on labs instances - https://phabricator.wikimedia.org/T135033#2626031 (AlexMonk-WMF) Open>Resolved a:Andrew When this ticket was created, the latest jessie image would've been what is now called `debian-8.3-jessie (deprecated 2016-06-13)` (created 2016-02-16) T...
[21:13:54] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[21:40:57] PAWS, MediaWiki-extensions-OAuth, Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2626171 (Framawiki) Hi, I have tested now with "framawiki" and it still not work. "framabot" looks ok. I have save all files. @yuvipanda Can you try to reset any content in your server fo...
[21:53:53] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:57:07] PROBLEM - Puppet run on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:58:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:02:37] PROBLEM - Puppet run on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:07:14] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:07:27] Labs, Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2626218 (Kolossos) I reduced the volume to the minimum of 150 GB by deleting old files manually. With the often broken dump files I have no better strategy. As I still have the problem with sort...
[22:09:24] PROBLEM - Puppet run on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:18:51] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:19:23] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:23:17] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:23:47] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[22:27:14] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]