[00:12:59] 10PAWS: Increase file upload size limit - https://phabricator.wikimedia.org/T144146#2589807 (10Capt_Swing) [00:30:51] PROBLEM - Free space - all mounts on tools-static-01 is CRITICAL: CRITICAL: tools.tools-static-01.diskspace._srv.byte_percentfree (<100.00%) [00:37:51] (03PS1) 10BryanDavis: Phabricator repo lookup is a "contains" not exact match [labs/striker] - 10https://gerrit.wikimedia.org/r/307229 (https://phabricator.wikimedia.org/T144139) [00:37:53] (03PS1) 10BryanDavis: Only display repo urls that are visible in Phabricator [labs/striker] - 10https://gerrit.wikimedia.org/r/307230 (https://phabricator.wikimedia.org/T143957) [00:37:55] (03PS1) 10BryanDavis: Mark http repo URLs as hidden by default [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) [00:37:57] (03PS1) 10BryanDavis: Add link to diffusion repo on detail page [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) [00:40:22] (03PS2) 10BryanDavis: Catch and log database errors while saving models [labs/striker] - 10https://gerrit.wikimedia.org/r/307110 (https://phabricator.wikimedia.org/T144082) [00:41:07] (03CR) 10BryanDavis: [V: 032] Catch and log database errors while saving models [labs/striker] - 10https://gerrit.wikimedia.org/r/307110 (https://phabricator.wikimedia.org/T144082) (owner: 10BryanDavis) [00:42:23] (03CR) 10BryanDavis: [C: 032] Catch and log database errors while saving models [labs/striker] - 10https://gerrit.wikimedia.org/r/307110 (https://phabricator.wikimedia.org/T144082) (owner: 10BryanDavis) [00:45:02] (03Merged) 10jenkins-bot: Catch and log database errors while saving models [labs/striker] - 10https://gerrit.wikimedia.org/r/307110 (https://phabricator.wikimedia.org/T144082) (owner: 10BryanDavis) [00:45:28] (03CR) 10Alex Monk: [C: 032] Update site.css [labs/striker/staticfiles] - 10https://gerrit.wikimedia.org/r/307136 (https://phabricator.wikimedia.org/T143972) (owner: 10BryanDavis) [00:45:34] (03Merged) 10jenkins-bot: Update site.css [labs/striker/staticfiles] - 10https://gerrit.wikimedia.org/r/307136 (https://phabricator.wikimedia.org/T143972) (owner: 10BryanDavis) [00:49:26] (03PS2) 10BryanDavis: Only display repo urls that are visible in Phabricator [labs/striker] - 10https://gerrit.wikimedia.org/r/307230 (https://phabricator.wikimedia.org/T143957) [00:49:28] (03PS2) 10BryanDavis: Mark http repo URLs as hidden by default [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) [00:49:30] (03PS2) 10BryanDavis: Phabricator repo lookup is a "contains" not exact match [labs/striker] - 10https://gerrit.wikimedia.org/r/307229 (https://phabricator.wikimedia.org/T144139) [00:49:32] (03PS2) 10BryanDavis: Add link to diffusion repo on detail page [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) [00:59:33] (03PS2) 10BryanDavis: Install/upgrade via wheels rather than complete venv reload [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307085 [01:00:03] (03CR) 10BryanDavis: [C: 032] Install/upgrade via wheels rather than complete venv reload [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307085 (owner: 10BryanDavis) [01:00:09] (03Merged) 10jenkins-bot: Install/upgrade via wheels rather than complete venv reload [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307085 (owner: 10BryanDavis) [01:02:42] (03CR) 10Alex Monk: [C: 032] Phabricator repo lookup is a "contains" not exact match 
[labs/striker] - 10https://gerrit.wikimedia.org/r/307229 (https://phabricator.wikimedia.org/T144139) (owner: 10BryanDavis) [01:03:46] (03Merged) 10jenkins-bot: Phabricator repo lookup is a "contains" not exact match [labs/striker] - 10https://gerrit.wikimedia.org/r/307229 (https://phabricator.wikimedia.org/T144139) (owner: 10BryanDavis) [01:07:51] (03CR) 10Alex Monk: [C: 032] Only display repo urls that are visible in Phabricator [labs/striker] - 10https://gerrit.wikimedia.org/r/307230 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [01:10:31] (03Merged) 10jenkins-bot: Only display repo urls that are visible in Phabricator [labs/striker] - 10https://gerrit.wikimedia.org/r/307230 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [01:10:57] you're a machine Krenair! Thanks :) [01:13:00] thanks bd808. I'll look at the rest later [01:14:39] bd808, can you add me as a reviewer so I don't forget? [01:21:30] PROBLEM - Free space - all mounts on tools-web-static-02 is CRITICAL: CRITICAL: tools.tools-web-static-02.diskspace._srv.byte_percentfree (<40.00%) [01:25:29] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2589888 (10Tgr) @yuvipanda, but what are those credentials? Sharing the client_id/access_key should be safe and it's hard to debug the problem without that. [01:36:37] (03CR) 10BryanDavis: "cherry-picked to https://striker.wmflabs.org/ for verification." [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [01:36:54] (03CR) 10BryanDavis: "cherry-picked to https://striker.wmflabs.org/ for verification" [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) (owner: 10BryanDavis) [01:49:29] PROBLEM - Free space - all mounts on tools-web-static-01 is CRITICAL: CRITICAL: tools.tools-web-static-01.diskspace._srv.byte_percentfree (<40.00%) [01:55:46] 06Labs, 10Phabricator, 07Puppet: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#2589893 (10Peachey88) [02:15:12] (03PS1) 10BryanDavis: Add django_log_request_id wheel [labs/striker/wheels] - 10https://gerrit.wikimedia.org/r/307233 (https://phabricator.wikimedia.org/T143949) [02:17:28] (03CR) 10BryanDavis: [C: 032] Add django_log_request_id wheel [labs/striker/wheels] - 10https://gerrit.wikimedia.org/r/307233 (https://phabricator.wikimedia.org/T143949) (owner: 10BryanDavis) [02:17:34] (03Merged) 10jenkins-bot: Add django_log_request_id wheel [labs/striker/wheels] - 10https://gerrit.wikimedia.org/r/307233 (https://phabricator.wikimedia.org/T143949) (owner: 10BryanDavis) [02:23:39] (03PS1) 10BryanDavis: Bump submodules: static, striker, wheels [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307234 [02:25:11] (03CR) 10BryanDavis: [C: 032] Bump submodules: static, striker, wheels [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307234 (owner: 10BryanDavis) [02:25:17] (03Merged) 10jenkins-bot: Bump submodules: static, striker, wheels [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307234 (owner: 10BryanDavis) [04:52:38] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2589908 (10yuvipanda) ah, sure. 0a73e346a40b07262b6e36bdba01cba4 is the client_id. 
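The "Install/upgrade via wheels rather than complete venv reload" change above follows the usual pip wheel pattern; a minimal sketch of that workflow, with the paths and requirements file as illustrative placeholders rather than the actual labs/striker/deploy layout:
```
# Build wheels once from the pinned requirements (done ahead of time, e.g. in a wheels repo).
pip wheel --wheel-dir=wheels/ -r requirements.txt

# At deploy time, install/upgrade from the pre-built wheels, without touching PyPI
# and without recreating the whole virtualenv.
venv/bin/pip install --no-index --find-links=wheels/ --upgrade -r requirements.txt
```
Installing from local wheels skips PyPI and C-extension rebuilds on each deploy, which appears to be the point of the change.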
[04:54:34] 06Labs, 10Beta-Cluster-Infrastructure: puppet::self hosts now have two servers set - https://phabricator.wikimedia.org/T144108#2589909 (10yuvipanda) @mmodell https://phabricator.wikimedia.org/T120159 [04:55:51] 06Labs, 10Beta-Cluster-Infrastructure: puppet::self hosts now have two servers set - https://phabricator.wikimedia.org/T144108#2589910 (10yuvipanda) role::puppet::self for puppet *clients* is doubly terrible. I'll spend next week getting rid of that across labs - see https://phabricator.wikimedia.org/T120159#2... [05:40:28] 06Labs: Clean up data in /data/scratch/mwoffliner - https://phabricator.wikimedia.org/T144025#2589926 (10Kelson) This directory works as a cache. This is why it's pretty big. One time T117095 is fully implemented, the big part of it will be removed. I plan to work on this during September. [05:46:08] 10PAWS: Increase file upload size limit - https://phabricator.wikimedia.org/T144146#2589931 (10yuvipanda) I'll look into this! Right now, you can put it somewhere else and 'wget' it from 'new -> terminal' in PAWS. I'll increase the file limit as well, but that can go only so far. I might also enable scp access... [06:27:18] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2589978 (10Tgr) The access token is `7cb5b315a11d0dcbe46d1c90332dd210` for Dvorapa (timestamp: 20160827125759) and `0d64b706435230a05213f605ff1ad8ac` for Framawiki (20160829061415). Does th... [06:59:52] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2589990 (10yuvipanda) Hmm, nope that doesn't match. [07:00:45] 10PAWS: Disable password based login on pwb on PAWS - https://phabricator.wikimedia.org/T144151#2589991 (10yuvipanda) [07:01:54] 10PAWS: Disable password based login on pwb on PAWS - https://phabricator.wikimedia.org/T144151#2589991 (10yuvipanda) (I've found and deleted a couple of these) [07:03:17] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2590006 (10yuvipanda) Also, @Dvorapa @Framawiki please do not use password based login. This is highly insecure - I've filed T144151 to disable it on PAWS. In the meantime, you two should p... [07:04:14] 10PAWS, 10Jupyter-Hub: I can't login my bot in JUPYTER - https://phabricator.wikimedia.org/T135306#2590009 (10yuvipanda) @Maathavan I think I've fixed this now finally. Can you try again? [07:30:57] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Move kubernetes authentication to using X.509 client certs - https://phabricator.wikimedia.org/T144153#2590052 (10yuvipanda) [07:50:15] 10PAWS: Implement a 'signing OAuth Proxy' for PAWS - https://phabricator.wikimedia.org/T120469#2590093 (10Tgr) See [[https://www.sans.org/reading-room/whitepapers/application/attacks-oauth-secure-oauth-implementation-33644|this article]] (section 2) for the threat model. In short, # a hostile user can send reque... 
[08:06:51] (03CR) 10Alex Monk: [C: 032] Mark http repo URLs as hidden by default [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [08:07:31] (03Merged) 10jenkins-bot: Mark http repo URLs as hidden by default [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [08:10:06] (03CR) 10Alex Monk: [C: 032] Add link to diffusion repo on detail page [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) (owner: 10BryanDavis) [08:10:46] (03Merged) 10jenkins-bot: Add link to diffusion repo on detail page [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) (owner: 10BryanDavis) [08:55:34] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2590203 (10Urbanecm) @yuvipanda Okay, so sorry for my advice :). I wanted to give them access to the service somehow and not reveal them passwords :). BTW, when they run chmod 600 on lwp fi... [09:01:30] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2590225 (10Urbanecm) I filled T144157 for set up PWB as I described. [09:32:06] 10Tool-Labs-tools-Pageviews, 10Analytics, 10Pageviews-API: siteviews data for 2016 August 27 appears to be empty - https://phabricator.wikimedia.org/T144159#2590263 (10Amire80) [09:48:20] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:38:07] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [10:53:56] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2590426 (10Framawiki) Ok, i understand. 
(@yuvipanda, I just send you an email) [11:08:03] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:40:56] PROBLEM - Puppet staleness on tools-exec-1211 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [11:41:54] (03CR) 10Hashar: [C: 031] Add gerrit project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307071 (owner: 10Paladox) [12:04:05] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:39:04] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:56:17] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:58:09] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590777 (10hashar) [12:59:25] if anyone is lurking around, we could use a restart of tool.jouncebot https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [12:59:30] it is idling / dead apparently :( [13:05:04] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [13:30:26] 06Labs, 10Tool-Labs, 10DBA: u13367 is running 2 inefficient 9-day-long queries, causing high cpu usage - https://phabricator.wikimedia.org/T144180#2590904 (10jcrespo) [13:35:33] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590935 (10chasemp) p:05Triage>03Normal 18 members of https://phabricator.wikimedia.org/project/members/20/, some of whom I don't recognize. Please provide... [13:36:17] RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0] [13:37:11] 06Labs, 15User-Luke081515: Revert: Request increased quota for rcm labs project - https://phabricator.wikimedia.org/T142311#2590951 (10chasemp) 05Open>03Resolved [13:37:13] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2590952 (10chasemp) [13:37:54] 06Labs, 10Tool-Labs, 10DBA: u13367 is running 2 inefficient 9-day-long queries, causing high cpu usage - https://phabricator.wikimedia.org/T144180#2590954 (10Urbanecm) I'm sorry for it. I'm trying to update list of 500 most linked disambigs at cswiki. I have some script but I want to convert it to one SQL qu... [13:39:38] 06Labs, 10Tool-Labs, 10DBA: u13367 is running 2 inefficient 9-day-long queries, causing high cpu usage - https://phabricator.wikimedia.org/T144180#2590958 (10Urbanecm) ATM all my processes including any mysql consoles should be killed, no job should be running from my personal account, from my two tools only... [13:40:41] !log tools restart jouncebot [13:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:48:21] 06Labs, 13Patch-For-Review: Kill ldapsupportlib.py - https://phabricator.wikimedia.org/T114063#1683588 (10hashar) Just found that `ldaplist` is scheduled for deletion. I am still relying on it because its syntax is quite trivial. If I want to lookup my LDAP informations I just: ldaplist -l passwd hashar W... 
[13:50:44] 06Labs, 10Tool-Labs, 10DBA, 07Tracking: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#2590976 (10jcrespo) [13:50:46] 06Labs, 10Tool-Labs, 10DBA, 15User-Urbanecm: u13367 is running 2 inefficient 9-day-long queries, causing high cpu usage - https://phabricator.wikimedia.org/T144180#2590973 (10jcrespo) 05Open>03Resolved a:03Urbanecm Yes, I killed the queries. Normally the query killer should limit these, but for some... [13:52:06] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590981 (10hashar) [13:52:32] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590777 (10hashar) My bad sorry. Edited with list of each of our labs shell accounts: ``` dduvall demon gjg hashar twentyafterfour thcipriani zfilipin ``` [13:54:28] o/ [13:54:37] I've got an instance where I can't access "/srv [13:54:43] I'm trying to figure out what's going on [13:54:53] ores-compute-01.ores.eqiad.wmflabs [13:55:59] "sysfs" [13:56:28] halfak: I can't seem to get on that atm [13:56:55] chasemp, Can't log in or can't access /sys on the machine? [13:57:05] can't login [13:57:15] Hmm... I'm definitely able to log in [14:01:29] chasemp, any suggestions? [14:01:33] Maybe just reboot? [14:01:54] halfak: I'm not sure it hangs on me, I can reboot for you if you'd like [14:02:13] Sure if you've got it handy [14:03:12] !log ores reboot 94886e74-5be4-4669-a1c1-840ce7c65de9 ores-compute-01 [14:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [14:03:50] thanks charitwo [14:03:56] * chasemp [14:04:11] * halfak needs to type more characters before wildly hitting [TAB] [14:05:14] I'm an enigma of tab completion [14:05:41] * halfak tries to log back in. [14:10:04] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:11:36] Yay! It works! [14:15:36] 06Labs, 06Operations: Connect secondary nic for labstore1004 and labstore1005 - https://phabricator.wikimedia.org/T144183#2591023 (10chasemp) [14:37:49] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2591094 (10chasemp) It's possibly early morning fugue state but I don't see: demon (I do see chad) twentyafterfour gjg I added (to tools if necessary as well):... [14:49:55] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590777 (10Paladox) @chasemp that would be ^demon [15:01:07] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:03:33] !log tools.jouncebot Stopped & started bot to try and get it back in #wikimedia-operations [15:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.jouncebot/SAL, Master [15:04:23] bd808: did that work? 
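For a hang like the /srv one halfak reports above, a bounded stat plus a look at the mount table usually separates a wedged NFS mount from a permissions problem; a generic sketch, not the commands actually run during this incident:
```
# Does the suspect path answer within a few seconds, or does it hang?
timeout 5 stat /srv && echo "/srv responds" || echo "/srv appears hung"

# What is mounted there, and is anything NFS-backed?
grep -E '/srv|nfs' /proc/mounts

# Recent kernel messages often show "nfs: server ... not responding" for a dead mount.
dmesg | tail -n 20
```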
[15:04:42] nope :( [15:05:13] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [15:05:24] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [15:05:36] the last thing it logs is "2016-08-29 15:02:13,417 - INFO - Attempting to join channel #wikimedia-operations" [15:05:43] but no joy on the join [15:06:24] bd808: could it be part of the anti bot / spam stuff that has been ongoing there? [15:06:38] I really don't know but every time I look there is a new strategy [15:06:58] maybe, but I'd sort of expect it to log if it was klined or something [15:07:36] I'll poke at it a bit. I think it needs a rebuild of the venv too. There are some libxml warnings [15:11:06] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218) [15:12:17] 10Tool-Labs-tools-Other, 10Deployment-Systems: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591190 (10bd808) [15:20:55] 10Tool-Labs-tools-Other, 10Deployment-Systems: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591190 (10Paladox) Could it be someone quieted the ip of stashbot? [15:37:57] !unban yuvipanda [15:38:11] 06Labs, 10Tool-Labs, 07Wikimedia-Incident: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638#2591275 (10chasemp) we agreed to make a check for the proxy health itself today, I'll get added to the PAWS check and we'll iterate on this from there. [15:38:12] !log tools.jouncebot Cherry-picked https://gerrit.wikimedia.org/r/#/c/307315/ and restarted bot [15:38:15] 10Tool-Labs-tools-Pageviews, 10Analytics, 10Pageviews-API: siteviews data for 2016 August 27 appears to be empty - https://phabricator.wikimedia.org/T144159#2590263 (10Milimetric) A job failed, is being rerun now, so things should be in order soon. We'll track and close this when it's resolved. Thanks for... [15:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.jouncebot/SAL, Master [15:38:53] 06Labs, 10Tool-Labs, 13Patch-For-Review, 07Wikimedia-Incident: Tune nginx config parameters for tools / labs proxies - https://phabricator.wikimedia.org/T143637#2591278 (10chasemp) Stauts from labs meeting: we are porting static to jessie to work out needed tuning [15:38:58] 10Tool-Labs-tools-Pageviews, 10Analytics, 06Analytics-Kanban, 10Pageviews-API: siteviews data for 2016 August 27 appears to be empty - https://phabricator.wikimedia.org/T144159#2591279 (10Milimetric) p:05Triage>03Normal [15:40:42] 10Tool-Labs-tools-Pageviews, 06Analytics-Kanban: siteviews data for 2016 August 27 appears to be empty - https://phabricator.wikimedia.org/T144159#2591287 (10Milimetric) [15:46:53] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2579978 (10Andrew) +1 this is fine [15:52:18] 10Tool-Labs-tools-Other, 10Deployment-Systems, 13Patch-For-Review: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591368 (10bd808) I patched in some additional logging, but am not seeing any clear reason why things aren't working yet: ``` 2016-08-29 15:47:58,387 - IN... 
[15:57:40] 10Tool-Labs-tools-Other, 10Deployment-Systems, 13Patch-For-Review: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591407 (10bd808) ``` [09:56] jouncebot has userhost tools.joun@instance-tools-exec-1404.tools.wmflabs.org and real name “https://github.com/mattofak/joun... [16:11:46] 06Labs, 10Phabricator, 07Puppet: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#2591508 (10mmodell) related {T131899} [16:14:15] 10Tool-Labs-tools-Other, 10Deployment-Systems, 13Patch-For-Review: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591190 (10Platonides) Fixed by temporarily removing the inheritance from #wikimedia-bans Most probably, it was affected by the “ban everyone not register... [16:16:46] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2591567 (10chasemp) [16:16:49] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2591564 (10chasemp) 05Open>03Resolved a:03chasemp Should be gtg [16:24:12] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2591683 (10Paladox) @chasemp thanks and what do you mean by gtg? [16:25:10] 10Tool-Labs-tools-Other, 10Deployment-Systems, 13Patch-For-Review: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591699 (10bd808) >>! In T144189#2591536, @Platonides wrote: > Fixed by temporarily removing the inheritance from #wikimedia-bans Most probably, it was af... [16:29:02] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2591726 (10chasemp) good to go :) [16:32:26] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2591769 (10Paladox) Ah thanks :) [16:35:44] !log tools run chmod u+x /data/project/framabot [16:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [16:41:03] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:46:26] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2591854 (10yuvipanda) I moved striker, and @bd808 reports it is all good. the instance with this issue in the shinken proje... [16:52:05] 06Labs, 10Labs-project-Extdist: http://extdist.wmflabs.org/ 502's (Bad Gateway) - https://phabricator.wikimedia.org/T143209#2591874 (10chasemp) 05Open>03Resolved a:03chasemp this seems back now [16:53:57] 06Labs, 10Tool-Labs: Maintainers are not shown in the Tools list - https://phabricator.wikimedia.org/T142684#2591891 (10chasemp) 05Open>03Resolved a:03chasemp Please look at https://toolsadmin.wikimedia.org/. I think this is resolved there and should be considered canonical. [16:55:37] 06Labs: Can not kill job on tools labs - https://phabricator.wikimedia.org/T138924#2414164 (10chasemp) @Magnus, is this still the case? can you add some more details so we can look into it? 
[17:03:17] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22) [17:07:03] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:13:51] 06Labs, 10Tool-Labs: Setup monitoring for cdnjs git pull - https://phabricator.wikimedia.org/T144215#2591965 (10madhuvishy) [17:22:28] (03PS1) 10BryanDavis: Bump striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307334 [17:22:47] (03CR) 10BryanDavis: [C: 032] Bump striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307334 (owner: 10BryanDavis) [17:22:54] (03Merged) 10jenkins-bot: Bump striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307334 (owner: 10BryanDavis) [17:30:50] RECOVERY - Free space - all mounts on tools-static-01 is OK: OK: All targets OK [17:32:23] RECOVERY - Free space - all mounts on tools-static-02 is OK: OK: All targets OK [17:32:25] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170) [17:35:33] PROBLEM - Puppet run on tools-static-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:45:40] PROBLEM - Puppet run on tools-web-static-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:01:51] PROBLEM - Free space - all mounts on tools-static-01 is CRITICAL: CRITICAL: tools.tools-static-01.diskspace._srv.byte_percentfree (<55.56%) [18:06:23] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2592177 (10yuvipanda) servermon is done. I'll do the analytics ones next. [18:10:34] RECOVERY - Puppet run on tools-static-01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:12:04] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:13:22] PROBLEM - Free space - all mounts on tools-static-02 is CRITICAL: CRITICAL: tools.tools-static-02.diskspace._srv.byte_percentfree (<60.00%) [18:17:33] !log analytics kill ldap entry for analytics303, doesn't exist [18:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Analytics/SAL, Master [18:19:00] 06Labs, 10Phabricator, 13Patch-For-Review, 07Puppet: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#2592218 (10Paladox) [18:19:05] 06Labs, 10Phabricator, 07Puppet: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2592220 (10Paladox) [18:19:32] 06Labs, 10Phabricator, 13Patch-For-Review, 07Puppet: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#2588887 (10Paladox) We will fix production role first and after that remove the labs role once production role works in labs :) [18:23:33] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2592244 (10yuvipanda) The analytics project doesn't actually seem to have any! These were just stale LDAP entries for insta... 
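For the "Can not kill job on tools labs" task pinged above, the normal kill path on the grid engine setup of the time goes through the jsub wrappers or qdel; a generic sketch, with the tool and job names as placeholders:
```
become mytool        # switch to the tool account (placeholder name)
qstat                # list the tool's grid jobs and their numeric job IDs
jstop myjob          # stop a named continuous job via the wrapper
qdel 1234567         # or delete a specific job by ID
qdel -f 1234567      # force-delete if the job is stuck in a 'dr' (deleting) state
```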
[18:30:40] RECOVERY - Puppet run on tools-web-static-01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:48:19] !log toolsbeta reboot toolsbeta-puppetmaster3, puppet run process became Zommmmbiiiieeee, ate all my brains [18:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL, Master [18:52:34] (03CR) 1020after4: [C: 032] Add gerrit project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307071 (owner: 10Paladox) [18:57:36] (03Merged) 10jenkins-bot: Add gerrit project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307071 (owner: 10Paladox) [19:14:51] 06Labs, 10Phabricator, 07Puppet: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2592521 (10Paladox) 05duplicate>03Open Re opening for now since the main role will take a while to fix. [19:18:50] !log toolsbeta reboot toolsbeta-mail, seems, uh, stuck [19:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL, Master [19:20:24] !log toolsbeta reboot toolsbeta-master, seems, uh, stuck [19:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL, Master [19:32:41] Platonides: Do you know what the current open source ipsec based cross platform vpn client of choice is? [19:33:05] ipsec? [19:33:26] my first thought was openvpn, but that's userland, not ipsec [19:33:32] isn't that part of the kernel? [19:34:21] Microsoft has crap support for vpn's [19:34:33] Linux Kernel ≥2.5.47 has a native IPsec stack [19:34:53] well, it's Microsoft… ;) [19:34:58] Maybe Windows10 is better.... [19:36:24] Windows has "easy PPTP" [19:36:41] I've used Cisco client quite a lot (not free) or https://www.shrew.net/download/vpn (seemed to have stopped). [19:36:42] however, that doesn't mean it's secure… https://www.schneier.com/academic/pptp/ [19:36:54] PPTP is not secure at all [19:36:54] Cisco uses ipsec? [19:37:12] Yeah, Cisco ASA is all about ipsec [19:38:28] I'm not a big fan of the SSL based VPN because every company has a different implementation. [19:40:28] * bd808 liks ssh tunnels as vpn method, but is weird like that [19:40:37] multichill: I think the built-in client in win10 (and win7?) 
supports ipsec, but I'm not 100% sure [19:41:02] 10PAWS, 10Discussion-modeling: Detox on Paws - https://phabricator.wikimedia.org/T144234#2592591 (10ellery) [19:41:13] xauth is missing in the win7 client afaik [19:41:27] actually, with a good ssh client, their tunnels are great [19:53:36] !log tools.stashbot Updated to 04e6f98 (If authenticating, pause to let auth actually work) && added authentication as registered stashbot freenode account [19:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL, Master [19:56:00] 10PAWS, 10Discussion-modeling: Detox notebooks on PAWS - https://phabricator.wikimedia.org/T144234#2592641 (10DarTar) [19:56:39] 10PAWS, 06Research-and-Data-Backlog, 07Epic, 03Research-and-Data-2017-Q1: Launch Open Notebooks Infrastructure - https://phabricator.wikimedia.org/T140430#2592644 (10DarTar) [19:56:41] 10PAWS, 10Discussion-modeling: Detox notebooks on PAWS - https://phabricator.wikimedia.org/T144234#2592591 (10DarTar) [19:59:17] 10PAWS, 06Research-and-Data-Backlog, 07Epic, 03Research-and-Data-2017-Q1: Launch Open Notebooks Infrastructure - https://phabricator.wikimedia.org/T140430#2592655 (10DarTar) @Capt_Swing if you have a few cycles to help with this, it'd be awesome. It would also be worth coordinating doc work with the (separ... [20:09:34] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:11:00] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:11:45] PROBLEM - Free space - all mounts on tools-puppetmaster-01 is CRITICAL: CRITICAL: tools.tools-puppetmaster-01.diskspace._public_dumps.byte_percentfree (No valid datapoints found)tools.tools-puppetmaster-01.diskspace.root.byte_percentfree (<10.00%) [20:12:25] waaat [20:12:27] PROBLEM - Puppet run on tools-exec-1214 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:12:29] PROBLEM - Puppet run on tools-prometheus-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:12:31] also waaat, /public/dumps?! [20:12:32] uh oh [20:12:37] PROBLEM - Puppet run on tools-exec-1406 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:12:38] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:13:04] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:16:13] looks like puppet did fail, but was transient [20:16:59] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2592701 (10yuvipanda) did toolsbeta, or at least the toolsbeta instances that are still sshable. Many didn't come back up a... [20:21:44] RECOVERY - Free space - all mounts on tools-puppetmaster-01 is OK: OK: tools.tools-puppetmaster-01.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [20:22:01] well done, shinken-wm [20:22:06] ^ yuvipanda it probably makes sense not to check NFS on every run there yeah? 
[20:22:18] it's never going to deviate from on the server [20:22:26] but the check itself is load etc [20:22:38] not too worrysome tho [20:22:39] chasemp I have no idea why it's checking NFS tho [20:22:42] it used to not [20:22:47] ah even more interesting [20:22:52] rather, there was no graphite metric for the NFS mounts at all [20:23:01] so I was surprised when the alert popped up [20:23:10] huh.... [20:23:18] the alert just has a * in there, so if there's nfs mount stats there, it'll alert [20:23:33] but the diamond collector wasn't sending nfs mount stats when I first set it up, and I've never seen an alert like this before [20:23:38] ah [20:32:37] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:47:29] RECOVERY - Puppet run on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:48:01] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:49:33] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [20:50:59] RECOVERY - Puppet run on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [20:52:25] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0] [20:52:37] RECOVERY - Puppet run on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [20:55:12] bd808, SSH tunnels are good, but they're bad because you have to connect to localhost for the tunnel. [20:55:52] So if something is configured to connect to a specific instance name.. e.g. a Javascript web interface [20:56:06] Lots of head scratching. :( [21:00:25] tom29739: you could use -D [21:00:35] that removes flexibility [21:00:42] but a vpn route does, too [21:00:55] I use Windows with Putty, [21:01:04] I don't think it has that option :/ [21:01:18] Nor does it have an easy ProxyCommand. [21:01:25] * Platonides quotes himself: "with a good ssh client,…" [21:02:04] SSH tunnels are sometimes called "a poor man's VPN" [21:02:11] 10Tool-Labs-tools-Pageviews: Add "subpages" as a source to Massviews - https://phabricator.wikimedia.org/T144238#2592817 (10MusikAnimal) [21:02:21] Why use an SSH tunnel if you could use a VPN? [21:02:53] tunnel-stacking? [21:03:14] It's slow on here. [21:08:12] Platonides, Linux users have it easy :/ [21:10:52] * Platonides replaces tom29739's windows with Ubuntu [21:11:10] Oi! Windows is good for some things! [21:11:23] Like MS Office for instance. [21:42:24] o/ yuvipanda. I'm having a PAWS issue. My kernel keeps getting restarted. Do you know why that's happening? [21:46:44] I guess I'll just dev locally for a while [21:52:28] hey halfak [21:52:32] taking a look [21:53:08] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2592987 (10AlexMonk-WMF) 05Open>03Resolved a:03chasemp @chasemp, if you added through wikitech then it's possible you were trying to use the provided uids... [21:53:24] halfak I started a new terminal, and it seems to work fine. [21:53:29] which notebook was having issues for you? [21:53:50] yuvipanda, it happens in a loop at the end of this notebook:http://paws-public.wmflabs.org/paws-public/User:EpochFail/projects/vectors_demo/damage_detection_test.ipynb [21:53:57] I turned the loop into a batch just in case [21:55:02] Woops. Looks like I made an edit that broke it. Fixing now [21:55:35] fixed [21:55:59] The loop is running again. 
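The -D and ProxyCommand options discussed above look like this with OpenSSH; a sketch, with the bastion host and user name as placeholders (PuTTY exposes the same SOCKS forwarding as a "Dynamic" tunnel under Connection > SSH > Tunnels):
```
# Dynamic (SOCKS) forwarding: applications then use localhost:1080 as a SOCKS5 proxy
# and can reach instances by name without a per-host tunnel.
ssh -D 1080 -N user@bastion.wmflabs.org

# ~/.ssh/config equivalent using ProxyCommand to jump through the bastion transparently:
#   Host *.eqiad.wmflabs
#       ProxyCommand ssh -W %h:%p user@bastion.wmflabs.org
```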
Usually, I get through about 300 (out of ~20k) before it crashes [21:56:04] halfak ok! :) I see it running, and I see it is ramping up in memory use [21:56:05] I'll get the exact error message this time [21:56:18] but maybe not, I see it winding down as well [21:56:18] Ohhh.. That could totally be the issue. [21:56:20] let's see [21:56:28] What's my max memory? [21:56:48] halfak 1G I think. [21:56:56] I see. That could be the problem. [21:57:27] I'll definitely need a lot of memory for this maneuver [21:57:31] yeah. I should probably have a mechanism for user groups where people have higher limits. [21:57:40] how much is 'lot of memory'? [21:57:44] I'm guessing ~6GB [21:57:54] Looks like I'm not doing too bad though. [21:58:00] 300-400 MB [21:58:08] yeah [21:58:09] so far so good [21:58:18] has it crashed yet? [21:58:22] Nope [21:58:24] Maybe I hit a degenerate revision eventually [21:58:30] Up to 550 MB [21:58:35] back to 400MB [21:59:14] 650MB [21:59:19] 460MB [21:59:21] lol [22:01:32] (I crashed my browser trying to get a dashboard) [22:01:36] halfak has it crashed yet? [22:01:52] nope. Might have made it longer than last time. [22:02:19] I just saved to give you more stderr dots :) [22:04:57] Saved again [22:05:03] Still going. Weird. [22:06:56] (03Draft1) 10Paladox: Add GerritBot project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307439 [22:09:32] (03PS2) 10Paladox: Add GerritBot project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307439 [22:10:56] the free internet I am leeching off decided to crap out just now. whee [22:11:35] Still not crashed yuvipanda [22:11:43] -._o_.- [22:12:12] 908MB [22:12:16] 580MB [22:12:22] right [22:12:29] so if it crossed 1023 I think it'll crash [22:12:55] Strange. there should be determinism in this execution. [22:13:04] Oh well. I have a script running locally now [22:13:10] So my work will be saved either way. [22:15:17] halfak I'm looking at fresh new https://grafana-labs-admin.wikimedia.org/dashboard/db/paws [22:15:26] and I think you hit 1023 once [22:15:33] (everything is duplicated, need to fix that) [22:15:38] this is sampled only ever minute [22:15:42] so won't catch everything [22:15:55] gotcha [22:16:16] Maybe I can make it [22:16:21] Quit looking at me watchdog! [22:16:28] :D [22:16:35] I only want a lot of memory for a second [22:16:39] I need to figure out how to properly make this apparent [22:16:52] right. that works well for CPU, not so much for memory tho [22:17:11] (with our current configuration, at least) [22:17:23] you can spike CPU when others aren't using it, but not memory [22:18:08] anyway, I gotta step out for a bit, I'll brb [22:18:16] Thanks for your help! [22:20:17] (03Draft1) 10Paladox: Add grrrit-wm project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307441 [22:20:59] (03PS2) 10Paladox: Add grrrit-wm project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307441 [22:33:44] 06Labs: Clean up data in /data/scratch/mwoffliner - https://phabricator.wikimedia.org/T144025#2593137 (10madhuvishy) @Kelson Since it's this big and is cache data, we weren't planning to migrate this on to the new scratch setup in labstore1003 (This switch happens 8/31). [22:36:12] STILL GOING! [22:36:14] MWAHAHAHA [22:36:22] It's going to crash right before it finishes. 
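The limit-watching yuvipanda is doing above can also be done from a PAWS terminal; a rough sketch, assuming the notebook kernel shows up as a python3 process and that the container exposes the usual cgroup v1 memory files (both are assumptions, and the paths may differ):
```
# Sample the kernel's resident memory (RSS, in KB) once a minute.
while true; do
    ps -o pid=,rss=,args= -C python3 | sort -k2 -n | tail -n 1
    sleep 60
done

# The container's memory ceiling, if the cgroup files are visible from inside:
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null
```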
[22:36:27] I know it [22:49:35] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:49:43] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:49:47] PROBLEM - Puppet run on tools-worker-1012 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:50:07] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:50:49] PROBLEM - Puppet run on tools-exec-1208 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:51:05] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:51:15] PROBLEM - Puppet run on tools-worker-1016 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:51:41] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:51:41] PROBLEM - Puppet run on tools-web-static-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:52:03] PROBLEM - Puppet run on tools-worker-1011 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:52:07] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1846901 (10AlexMonk-WMF) So it seems the list is now: - deployment-prep - integration - wikidata-query - etcd >>!... [22:52:15] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:52:19] PROBLEM - Puppet run on tools-worker-1008 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:52:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:52:51] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:52:53] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:53:17] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:53:21] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:53:29] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:53:58] PROBLEM - Puppet run on tools-exec-1215 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:54:02] PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:54:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:54:12] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:54:12] 06Labs: Purge stale data from LDAP - https://phabricator.wikimedia.org/T138150#2391674 (10AlexMonk-WMF) I ran into this thing recently: ```dn: dc=basic.puppet.node,ou=hosts,dc=wikimedia,dc=org objectClass: domainrelatedobject objectClass: dnsdomain objectClass: domain objectClass: puppetclient objectClass: dcobj... 
[22:54:26] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:54:28] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:54:31] 06Labs, 07LDAP: Document LDAP structure unambiguously - https://phabricator.wikimedia.org/T138151#2593231 (10AlexMonk-WMF) [22:54:38] PROBLEM - Puppet run on tools-worker-1021 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:54:41] 06Labs, 07LDAP: Purge stale data from LDAP - https://phabricator.wikimedia.org/T138150#2593232 (10AlexMonk-WMF) [22:54:48] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:54:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:54:54] PROBLEM - Puppet run on tools-worker-1006 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:54:58] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:55:12] PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:55:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:55:24] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:55:38] PROBLEM - Puppet run on tools-web-static-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:55:50] PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:55:52] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:56:00] PROBLEM - Puppet run on tools-checker-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:56:38] PROBLEM - Puppet run on tools-exec-1218 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [22:56:41] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:56:59] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:57:03] PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:57:17] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:57:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:58:04] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:58:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:58:55] PROBLEM - Puppet run on tools-worker-1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:58:59] PROBLEM - Puppet run on tools-exec-1219 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:59:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:59:15] PROBLEM - Puppet run on tools-worker-1004 is CRITICAL: CRITICAL: 
33.33% of data above the critical threshold [0.0] [22:59:23] PROBLEM - Puppet run on tools-worker-1022 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [22:59:27] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:00:03] PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:00:15] PROBLEM - Puppet run on tools-worker-1019 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:00:17] PROBLEM - Puppet run on tools-exec-1203 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:00:31] PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:00:39] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:00:45] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:00:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:01:13] PROBLEM - Puppet run on tools-worker-1015 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:01:19] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:01:39] PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:02:12] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [23:03:36] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:04:08] PROBLEM - Puppet run on tools-worker-1025 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:04:11] There seems to be puppet failures on tools ^^ [23:04:16] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:04:26] PROBLEM - Puppet run on tools-worker-1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:05:06] ssh to tools isn't working, I guess it's related [23:05:25] chasemp andrewbogott Krenair ^^ [23:05:27] Web also fails [23:05:34] Tools website is also down [23:05:37] i carnt access it [23:05:53] yuvipanda: about? 
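When ssh and the web proxy fail across tools at the same time, as above, shared NFS is the usual first suspect; a couple of generic client-side probes (the hostname and port are assumptions drawn from the later conversation, not commands that were actually run):
```
# Is the NFS port reachable at all from a client instance?
nc -z -w 5 labstore.svc.eqiad.wmnet 2049 && echo "nfs port reachable" || echo "nfs port unreachable/filtered"

# Does the shared filesystem still answer, or does it hang?
timeout 5 df -h /data/project || echo "NFS mount not responding"
```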
[23:06:34] PROBLEM - Puppet run on tools-static-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:06:37] chasemp i think he went out for a bit [23:06:49] anyway, I gotta step out for a bit, I'll brb [23:09:26] !log restart nginx on tools-proxy-01 [23:10:18] PROBLEM - Puppet run on tools-k8s-master-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:10:32] PROBLEM - Puppet run on tools-worker-1007 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:10:38] hey chasemp [23:10:38] PROBLEM - Puppet run on tools-worker-1013 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:10:59] looking [23:11:15] yuvipanda: I'm pretty confused atm on what's up [23:11:37] It seems a total failure [23:12:32] PROBLEM - Puppet run on tools-worker-1023 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:13:30] PROBLEM - Puppet run on tools-worker-1002 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:13:36] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:13:46] !log tools bumped up worker_connections, restarted nginx [23:14:02] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:14:33] chasemp hmm, I'm confused as well [23:15:06] it seems like tools only [23:15:13] everything is responding with 499 [23:15:17] which is a special nginx error code [23:15:21] what's everything? [23:15:21] ok [23:15:34] ugh, need to fix my permissions first I think [23:15:36] PROBLEM - Puppet run on tools-worker-1009 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:15:41] > HTTP 499 in Nginx means that the client closed the connection before the server answered the request. In my experience is usually caused by client side timeout. As I know it's an Nginx specific error code.Oct 19, 2012 [23:15:42] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:15:43] tools-proxy-01 is a Tool labs generic web proxy (role::labs::tools::proxy) [23:15:45] chasemp access.log [23:15:46] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:15:47] The last Puppet run was at Mon Aug 29 22:53:41 UTC 2016 (21 minutes ago). Puppet is disabled. reason not specified [23:15:52] /var/log/nginx/access.log [23:15:55] krenair that's me [23:16:10] not sure why everything is 499 :| [23:16:44] since error.log is fine [23:17:16] shall we move to the other proxy server? [23:17:30] PROBLEM - Puppet run on tools-worker-1017 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:17:36] 499... would be an upstream proxy problem maybe? [23:17:36] yeah, failing over now [23:17:56] I'm having trouble getting on tools-bastion-03 as well [23:17:57] that's the status that means that the client closed the socket prematurely [23:17:57] but unsure if related [23:18:27] my open shell on bastion-02 is dead [23:18:27] Same [23:18:31] `curl -vvv -H "Host: tools.wmflabs.org" http://instance-tools-proxy-02.tools.wmflabs.org.` works. it fails if I change it to proxy-01 [23:18:52] maybe it's a network issue? [23:19:06] that, uh, hmm [23:19:09] then how can I ssh? [23:19:29] it's something local to tools it seems like [23:19:31] ssh tools-proxy-02.tools.eqiad.wmflabs ?
[23:19:34] I just got ssh to tools-elastic-01.tools to work, so not universal if its networking [23:19:47] wtf [23:19:58] assuming you have *.wmflabs going via the normal labs bastions, not login.tools.wmflabs.org [23:20:00] curl localhost on tools-proxy-02 is failing too [23:20:02] or is just 'hung' [23:21:00] it might not be bound to localhost? [23:21:00] curl tools-proxy-02 on tools-proxy-02 [23:21:07] it has to be an nfs thing? or a storage thing [23:21:12] load on tools-bastion-03 is like 90 [23:21:16] hmm [23:21:20] ssh to tools-exec-1410.tools is acting like the packets are blackholed somewhere [23:21:23] madhuvishy: are you about? doing anything for nfs? [23:21:24] no that doesn't change anything [23:21:36] curl on localhost to tools-proxy-01 and -02 produces 499 too [23:21:45] which also feels like something network related, in one way [23:21:56] not doing failover, since that'll have no effect I think [23:22:19] packets are coming in, and then somehow it's 499, so the connection isn't being established [23:22:23] I'm going to reboot tools-bastion-03 and I want to see effect on load and availability [23:23:04] chasemp: yeah i'm around - but no not doing anything on nfs [23:24:06] tools-static is dead too [23:24:50] ok soo... [23:24:55] tools-checker is alive-ish [23:25:03] in that I can hit it, but it's 502 [23:25:31] but I cna't actually ssh into it [23:27:00] nope, ssh'd in. just very, very slow [23:27:00] root@tools-exec-1410:~# uptime [23:27:01] 23:25:57 up 159 days, 1:16, 1 user, load average: 49.90, 58.44, 42.78 [23:27:01] ssh taking forever on tools-checker [23:27:01] let me check if things without NFS have [23:27:42] ok, tools-puppetmaster-01 ssh happened fine [23:28:13] I noticed outbound for NFS on labstoer1001 was doing almost nothing [23:28:19] so I restarted nfs-kernel-server [23:28:28] along w/ load super high on spot checked execs and the bastion [23:28:30] Traceback (most recent call last): [23:28:30] File "/usr/local/bin/log-command-invocation", line 25, in [23:28:31] import argparse [23:28:31] File "/usr/lib/python2.7/argparse.py", line 92, in [23:28:31] from gettext import gettext as _ [23:28:33] and I restarted on labstore1003 for fun [23:28:33] File "/usr/lib/python2.7/gettext.py", line 49, in [23:28:35] import locale, copy, os, re, struct, sys [23:28:38] File "/usr/lib/python2.7/locale.py", line 1471, in [23:28:40] 'ru_ua.koi8u': 'ru_UA.KOI8-U', [23:28:43] MemoryError [23:28:45] I had this output error [23:29:02] PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [23:29:06] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.027 second response time [23:29:14] I really think that did it [23:29:41] yeah [23:29:44] weird part is [23:29:50] it wasn't in any errant state on either side [23:29:53] that I can tell [23:30:05] Is pinging to labstore.svc supposed to show 'Packet Filtered'? [23:30:11] yup [23:30:12] yeah probably [23:30:16] proxy works too [23:30:18] wtf [23:30:23] tcping labstore.svc 2049 [23:30:35] proxy doesn't even have nfs [23:30:55] no but it tries to hit enough boxes that are hung / frozen that do [23:31:11] idk yet [23:31:22] we can't get off of this nfs setup fast enough [23:31:33] yeah but those usually show up as 5xxx errors in error.log [23:31:37] than 499s [23:31:42] yeah [23:31:50] most bizzare outage [23:31:54] I have no explanation for that particular bit yeah [23:31:59] I'm still puzzling on it [23:32:15] network saturation? 
[23:32:15] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[23:32:31] also now that fire is out i am going to try to get off the sidewalk
[23:32:31] the few things I know point to NFS
[23:32:35] ok
[23:32:41] i'll brb
[23:36:41] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:15] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:51] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:38:03] !log tools added myself to the tools.admin service group earlier to try to figure out what was causing the outage, removed again now
[23:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[23:38:23] RECOVERY - Puppet run on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [0.0]
[23:38:58] so tools/nfs was out for ~15 minutes
[23:39:13] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:17] RECOVERY - Puppet run on tools-worker-1004 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:23] RECOVERY - Puppet run on tools-worker-1022 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:24] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:25] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:36] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:38] Problem here
[23:39:44] My webservice isn't restarting
[23:39:51] Nor starting
[23:40:10] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:40:16] RECOVERY - Puppet run on tools-worker-1019 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:40:18] RECOVERY - Puppet run on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:40:32] RECOVERY - Puppet run on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:19] RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:19] RECOVERY - Puppet run on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:19] RECOVERY - Puppet run on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:20] RECOVERY - Puppet run on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:20] RECOVERY - Puppet run on tools-worker-1015 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:44] RECOVERY - Puppet run on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:45] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:45] RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:46] Tool Labs is down again
[23:42:46] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:46] RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:47] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:50] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:59] abian, what problem are you seeing?
[23:43:13] Now it works
[23:43:13] Now up xD
[23:43:16] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:20] :)
[23:43:31] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:37] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:39] RECOVERY - Puppet run on tools-worker-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:41] I check out , which is a simple HTML page
[23:43:57] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:57] RECOVERY - Puppet run on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:07] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:07] RECOVERY - Puppet run on tools-worker-1025 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:11] RECOVERY - Puppet run on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:27] RECOVERY - Puppet run on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:27] RECOVERY - Puppet run on tools-worker-1014 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:49] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:01] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:03] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:11] RECOVERY - Puppet run on tools-grid-master is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:21] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:25] RECOVERY - Puppet run on tools-cron-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:25] RECOVERY - Puppet run on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:37] RECOVERY - Puppet run on tools-worker-1013 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:39] RECOVERY - Puppet run on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:51] RECOVERY - Puppet run on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:51] RECOVERY - Puppet run on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:46:17] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:46:37] RECOVERY - Puppet run on tools-static-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:46:39] RECOVERY - Puppet run on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:46:59] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
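The webservice complaint above is normally handled with the tool's own self-service commands once NFS is back. A minimal sketch follows, assuming a placeholder tool name of "mytool"; the exact webservice type and options vary per tool:

    # On a Tool Labs bastion, switch to the tool account first.
    become mytool

    # Check the current state, then bounce the web service.
    webservice status
    webservice restart

    # If the grid is still catching up, look at the tool's grid engine jobs as well.
    qstat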
[23:47:16] RECOVERY - Puppet run on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [0.0]
[23:47:34] RECOVERY - Puppet run on tools-worker-1023 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:47:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[23:47:42] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:48:14] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:48:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:48:30] RECOVERY - Puppet run on tools-worker-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:48:50] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:48:52] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[23:48:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:49:04] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:50:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:50:12] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:50:14] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:50:16] RECOVERY - Puppet run on tools-k8s-master-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:50:24] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:50:36] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[23:50:44] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:50:56] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:51:06] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[23:51:13] PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:51:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:51:23] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[23:51:51] PROBLEM - Puppet run on tools-exec-1208 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[23:52:01] PROBLEM - Puppet run on tools-checker-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:52:03] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[23:52:41] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[23:54:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:54:50] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[23:54:50] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[23:55:47] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[23:57:15] PROBLEM - SSH on tools-exec-1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:57:59] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[23:58:18] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
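The tail of alerts above (SSH socket timeouts, failing Puppet runs, and the load averages in the 50-90 range seen earlier in the log) is the usual signature of clients stuck in uninterruptible sleep on an unresponsive NFS mount. A quick way to confirm that from an affected instance is sketched below; /data/project is the usual Tool Labs shared mount, and the rest of the commands are illustrative rather than taken from the log:

    # Processes in state D (uninterruptible sleep) are the ones wedged on NFS I/O;
    # their count tracks the inflated load average almost exactly.
    ps -eo state,pid,wchan:30,comm | awk '$1 ~ /^D/'
    ps -eo state | grep -c '^D'

    # Probe the mount with a timeout so the checking shell does not wedge too.
    timeout 5 stat /data/project || echo "NFS mount is not responding"

    # Client-side NFS RPC counters and the mounts themselves.
    nfsstat -c
    grep nfs /proc/mounts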