[00:12:59] 10PAWS: Increase file upload size limit - https://phabricator.wikimedia.org/T144146#2589807 (10Capt_Swing) [00:30:51] PROBLEM - Free space - all mounts on tools-static-01 is CRITICAL: CRITICAL: tools.tools-static-01.diskspace._srv.byte_percentfree (<100.00%) [00:37:51] (03PS1) 10BryanDavis: Phabricator repo lookup is a "contains" not exact match [labs/striker] - 10https://gerrit.wikimedia.org/r/307229 (https://phabricator.wikimedia.org/T144139) [00:37:53] (03PS1) 10BryanDavis: Only display repo urls that are visible in Phabricator [labs/striker] - 10https://gerrit.wikimedia.org/r/307230 (https://phabricator.wikimedia.org/T143957) [00:37:55] (03PS1) 10BryanDavis: Mark http repo URLs as hidden by default [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) [00:37:57] (03PS1) 10BryanDavis: Add link to diffusion repo on detail page [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) [00:40:22] (03PS2) 10BryanDavis: Catch and log database errors while saving models [labs/striker] - 10https://gerrit.wikimedia.org/r/307110 (https://phabricator.wikimedia.org/T144082) [00:41:07] (03CR) 10BryanDavis: [V: 032] Catch and log database errors while saving models [labs/striker] - 10https://gerrit.wikimedia.org/r/307110 (https://phabricator.wikimedia.org/T144082) (owner: 10BryanDavis) [00:42:23] (03CR) 10BryanDavis: [C: 032] Catch and log database errors while saving models [labs/striker] - 10https://gerrit.wikimedia.org/r/307110 (https://phabricator.wikimedia.org/T144082) (owner: 10BryanDavis) [00:45:02] (03Merged) 10jenkins-bot: Catch and log database errors while saving models [labs/striker] - 10https://gerrit.wikimedia.org/r/307110 (https://phabricator.wikimedia.org/T144082) (owner: 10BryanDavis) [00:45:28] (03CR) 10Alex Monk: [C: 032] Update site.css [labs/striker/staticfiles] - 10https://gerrit.wikimedia.org/r/307136 (https://phabricator.wikimedia.org/T143972) (owner: 10BryanDavis) [00:45:34] (03Merged) 10jenkins-bot: Update site.css [labs/striker/staticfiles] - 10https://gerrit.wikimedia.org/r/307136 (https://phabricator.wikimedia.org/T143972) (owner: 10BryanDavis) [00:49:26] (03PS2) 10BryanDavis: Only display repo urls that are visible in Phabricator [labs/striker] - 10https://gerrit.wikimedia.org/r/307230 (https://phabricator.wikimedia.org/T143957) [00:49:28] (03PS2) 10BryanDavis: Mark http repo URLs as hidden by default [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) [00:49:30] (03PS2) 10BryanDavis: Phabricator repo lookup is a "contains" not exact match [labs/striker] - 10https://gerrit.wikimedia.org/r/307229 (https://phabricator.wikimedia.org/T144139) [00:49:32] (03PS2) 10BryanDavis: Add link to diffusion repo on detail page [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) [00:59:33] (03PS2) 10BryanDavis: Install/upgrade via wheels rather than complete venv reload [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307085 [01:00:03] (03CR) 10BryanDavis: [C: 032] Install/upgrade via wheels rather than complete venv reload [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307085 (owner: 10BryanDavis) [01:00:09] (03Merged) 10jenkins-bot: Install/upgrade via wheels rather than complete venv reload [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307085 (owner: 10BryanDavis) [01:02:42] (03CR) 10Alex Monk: [C: 032] Phabricator repo lookup is a "contains" not exact match 
[labs/striker] - 10https://gerrit.wikimedia.org/r/307229 (https://phabricator.wikimedia.org/T144139) (owner: 10BryanDavis) [01:03:46] (03Merged) 10jenkins-bot: Phabricator repo lookup is a "contains" not exact match [labs/striker] - 10https://gerrit.wikimedia.org/r/307229 (https://phabricator.wikimedia.org/T144139) (owner: 10BryanDavis) [01:07:51] (03CR) 10Alex Monk: [C: 032] Only display repo urls that are visible in Phabricator [labs/striker] - 10https://gerrit.wikimedia.org/r/307230 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [01:10:31] (03Merged) 10jenkins-bot: Only display repo urls that are visible in Phabricator [labs/striker] - 10https://gerrit.wikimedia.org/r/307230 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [01:10:57] you're a machine Krenair! Thanks :) [01:13:00] thanks bd808. I'll look at the rest later [01:14:39] bd808, can you add me as a reviewer so I don't forget? [01:21:30] PROBLEM - Free space - all mounts on tools-web-static-02 is CRITICAL: CRITICAL: tools.tools-web-static-02.diskspace._srv.byte_percentfree (<40.00%) [01:25:29] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2589888 (10Tgr) @yuvipanda, but what are those credentials? Sharing the client_id/access_key should be safe and it's hard to debug the problem without that. [01:36:37] (03CR) 10BryanDavis: "cherry-picked to https://striker.wmflabs.org/ for verification." [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [01:36:54] (03CR) 10BryanDavis: "cherry-picked to https://striker.wmflabs.org/ for verification" [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) (owner: 10BryanDavis) [01:49:29] PROBLEM - Free space - all mounts on tools-web-static-01 is CRITICAL: CRITICAL: tools.tools-web-static-01.diskspace._srv.byte_percentfree (<40.00%) [01:55:46] 06Labs, 10Phabricator, 07Puppet: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#2589893 (10Peachey88) [02:15:12] (03PS1) 10BryanDavis: Add django_log_request_id wheel [labs/striker/wheels] - 10https://gerrit.wikimedia.org/r/307233 (https://phabricator.wikimedia.org/T143949) [02:17:28] (03CR) 10BryanDavis: [C: 032] Add django_log_request_id wheel [labs/striker/wheels] - 10https://gerrit.wikimedia.org/r/307233 (https://phabricator.wikimedia.org/T143949) (owner: 10BryanDavis) [02:17:34] (03Merged) 10jenkins-bot: Add django_log_request_id wheel [labs/striker/wheels] - 10https://gerrit.wikimedia.org/r/307233 (https://phabricator.wikimedia.org/T143949) (owner: 10BryanDavis) [02:23:39] (03PS1) 10BryanDavis: Bump submodules: static, striker, wheels [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307234 [02:25:11] (03CR) 10BryanDavis: [C: 032] Bump submodules: static, striker, wheels [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307234 (owner: 10BryanDavis) [02:25:17] (03Merged) 10jenkins-bot: Bump submodules: static, striker, wheels [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307234 (owner: 10BryanDavis) [04:52:38] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2589908 (10yuvipanda) ah, sure. 0a73e346a40b07262b6e36bdba01cba4 is the client_id. 
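The "Install/upgrade via wheels rather than complete venv reload" change above follows the usual pip wheel pattern; a minimal sketch of that workflow, with the paths and requirements file as illustrative placeholders rather than the actual labs/striker/deploy layout:
```
# Build wheels once from the pinned requirements (done ahead of time, e.g. in a wheels repo).
pip wheel --wheel-dir=wheels/ -r requirements.txt

# At deploy time, install/upgrade from the pre-built wheels, without touching PyPI
# and without recreating the whole virtualenv.
venv/bin/pip install --no-index --find-links=wheels/ --upgrade -r requirements.txt
```
Installing from local wheels skips PyPI and C-extension rebuilds on each deploy, which appears to be the point of the change.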
[04:54:34] 06Labs, 10Beta-Cluster-Infrastructure: puppet::self hosts now have two servers set - https://phabricator.wikimedia.org/T144108#2589909 (10yuvipanda) @mmodell https://phabricator.wikimedia.org/T120159 [04:55:51] 06Labs, 10Beta-Cluster-Infrastructure: puppet::self hosts now have two servers set - https://phabricator.wikimedia.org/T144108#2589910 (10yuvipanda) role::puppet::self for puppet *clients* is doubly terrible. I'll spend next week getting rid of that across labs - see https://phabricator.wikimedia.org/T120159#2... [05:40:28] 06Labs: Clean up data in /data/scratch/mwoffliner - https://phabricator.wikimedia.org/T144025#2589926 (10Kelson) This directory works as a cache. This is why it's pretty big. One time T117095 is fully implemented, the big part of it will be removed. I plan to work on this during September. [05:46:08] 10PAWS: Increase file upload size limit - https://phabricator.wikimedia.org/T144146#2589931 (10yuvipanda) I'll look into this! Right now, you can put it somewhere else and 'wget' it from 'new -> terminal' in PAWS. I'll increase the file limit as well, but that can go only so far. I might also enable scp access... [06:27:18] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2589978 (10Tgr) The access token is `7cb5b315a11d0dcbe46d1c90332dd210` for Dvorapa (timestamp: 20160827125759) and `0d64b706435230a05213f605ff1ad8ac` for Framawiki (20160829061415). Does th... [06:59:52] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2589990 (10yuvipanda) Hmm, nope that doesn't match. [07:00:45] 10PAWS: Disable password based login on pwb on PAWS - https://phabricator.wikimedia.org/T144151#2589991 (10yuvipanda) [07:01:54] 10PAWS: Disable password based login on pwb on PAWS - https://phabricator.wikimedia.org/T144151#2589991 (10yuvipanda) (I've found and deleted a couple of these) [07:03:17] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2590006 (10yuvipanda) Also, @Dvorapa @Framawiki please do not use password based login. This is highly insecure - I've filed T144151 to disable it on PAWS. In the meantime, you two should p... [07:04:14] 10PAWS, 10Jupyter-Hub: I can't login my bot in JUPYTER - https://phabricator.wikimedia.org/T135306#2590009 (10yuvipanda) @Maathavan I think I've fixed this now finally. Can you try again? [07:30:57] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Move kubernetes authentication to using X.509 client certs - https://phabricator.wikimedia.org/T144153#2590052 (10yuvipanda) [07:50:15] 10PAWS: Implement a 'signing OAuth Proxy' for PAWS - https://phabricator.wikimedia.org/T120469#2590093 (10Tgr) See [[https://www.sans.org/reading-room/whitepapers/application/attacks-oauth-secure-oauth-implementation-33644|this article]] (section 2) for the threat model. In short, # a hostile user can send reque... 
[08:06:51] (03CR) 10Alex Monk: [C: 032] Mark http repo URLs as hidden by default [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [08:07:31] (03Merged) 10jenkins-bot: Mark http repo URLs as hidden by default [labs/striker] - 10https://gerrit.wikimedia.org/r/307231 (https://phabricator.wikimedia.org/T143957) (owner: 10BryanDavis) [08:10:06] (03CR) 10Alex Monk: [C: 032] Add link to diffusion repo on detail page [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) (owner: 10BryanDavis) [08:10:46] (03Merged) 10jenkins-bot: Add link to diffusion repo on detail page [labs/striker] - 10https://gerrit.wikimedia.org/r/307232 (https://phabricator.wikimedia.org/T143975) (owner: 10BryanDavis) [08:55:34] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2590203 (10Urbanecm) @yuvipanda Okay, so sorry for my advice :). I wanted to give them access to the service somehow and not reveal them passwords :). BTW, when they run chmod 600 on lwp fi... [09:01:30] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2590225 (10Urbanecm) I filled T144157 for set up PWB as I described. [09:32:06] 10Tool-Labs-tools-Pageviews, 10Analytics, 10Pageviews-API: siteviews data for 2016 August 27 appears to be empty - https://phabricator.wikimedia.org/T144159#2590263 (10Amire80) [09:48:20] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [10:38:07] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [10:53:56] 10PAWS, 10MediaWiki-extensions-OAuth, 10Pywikibot-OAuth: PAWS can not login - https://phabricator.wikimedia.org/T136114#2590426 (10Framawiki) Ok, i understand. 
(@yuvipanda, I just send you an email) [11:08:03] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:40:56] PROBLEM - Puppet staleness on tools-exec-1211 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [11:41:54] (03CR) 10Hashar: [C: 031] Add gerrit project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307071 (owner: 10Paladox) [12:04:05] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:39:04] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [12:56:17] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [12:58:09] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590777 (10hashar) [12:59:25] if anyone is lurking around, we could use a restart of tool.jouncebot https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [12:59:30] it is idling / dead apparently :( [13:05:04] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [13:30:26] 06Labs, 10Tool-Labs, 10DBA: u13367 is running 2 inefficient 9-day-long queries, causing high cpu usage - https://phabricator.wikimedia.org/T144180#2590904 (10jcrespo) [13:35:33] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590935 (10chasemp) p:05Triage>03Normal 18 members of https://phabricator.wikimedia.org/project/members/20/, some of whom I don't recognize. Please provide... [13:36:17] RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0] [13:37:11] 06Labs, 15User-Luke081515: Revert: Request increased quota for rcm labs project - https://phabricator.wikimedia.org/T142311#2590951 (10chasemp) 05Open>03Resolved [13:37:13] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2590952 (10chasemp) [13:37:54] 06Labs, 10Tool-Labs, 10DBA: u13367 is running 2 inefficient 9-day-long queries, causing high cpu usage - https://phabricator.wikimedia.org/T144180#2590954 (10Urbanecm) I'm sorry for it. I'm trying to update list of 500 most linked disambigs at cswiki. I have some script but I want to convert it to one SQL qu... [13:39:38] 06Labs, 10Tool-Labs, 10DBA: u13367 is running 2 inefficient 9-day-long queries, causing high cpu usage - https://phabricator.wikimedia.org/T144180#2590958 (10Urbanecm) ATM all my processes including any mysql consoles should be killed, no job should be running from my personal account, from my two tools only... [13:40:41] !log tools restart jouncebot [13:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:48:21] 06Labs, 13Patch-For-Review: Kill ldapsupportlib.py - https://phabricator.wikimedia.org/T114063#1683588 (10hashar) Just found that `ldaplist` is scheduled for deletion. I am still relying on it because its syntax is quite trivial. If I want to lookup my LDAP informations I just: ldaplist -l passwd hashar W... 
[13:50:44] 06Labs, 10Tool-Labs, 10DBA, 07Tracking: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#2590976 (10jcrespo) [13:50:46] 06Labs, 10Tool-Labs, 10DBA, 15User-Urbanecm: u13367 is running 2 inefficient 9-day-long queries, causing high cpu usage - https://phabricator.wikimedia.org/T144180#2590973 (10jcrespo) 05Open>03Resolved a:03Urbanecm Yes, I killed the queries. Normally the query killer should limit these, but for some... [13:52:06] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590981 (10hashar) [13:52:32] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590777 (10hashar) My bad sorry. Edited with list of each of our labs shell accounts: ``` dduvall demon gjg hashar twentyafterfour thcipriani zfilipin ``` [13:54:28] o/ [13:54:37] I've got an instance where I can't access "/srv [13:54:43] I'm trying to figure out what's going on [13:54:53] ores-compute-01.ores.eqiad.wmflabs [13:55:59] "sysfs" [13:56:28] halfak: I can't seem to get on that atm [13:56:55] chasemp, Can't log in or can't access /sys on the machine? [13:57:05] can't login [13:57:15] Hmm... I'm definitely able to log in [14:01:29] chasemp, any suggestions? [14:01:33] Maybe just reboot? [14:01:54] halfak: I'm not sure it hangs on me, I can reboot for you if you'd like [14:02:13] Sure if you've got it handy [14:03:12] !log ores reboot 94886e74-5be4-4669-a1c1-840ce7c65de9 ores-compute-01 [14:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [14:03:50] thanks charitwo [14:03:56] * chasemp [14:04:11] * halfak needs to type more characters before wildly hitting [TAB] [14:05:14] I'm an enigma of tab completion [14:05:41] * halfak tries to log back in. [14:10:04] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:11:36] Yay! It works! [14:15:36] 06Labs, 06Operations: Connect secondary nic for labstore1004 and labstore1005 - https://phabricator.wikimedia.org/T144183#2591023 (10chasemp) [14:37:49] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2591094 (10chasemp) It's possibly early morning fugue state but I don't see: demon (I do see chad) twentyafterfour gjg I added (to tools if necessary as well):... [14:49:55] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2590777 (10Paladox) @chasemp that would be ^demon [15:01:07] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:03:33] !log tools.jouncebot Stopped & started bot to try and get it back in #wikimedia-operations [15:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.jouncebot/SAL, Master [15:04:23] bd808: did that work? 
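For a hang like the /srv one halfak reports above, a bounded stat plus a look at the mount table usually separates a wedged NFS mount from a permissions problem; a generic sketch, not the commands actually run during this incident:
```
# Does the suspect path answer within a few seconds, or does it hang?
timeout 5 stat /srv && echo "/srv responds" || echo "/srv appears hung"

# What is mounted there, and is anything NFS-backed?
grep -E '/srv|nfs' /proc/mounts

# Recent kernel messages often show "nfs: server ... not responding" for a dead mount.
dmesg | tail -n 20
```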
[15:04:42] nope :( [15:05:13] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [15:05:24] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [15:05:36] the last thing it logs is "2016-08-29 15:02:13,417 - INFO - Attempting to join channel #wikimedia-operations" [15:05:43] but no joy on the join [15:06:24] bd808: could it be part of the anti bot / spam stuff that has been ongoing there? [15:06:38] I really don't know but every time I look there is a new strategy [15:06:58] maybe, but I'd sort of expect it to log if it was klined or something [15:07:36] I'll poke at it a bit. I think it needs a rebuild of the venv too. There are some libxml warnings [15:11:06] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218) [15:12:17] 10Tool-Labs-tools-Other, 10Deployment-Systems: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591190 (10bd808) [15:20:55] 10Tool-Labs-tools-Other, 10Deployment-Systems: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591190 (10Paladox) Could it be someone quieted the ip of stashbot? [15:37:57] !unban yuvipanda [15:38:11] 06Labs, 10Tool-Labs, 07Wikimedia-Incident: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638#2591275 (10chasemp) we agreed to make a check for the proxy health itself today, I'll get added to the PAWS check and we'll iterate on this from there. [15:38:12] !log tools.jouncebot Cherry-picked https://gerrit.wikimedia.org/r/#/c/307315/ and restarted bot [15:38:15] 10Tool-Labs-tools-Pageviews, 10Analytics, 10Pageviews-API: siteviews data for 2016 August 27 appears to be empty - https://phabricator.wikimedia.org/T144159#2590263 (10Milimetric) A job failed, is being rerun now, so things should be in order soon. We'll track and close this when it's resolved. Thanks for... [15:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.jouncebot/SAL, Master [15:38:53] 06Labs, 10Tool-Labs, 13Patch-For-Review, 07Wikimedia-Incident: Tune nginx config parameters for tools / labs proxies - https://phabricator.wikimedia.org/T143637#2591278 (10chasemp) Stauts from labs meeting: we are porting static to jessie to work out needed tuning [15:38:58] 10Tool-Labs-tools-Pageviews, 10Analytics, 06Analytics-Kanban, 10Pageviews-API: siteviews data for 2016 August 27 appears to be empty - https://phabricator.wikimedia.org/T144159#2591279 (10Milimetric) p:05Triage>03Normal [15:40:42] 10Tool-Labs-tools-Pageviews, 06Analytics-Kanban: siteviews data for 2016 August 27 appears to be empty - https://phabricator.wikimedia.org/T144159#2591287 (10Milimetric) [15:46:53] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2579978 (10Andrew) +1 this is fine [15:52:18] 10Tool-Labs-tools-Other, 10Deployment-Systems, 13Patch-For-Review: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591368 (10bd808) I patched in some additional logging, but am not seeing any clear reason why things aren't working yet: ``` 2016-08-29 15:47:58,387 - IN... 
[15:57:40] 10Tool-Labs-tools-Other, 10Deployment-Systems, 13Patch-For-Review: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591407 (10bd808) ``` [09:56] jouncebot has userhost tools.joun@instance-tools-exec-1404.tools.wmflabs.org and real name “https://github.com/mattofak/joun... [16:11:46] 06Labs, 10Phabricator, 07Puppet: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#2591508 (10mmodell) related {T131899} [16:14:15] 10Tool-Labs-tools-Other, 10Deployment-Systems, 13Patch-For-Review: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591190 (10Platonides) Fixed by temporarily removing the inheritance from #wikimedia-bans Most probably, it was affected by the “ban everyone not register... [16:16:46] 06Labs, 07Tracking: Existing Labs project quota increase requests (Tracking) - https://phabricator.wikimedia.org/T140904#2591567 (10chasemp) [16:16:49] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2591564 (10chasemp) 05Open>03Resolved a:03chasemp Should be gtg [16:24:12] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2591683 (10Paladox) @chasemp thanks and what do you mean by gtg? [16:25:10] 10Tool-Labs-tools-Other, 10Deployment-Systems, 13Patch-For-Review: Jouncebot not joining #wikimedia-operations - https://phabricator.wikimedia.org/T144189#2591699 (10bd808) >>! In T144189#2591536, @Platonides wrote: > Fixed by temporarily removing the inheritance from #wikimedia-bans Most probably, it was af... [16:29:02] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2591726 (10chasemp) good to go :) [16:32:26] 06Labs: Request increased quota for git labs project - https://phabricator.wikimedia.org/T143815#2591769 (10Paladox) Ah thanks :) [16:35:44] !log tools run chmod u+x /data/project/framabot [16:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [16:41:03] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:46:26] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2591854 (10yuvipanda) I moved striker, and @bd808 reports it is all good. the instance with this issue in the shinken proje... [16:52:05] 06Labs, 10Labs-project-Extdist: http://extdist.wmflabs.org/ 502's (Bad Gateway) - https://phabricator.wikimedia.org/T143209#2591874 (10chasemp) 05Open>03Resolved a:03chasemp this seems back now [16:53:57] 06Labs, 10Tool-Labs: Maintainers are not shown in the Tools list - https://phabricator.wikimedia.org/T142684#2591891 (10chasemp) 05Open>03Resolved a:03chasemp Please look at https://toolsadmin.wikimedia.org/. I think this is resolved there and should be considered canonical. [16:55:37] 06Labs: Can not kill job on tools labs - https://phabricator.wikimedia.org/T138924#2414164 (10chasemp) @Magnus, is this still the case? can you add some more details so we can look into it? 
[17:03:17] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22) [17:07:03] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:13:51] 06Labs, 10Tool-Labs: Setup monitoring for cdnjs git pull - https://phabricator.wikimedia.org/T144215#2591965 (10madhuvishy) [17:22:28] (03PS1) 10BryanDavis: Bump striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307334 [17:22:47] (03CR) 10BryanDavis: [C: 032] Bump striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307334 (owner: 10BryanDavis) [17:22:54] (03Merged) 10jenkins-bot: Bump striker submodule [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/307334 (owner: 10BryanDavis) [17:30:50] RECOVERY - Free space - all mounts on tools-static-01 is OK: OK: All targets OK [17:32:23] RECOVERY - Free space - all mounts on tools-static-02 is OK: OK: All targets OK [17:32:25] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170) [17:35:33] PROBLEM - Puppet run on tools-static-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:45:40] PROBLEM - Puppet run on tools-web-static-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:01:51] PROBLEM - Free space - all mounts on tools-static-01 is CRITICAL: CRITICAL: tools.tools-static-01.diskspace._srv.byte_percentfree (<55.56%) [18:06:23] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2592177 (10yuvipanda) servermon is done. I'll do the analytics ones next. [18:10:34] RECOVERY - Puppet run on tools-static-01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:12:04] RECOVERY - Puppet run on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:13:22] PROBLEM - Free space - all mounts on tools-static-02 is CRITICAL: CRITICAL: tools.tools-static-02.diskspace._srv.byte_percentfree (<60.00%) [18:17:33] !log analytics kill ldap entry for analytics303, doesn't exist [18:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Analytics/SAL, Master [18:19:00] 06Labs, 10Phabricator, 13Patch-For-Review, 07Puppet: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#2592218 (10Paladox) [18:19:05] 06Labs, 10Phabricator, 07Puppet: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2592220 (10Paladox) [18:19:32] 06Labs, 10Phabricator, 13Patch-For-Review, 07Puppet: Update phabricator puppet role to support use on labs - https://phabricator.wikimedia.org/T144112#2588887 (10Paladox) We will fix production role first and after that remove the labs role once production role works in labs :) [18:23:33] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2592244 (10yuvipanda) The analytics project doesn't actually seem to have any! These were just stale LDAP entries for insta... 
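For the "Can not kill job on tools labs" task pinged above, the normal kill path on the grid engine setup of the time goes through the jsub wrappers or qdel; a generic sketch, with the tool and job names as placeholders:
```
become mytool        # switch to the tool account (placeholder name)
qstat                # list the tool's grid jobs and their numeric job IDs
jstop myjob          # stop a named continuous job via the wrapper
qdel 1234567         # or delete a specific job by ID
qdel -f 1234567      # force-delete if the job is stuck in a 'dr' (deleting) state
```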
[18:30:40] RECOVERY - Puppet run on tools-web-static-01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:48:19] !log toolsbeta reboot toolsbeta-puppetmaster3, puppet run process became Zommmmbiiiieeee, ate all my brains [18:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL, Master [18:52:34] (03CR) 1020after4: [C: 032] Add gerrit project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307071 (owner: 10Paladox) [18:57:36] (03Merged) 10jenkins-bot: Add gerrit project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307071 (owner: 10Paladox) [19:14:51] 06Labs, 10Phabricator, 07Puppet: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2592521 (10Paladox) 05duplicate>03Open Re opening for now since the main role will take a while to fix. [19:18:50] !log toolsbeta reboot toolsbeta-mail, seems, uh, stuck [19:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL, Master [19:20:24] !log toolsbeta reboot toolsbeta-master, seems, uh, stuck [19:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL, Master [19:32:41] Platonides: Do you know what the current open source ipsec based cross platform vpn client of choice is? [19:33:05] ipsec? [19:33:26] my first thought was openvpn, but that's userland, not ipsec [19:33:32] isn't that part of the kernel? [19:34:21] Microsoft has crap support for vpn's [19:34:33] Linux Kernel ≥2.5.47 has a native IPsec stack [19:34:53] well, it's Microsoft… ;) [19:34:58] Maybe Windows10 is better.... [19:36:24] Windows has "easy PPTP" [19:36:41] I've used Cisco client quite a lot (not free) or https://www.shrew.net/download/vpn (seemed to have stopped). [19:36:42] however, that doesn't mean it's secure… https://www.schneier.com/academic/pptp/ [19:36:54] PPTP is not secure at all [19:36:54] Cisco uses ipsec? [19:37:12] Yeah, Cisco ASA is all about ipsec [19:38:28] I'm not a big fan of the SSL based VPN because every company has a different implementation. [19:40:28] * bd808 liks ssh tunnels as vpn method, but is weird like that [19:40:37] multichill: I think the built-in client in win10 (and win7?) 
supports ipsec, but I'm not 100% sure [19:41:02] 10PAWS, 10Discussion-modeling: Detox on Paws - https://phabricator.wikimedia.org/T144234#2592591 (10ellery) [19:41:13] xauth is missing in the win7 client afaik [19:41:27] actually, with a good ssh client, their tunnels are great [19:53:36] !log tools.stashbot Updated to 04e6f98 (If authenticating, pause to let auth actually work) && added authentication as registered stashbot freenode account [19:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL, Master [19:56:00] 10PAWS, 10Discussion-modeling: Detox notebooks on PAWS - https://phabricator.wikimedia.org/T144234#2592641 (10DarTar) [19:56:39] 10PAWS, 06Research-and-Data-Backlog, 07Epic, 03Research-and-Data-2017-Q1: Launch Open Notebooks Infrastructure - https://phabricator.wikimedia.org/T140430#2592644 (10DarTar) [19:56:41] 10PAWS, 10Discussion-modeling: Detox notebooks on PAWS - https://phabricator.wikimedia.org/T144234#2592591 (10DarTar) [19:59:17] 10PAWS, 06Research-and-Data-Backlog, 07Epic, 03Research-and-Data-2017-Q1: Launch Open Notebooks Infrastructure - https://phabricator.wikimedia.org/T140430#2592655 (10DarTar) @Capt_Swing if you have a few cycles to help with this, it'd be awesome. It would also be worth coordinating doc work with the (separ... [20:09:34] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:11:00] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:11:45] PROBLEM - Free space - all mounts on tools-puppetmaster-01 is CRITICAL: CRITICAL: tools.tools-puppetmaster-01.diskspace._public_dumps.byte_percentfree (No valid datapoints found)tools.tools-puppetmaster-01.diskspace.root.byte_percentfree (<10.00%) [20:12:25] waaat [20:12:27] PROBLEM - Puppet run on tools-exec-1214 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:12:29] PROBLEM - Puppet run on tools-prometheus-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:12:31] also waaat, /public/dumps?! [20:12:32] uh oh [20:12:37] PROBLEM - Puppet run on tools-exec-1406 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:12:38] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:13:04] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:16:13] looks like puppet did fail, but was transient [20:16:59] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2592701 (10yuvipanda) did toolsbeta, or at least the toolsbeta instances that are still sshable. Many didn't come back up a... [20:21:44] RECOVERY - Free space - all mounts on tools-puppetmaster-01 is OK: OK: tools.tools-puppetmaster-01.diskspace._public_dumps.byte_percentfree (No valid datapoints found) [20:22:01] well done, shinken-wm [20:22:06] ^ yuvipanda it probably makes sense not to check NFS on every run there yeah? 
[20:22:18] it's never going to deviate from on the server [20:22:26] but the check itself is load etc [20:22:38] not too worrysome tho [20:22:39] chasemp I have no idea why it's checking NFS tho [20:22:42] it used to not [20:22:47] ah even more interesting [20:22:52] rather, there was no graphite metric for the NFS mounts at all [20:23:01] so I was surprised when the alert popped up [20:23:10] huh.... [20:23:18] the alert just has a * in there, so if there's nfs mount stats there, it'll alert [20:23:33] but the diamond collector wasn't sending nfs mount stats when I first set it up, and I've never seen an alert like this before [20:23:38] ah [20:32:37] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:47:29] RECOVERY - Puppet run on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:48:01] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:49:33] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [20:50:59] RECOVERY - Puppet run on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [20:52:25] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0] [20:52:37] RECOVERY - Puppet run on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [20:55:12] bd808, SSH tunnels are good, but they're bad because you have to connect to localhost for the tunnel. [20:55:52] So if something is configured to connect to a specific instance name.. e.g. a Javascript web interface [20:56:06] Lots of head scratching. :( [21:00:25] tom29739: you could use -D [21:00:35] that removes flexibility [21:00:42] but a vpn route does, too [21:00:55] I use Windows with Putty, [21:01:04] I don't think it has that option :/ [21:01:18] Nor does it have an easy ProxyCommand. [21:01:25] * Platonides quotes himself: "with a good ssh client,…" [21:02:04] SSH tunnels are sometimes called "a poor man's VPN" [21:02:11] 10Tool-Labs-tools-Pageviews: Add "subpages" as a source to Massviews - https://phabricator.wikimedia.org/T144238#2592817 (10MusikAnimal) [21:02:21] Why use an SSH tunnel if you could use a VPN? [21:02:53] tunnel-stacking? [21:03:14] It's slow on here. [21:08:12] Platonides, Linux users have it easy :/ [21:10:52] * Platonides replaces tom29739's windows with Ubuntu [21:11:10] Oi! Windows is good for some things! [21:11:23] Like MS Office for instance. [21:42:24] o/ yuvipanda. I'm having a PAWS issue. My kernel keeps getting restarted. Do you know why that's happening? [21:46:44] I guess I'll just dev locally for a while [21:52:28] hey halfak [21:52:32] taking a look [21:53:08] 06Labs, 10Tool-Labs, 10Deployment-Systems: Add release engineering people to tools.jouncebot user group - https://phabricator.wikimedia.org/T144175#2592987 (10AlexMonk-WMF) 05Open>03Resolved a:03chasemp @chasemp, if you added through wikitech then it's possible you were trying to use the provided uids... [21:53:24] halfak I started a new terminal, and it seems to work fine. [21:53:29] which notebook was having issues for you? [21:53:50] yuvipanda, it happens in a loop at the end of this notebook:http://paws-public.wmflabs.org/paws-public/User:EpochFail/projects/vectors_demo/damage_detection_test.ipynb [21:53:57] I turned the loop into a batch just in case [21:55:02] Woops. Looks like I made an edit that broke it. Fixing now [21:55:35] fixed [21:55:59] The loop is running again. 
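The -D and ProxyCommand options discussed above look like this with OpenSSH; a sketch, with the bastion host and user name as placeholders (PuTTY exposes the same SOCKS forwarding as a "Dynamic" tunnel under Connection > SSH > Tunnels):
```
# Dynamic (SOCKS) forwarding: applications then use localhost:1080 as a SOCKS5 proxy
# and can reach instances by name without a per-host tunnel.
ssh -D 1080 -N user@bastion.wmflabs.org

# ~/.ssh/config equivalent using ProxyCommand to jump through the bastion transparently:
#   Host *.eqiad.wmflabs
#       ProxyCommand ssh -W %h:%p user@bastion.wmflabs.org
```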
Usually, I get through about 300 (out of ~20k) before it crashes [21:56:04] halfak ok! :) I see it running, and I see it is ramping up in memory use [21:56:05] I'll get the exact error message this time [21:56:18] but maybe not, I see it winding down as well [21:56:18] Ohhh.. That could totally be the issue. [21:56:20] let's see [21:56:28] What's my max memory? [21:56:48] halfak 1G I think. [21:56:56] I see. That could be the problem. [21:57:27] I'll definitely need a lot of memory for this maneuver [21:57:31] yeah. I should probably have a mechanism for user groups where people have higher limits. [21:57:40] how much is 'lot of memory'? [21:57:44] I'm guessing ~6GB [21:57:54] Looks like I'm not doing too bad though. [21:58:00] 300-400 MB [21:58:08] yeah [21:58:09] so far so good [21:58:18] has it crashed yet? [21:58:22] Nope [21:58:24] Maybe I hit a degenerate revision eventually [21:58:30] Up to 550 MB [21:58:35] back to 400MB [21:59:14] 650MB [21:59:19] 460MB [21:59:21] lol [22:01:32] (I crashed my browser trying to get a dashboard) [22:01:36] halfak has it crashed yet? [22:01:52] nope. Might have made it longer than last time. [22:02:19] I just saved to give you more stderr dots :) [22:04:57] Saved again [22:05:03] Still going. Weird. [22:06:56] (03Draft1) 10Paladox: Add GerritBot project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307439 [22:09:32] (03PS2) 10Paladox: Add GerritBot project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307439 [22:10:56] the free internet I am leeching off decided to crap out just now. whee [22:11:35] Still not crashed yuvipanda [22:11:43] -._o_.- [22:12:12] 908MB [22:12:16] 580MB [22:12:22] right [22:12:29] so if it crossed 1023 I think it'll crash [22:12:55] Strange. there should be determinism in this execution. [22:13:04] Oh well. I have a script running locally now [22:13:10] So my work will be saved either way. [22:15:17] halfak I'm looking at fresh new https://grafana-labs-admin.wikimedia.org/dashboard/db/paws [22:15:26] and I think you hit 1023 once [22:15:33] (everything is duplicated, need to fix that) [22:15:38] this is sampled only ever minute [22:15:42] so won't catch everything [22:15:55] gotcha [22:16:16] Maybe I can make it [22:16:21] Quit looking at me watchdog! [22:16:28] :D [22:16:35] I only want a lot of memory for a second [22:16:39] I need to figure out how to properly make this apparent [22:16:52] right. that works well for CPU, not so much for memory tho [22:17:11] (with our current configuration, at least) [22:17:23] you can spike CPU when others aren't using it, but not memory [22:18:08] anyway, I gotta step out for a bit, I'll brb [22:18:16] Thanks for your help! [22:20:17] (03Draft1) 10Paladox: Add grrrit-wm project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307441 [22:20:59] (03PS2) 10Paladox: Add grrrit-wm project to #wikimedia-releng channel [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/307441 [22:33:44] 06Labs: Clean up data in /data/scratch/mwoffliner - https://phabricator.wikimedia.org/T144025#2593137 (10madhuvishy) @Kelson Since it's this big and is cache data, we weren't planning to migrate this on to the new scratch setup in labstore1003 (This switch happens 8/31). [22:36:12] STILL GOING! [22:36:14] MWAHAHAHA [22:36:22] It's going to crash right before it finishes. 
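The limit-watching yuvipanda is doing above can also be done from a PAWS terminal; a rough sketch, assuming the notebook kernel shows up as a python3 process and that the container exposes the usual cgroup v1 memory files (both are assumptions, and the paths may differ):
```
# Sample the kernel's resident memory (RSS, in KB) once a minute.
while true; do
    ps -o pid=,rss=,args= -C python3 | sort -k2 -n | tail -n 1
    sleep 60
done

# The container's memory ceiling, if the cgroup files are visible from inside:
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null
```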
[22:36:27] I know it [22:49:35] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:49:43] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:49:47] PROBLEM - Puppet run on tools-worker-1012 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:50:07] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:50:49] PROBLEM - Puppet run on tools-exec-1208 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:51:05] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:51:15] PROBLEM - Puppet run on tools-worker-1016 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:51:41] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:51:41] PROBLEM - Puppet run on tools-web-static-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:52:03] PROBLEM - Puppet run on tools-worker-1011 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:52:07] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1846901 (10AlexMonk-WMF) So it seems the list is now: - deployment-prep - integration - wikidata-query - etcd >>!... [22:52:15] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:52:19] PROBLEM - Puppet run on tools-worker-1008 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:52:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:52:51] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:52:53] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:53:17] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:53:21] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:53:29] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:53:58] PROBLEM - Puppet run on tools-exec-1215 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:54:02] PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:54:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:54:12] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:54:12] 06Labs: Purge stale data from LDAP - https://phabricator.wikimedia.org/T138150#2391674 (10AlexMonk-WMF) I ran into this thing recently: ```dn: dc=basic.puppet.node,ou=hosts,dc=wikimedia,dc=org objectClass: domainrelatedobject objectClass: dnsdomain objectClass: domain objectClass: puppetclient objectClass: dcobj... 
[22:54:26] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:54:28] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:54:31] 06Labs, 07LDAP: Document LDAP structure unambiguously - https://phabricator.wikimedia.org/T138151#2593231 (10AlexMonk-WMF) [22:54:38] PROBLEM - Puppet run on tools-worker-1021 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:54:41] 06Labs, 07LDAP: Purge stale data from LDAP - https://phabricator.wikimedia.org/T138150#2593232 (10AlexMonk-WMF) [22:54:48] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:54:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:54:54] PROBLEM - Puppet run on tools-worker-1006 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:54:58] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:55:12] PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:55:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:55:24] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:55:38] PROBLEM - Puppet run on tools-web-static-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:55:50] PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:55:52] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:56:00] PROBLEM - Puppet run on tools-checker-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:56:38] PROBLEM - Puppet run on tools-exec-1218 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [22:56:41] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:56:59] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:57:03] PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [22:57:17] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:57:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:58:04] PROBLEM - Puppet run on tools-static-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [22:58:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:58:55] PROBLEM - Puppet run on tools-worker-1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:58:59] PROBLEM - Puppet run on tools-exec-1219 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:59:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [22:59:15] PROBLEM - Puppet run on tools-worker-1004 is CRITICAL: CRITICAL: 
33.33% of data above the critical threshold [0.0] [22:59:23] PROBLEM - Puppet run on tools-worker-1022 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [22:59:27] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:00:03] PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:00:15] PROBLEM - Puppet run on tools-worker-1019 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:00:17] PROBLEM - Puppet run on tools-exec-1203 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:00:31] PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:00:39] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:00:45] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:00:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:01:13] PROBLEM - Puppet run on tools-worker-1015 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:01:19] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:01:39] PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:02:12] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [23:03:36] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:04:08] PROBLEM - Puppet run on tools-worker-1025 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [23:04:11] There seems to be puppet failures on tools ^^ [23:04:16] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:04:26] PROBLEM - Puppet run on tools-worker-1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:05:06] ssh to tools isn't working, I guess it's related [23:05:25] chasemp andrewbogott Krenair ^^ [23:05:27] Web also fails [23:05:34] Tools website is also down [23:05:37] i carnt access it [23:05:53] yuvipanda: about? 
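When ssh and the web proxy fail across tools at the same time, as above, shared NFS is the usual first suspect; a couple of generic client-side probes (the hostname and port are assumptions drawn from the later conversation, not commands that were actually run):
```
# Is the NFS port reachable at all from a client instance?
nc -z -w 5 labstore.svc.eqiad.wmnet 2049 && echo "nfs port reachable" || echo "nfs port unreachable/filtered"

# Does the shared filesystem still answer, or does it hang?
timeout 5 df -h /data/project || echo "NFS mount not responding"
```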
[23:06:34] PROBLEM - Puppet run on tools-static-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:06:37] chasemp i think he went out for a bit [23:06:49] anyway, I gotta step out for a bit, I'll brb [23:09:26] !log restart nginx on tools-proxy-01 [23:10:18] PROBLEM - Puppet run on tools-k8s-master-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:10:32] PROBLEM - Puppet run on tools-worker-1007 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:10:38] hey chasemp [23:10:38] PROBLEM - Puppet run on tools-worker-1013 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:10:59] looking [23:11:15] yuvipanda: I'm pretty confused atm on what's up [23:11:37] It seems a total failure [23:12:32] PROBLEM - Puppet run on tools-worker-1023 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:13:30] PROBLEM - Puppet run on tools-worker-1002 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:13:36] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:13:46] !log tools bumped up worker_connections, restarted nginx [23:14:02] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:14:33] chasemp hmm, I'm confused as well [23:15:06] it seems like tools only [23:15:13] everything is responding with 499 [23:15:17] which is a special nginx error code [23:15:21] what's everything? [23:15:21] ok [23:15:34] ugh, need to fix my permissions first I think [23:15:36] PROBLEM - Puppet run on tools-worker-1009 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:15:41] > HTTP 499 in Nginx means that the client closed the connection before the server answered the request. In my experience is usually caused by client side timeout. As I know it's an Nginx specific error code.Oct 19, 2012 [23:15:42] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [23:15:43] tools-proxy-01 is a Tool labs generic web proxy (role::labs::tools::proxy) [23:15:45] chasemp access.log [23:15:46] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [23:15:47] The last Puppet run was at Mon Aug 29 22:53:41 UTC 2016 (21 minutes ago). Puppet is disabled. reason not specified [23:15:52] /var/log/nginx/access.log [23:15:55] krenair that's me [23:16:10] not sure why everything is 499 :| [23:16:44] since error.log is fine [23:17:16] shall we move to the other proxy server? [23:17:30] PROBLEM - Puppet run on tools-worker-1017 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [23:17:36] 499... would be an upstream proxy problem maybe? [23:17:36] yeah, failing over now [23:17:56] I'm having trouble getting on tools-bastion-03 as well [23:17:57] that's the status that means that the client closed the socket prematurely [23:17:57] but unsure if related [23:18:27] my open shell on bastion-02 is dead [23:18:27] Same [23:18:31] `curl -vvv -H "Host: tools.wmflabs.org" http://instance-tools-proxy-02.tools.wmflabs.org.` works. it fails if I change it to proxy-01 [23:18:52] maybe it's a network issue? [23:19:06] that, uh, hmm [23:19:09] then how can I ssh? [23:19:29] it's something local to tools it seems like [23:19:31] ssh tools-proxy-02.tools.eqiad.wmflabs ?
[23:19:34] I just got ssh to tools-elastic-01.tools to work, so not universal if its networking [23:19:47] wtf [23:19:58] assuming you have *.wmflabs going via the normal labs bastions, not login.tools.wmflabs.org [23:20:00] curl localhost on tools-proxy-02 is failing too [23:20:02] or is just 'hung' [23:21:00] it might not be bound to localhost? [23:21:00] curl tools-proxy-02 on tools-proxy-02 [23:21:07] it has to be an nfs thing? or a storage thing [23:21:12] load on tools-bastion-03 is like 90 [23:21:16] hmm [23:21:20] ssh to tools-exec-1410.tools is acting like the packets are blackholed somewhere [23:21:23] madhuvishy: are you about? doing anything for nfs? [23:21:24] no that doesn't change anything [23:21:36] curl on localhost to tools-proxy-01 and -02 produces 499 too [23:21:45] which also feels like something network related, in one way [23:21:56] not doing failover, since that'll have no effect I think [23:22:19] packets are coming in, and then somehow it's 499, so the connection isn't being established [23:22:23] I'm going to reboot tools-bastion-03 and I want to see effect on load and availability [23:23:04] chasemp: yeah i'm around - but no not doing anything on nfs [23:24:06] tools-static is dead too [23:24:50] ok soo... [23:24:55] tools-checker is alive-ish [23:25:03] in that I can hit it, but it's 502 [23:25:31] but I cna't actually ssh into it [23:27:00] nope, ssh'd in. just very, very slow [23:27:00] root@tools-exec-1410:~# uptime [23:27:01] 23:25:57 up 159 days, 1:16, 1 user, load average: 49.90, 58.44, 42.78 [23:27:01] ssh taking forever on tools-checker [23:27:01] let me check if things without NFS have [23:27:42] ok, tools-puppetmaster-01 ssh happened fine [23:28:13] I noticed outbound for NFS on labstoer1001 was doing almost nothing [23:28:19] so I restarted nfs-kernel-server [23:28:28] along w/ load super high on spot checked execs and the bastion [23:28:30] Traceback (most recent call last): [23:28:30] File "/usr/local/bin/log-command-invocation", line 25, in [23:28:31] import argparse [23:28:31] File "/usr/lib/python2.7/argparse.py", line 92, in [23:28:31] from gettext import gettext as _ [23:28:33] and I restarted on labstore1003 for fun [23:28:33] File "/usr/lib/python2.7/gettext.py", line 49, in [23:28:35] import locale, copy, os, re, struct, sys [23:28:38] File "/usr/lib/python2.7/locale.py", line 1471, in [23:28:40] 'ru_ua.koi8u': 'ru_UA.KOI8-U', [23:28:43] MemoryError [23:28:45] I had this output error [23:29:02] PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [23:29:06] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.027 second response time [23:29:14] I really think that did it [23:29:41] yeah [23:29:44] weird part is [23:29:50] it wasn't in any errant state on either side [23:29:53] that I can tell [23:30:05] Is pinging to labstore.svc supposed to show 'Packet Filtered'? [23:30:11] yup [23:30:12] yeah probably [23:30:16] proxy works too [23:30:18] wtf [23:30:23] tcping labstore.svc 2049 [23:30:35] proxy doesn't even have nfs [23:30:55] no but it tries to hit enough boxes that are hung / frozen that do [23:31:11] idk yet [23:31:22] we can't get off of this nfs setup fast enough [23:31:33] yeah but those usually show up as 5xxx errors in error.log [23:31:37] than 499s [23:31:42] yeah [23:31:50] most bizzare outage [23:31:54] I have no explanation for that particular bit yeah [23:31:59] I'm still puzzling on it [23:32:15] network saturation? 
[23:32:15] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[23:32:31] also now that fire is out i am going to try to get off the sidewalk
[23:32:31] the few things I know point to NFS
[23:32:35] ok
[23:32:41] i'll brb
[23:36:41] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:15] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:51] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[23:37:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:38:03] !log tools added myself to the tools.admin service group earlier to try to figure out what was causing the outage, removed again now
[23:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[23:38:23] RECOVERY - Puppet run on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [0.0]
[23:38:58] so tools/nfs was out for ~15 minutes
[23:39:13] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:17] RECOVERY - Puppet run on tools-worker-1004 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:23] RECOVERY - Puppet run on tools-worker-1022 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:24] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:25] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:36] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:39:38] Problem here
[23:39:44] My webservice isn't restarting
[23:39:51] Nor starting
[23:40:10] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:40:16] RECOVERY - Puppet run on tools-worker-1019 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:40:18] RECOVERY - Puppet run on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:40:32] RECOVERY - Puppet run on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:19] RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:19] RECOVERY - Puppet run on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:19] RECOVERY - Puppet run on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:20] RECOVERY - Puppet run on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:41:20] RECOVERY - Puppet run on tools-worker-1015 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:44] RECOVERY - Puppet run on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:45] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:45] RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:46] Tool Labs is down again
[23:42:46] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:46] RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:47] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:50] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:42:59] abian, what problem are you seeing?
[23:43:13] Now it works
[23:43:13] Now up xD
[23:43:16] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:20] :)
[23:43:31] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:37] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:39] RECOVERY - Puppet run on tools-worker-1003 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:41] I check out , which is a simple HTML page
[23:43:57] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:43:57] RECOVERY - Puppet run on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:07] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:07] RECOVERY - Puppet run on tools-worker-1025 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:11] RECOVERY - Puppet run on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:27] RECOVERY - Puppet run on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:27] RECOVERY - Puppet run on tools-worker-1014 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:44:49] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:01] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:03] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:11] RECOVERY - Puppet run on tools-grid-master is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:21] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:25] RECOVERY - Puppet run on tools-cron-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:25] RECOVERY - Puppet run on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:37] RECOVERY - Puppet run on tools-worker-1013 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:39] RECOVERY - Puppet run on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:51] RECOVERY - Puppet run on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:45:51] RECOVERY - Puppet run on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:46:17] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:46:37] RECOVERY - Puppet run on tools-static-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:46:39] RECOVERY - Puppet run on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:46:59] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
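The webservice complaint above is normally handled with the tool's own self-service commands once NFS is back. A minimal sketch follows, assuming a placeholder tool name of "mytool"; the exact webservice type and options vary per tool:

    # On a Tool Labs bastion, switch to the tool account first.
    become mytool

    # Check the current state, then bounce the web service.
    webservice status
    webservice restart

    # If the grid is still catching up, look at the tool's grid engine jobs as well.
    qstat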
[23:47:16] RECOVERY - Puppet run on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [0.0]
[23:47:34] RECOVERY - Puppet run on tools-worker-1023 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:47:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[23:47:42] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:48:14] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:48:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:48:30] RECOVERY - Puppet run on tools-worker-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:48:50] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:48:52] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[23:48:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:49:04] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:50:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:50:12] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:50:14] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:50:16] RECOVERY - Puppet run on tools-k8s-master-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:50:24] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:50:36] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[23:50:44] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:50:56] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:51:06] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[23:51:13] PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:51:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:51:23] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[23:51:51] PROBLEM - Puppet run on tools-exec-1208 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[23:52:01] PROBLEM - Puppet run on tools-checker-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:52:03] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[23:52:41] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[23:54:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:54:50] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[23:54:50] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[23:55:47] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[23:57:15] PROBLEM - SSH on tools-exec-1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:57:59] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[23:58:18] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
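The tail of alerts above (SSH socket timeouts, failing Puppet runs, and the load averages in the 50-90 range seen earlier in the log) is the usual signature of clients stuck in uninterruptible sleep on an unresponsive NFS mount. A quick way to confirm that from an affected instance is sketched below; /data/project is the usual Tool Labs shared mount, and the rest of the commands are illustrative rather than taken from the log:

    # Processes in state D (uninterruptible sleep) are the ones wedged on NFS I/O;
    # their count tracks the inflated load average almost exactly.
    ps -eo state,pid,wchan:30,comm | awk '$1 ~ /^D/'
    ps -eo state | grep -c '^D'

    # Probe the mount with a timeout so the checking shell does not wedge too.
    timeout 5 stat /data/project || echo "NFS mount is not responding"

    # Client-side NFS RPC counters and the mounts themselves.
    nfsstat -c
    grep nfs /proc/mounts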