[01:10:04] PROBLEM - Puppet run on tools-exec-1421 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[01:45:04] RECOVERY - Puppet run on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:27:51] Labs, Tool-Labs: Python environment weirdness on labs - https://phabricator.wikimedia.org/T161915#3147928 (mahmoud) Ah, yes, I expected that the issue was related to the Ubuntu upgrade, I just didn't expect it to manifest as `datetime` being unimportable. I seem to recall the email saying a service resta...
[10:19:42] PROBLEM - Puppet run on tools-exec-gift-trusty-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[10:54:42] RECOVERY - Puppet run on tools-exec-gift-trusty-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:07:56] PROBLEM - SSH on tools-worker-1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:08:44] PROBLEM - SSH on tools-prometheus-02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:08:48] PROBLEM - SSH on tools-grid-master is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:09:12] PROBLEM - High iowait on tools-grid-master is CRITICAL: CRITICAL: tools.tools-grid-master.cpu.total.iowait (>14.29%)
[13:09:25] PROBLEM - Host tools-worker-1009 is DOWN: PING CRITICAL - Packet loss = 100%
[13:09:42] PROBLEM - SSH on tools-exec-1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:09:50] PROBLEM - SSH on tools-exec-1417 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:10:08] PROBLEM - SSH on tools-worker-1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:12:48] RECOVERY - SSH on tools-worker-1022 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[13:13:36] RECOVERY - SSH on tools-prometheus-02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[13:13:38] RECOVERY - SSH on tools-grid-master is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[13:14:32] RECOVERY - SSH on tools-exec-1412 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[13:14:42] RECOVERY - SSH on tools-exec-1417 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[13:14:56] RECOVERY - SSH on tools-worker-1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[13:21:20] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0]
[13:24:12] RECOVERY - High iowait on tools-grid-master is OK: OK: All targets OK
[13:31:41] !log tools reboot tools-exec-1420
[13:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:40:48] !log tools restart nscd and nslcd on tools-grid-master
[13:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:51:17] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:52:44] !log tools tools-grid-master tc-setup clean
[13:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:00:23] Labs, Tool-Labs: Tool Labs grid foobar - https://phabricator.wikimedia.org/T161950#3148193 (Magnus)
[14:00:56] !log tools disable puppet on tools-grid-master
[14:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:26:21] !log tools raise nfs thresholds https://gerrit.wikimedia.org/r/#/c/345975/
[14:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
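The nscd/nslcd restart logged at 13:40 is likely aimed at wedged name-service lookups, which would explain the cascade of SSH socket timeouts above. A minimal sketch of those steps, assuming upstart/sysvinit-style service management on the Trusty grid master:

    # Restart the name-service cache daemon and the LDAP name-service
    # daemon so stale or hung lookups are dropped and reopened.
    sudo service nscd restart
    sudo service nslcd restart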
[14:36:01] PROBLEM - Puppet run on tools-exec-1421 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:36:07] PROBLEM - Puppet run on tools-exec-1407 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[14:36:19] PROBLEM - Puppet run on tools-exec-1412 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:37:37] PROBLEM - Puppet run on tools-exec-1406 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:37:43] PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:38:00] PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:38:09] PROBLEM - Puppet run on tools-exec-1413 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:38:13] PROBLEM - Puppet run on tools-exec-1420 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:38:40] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:38:50] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[14:39:02] PROBLEM - Puppet run on tools-exec-1415 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:39:28] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:39:29] ^ andrewbogott real?
[14:39:40] PROBLEM - Puppet run on tools-exec-1416 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:39:43] I'm looking; I'm not sure what it's about
[14:39:44] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:40:08] PROBLEM - Puppet run on tools-exec-1401 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[14:40:20] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:40:27] andrewbogott: I aborted a first-try puppet run to reduce fanout; maybe the canceled runs are reporting failure
[14:40:33] but spot-checking
[14:40:43] that's probably it, things look fine to me
[14:44:55] Labs, Tool-Labs: Tool Labs grid foobar - https://phabricator.wikimedia.org/T161950#3148208 (chasemp) p: Triage>High thanks @magnus, @andrew and I are actively fighting something overwhelming the grid. We are still not sure what the deal is but had to restart the master process. https://wikitech....
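The spot-checking mentioned at 14:40 would mean forcing a fresh agent run on one of the alerting nodes and reading the output directly; a clean run confirms the CRITICALs were just the aborted runs reporting failure. A minimal sketch, with the hostname merely an example taken from the alerts above:

    # Trigger an immediate, verbose, one-shot puppet run on one node.
    ssh tools-exec-1401.eqiad.wmflabs 'sudo puppet agent --test'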
[14:45:07] RECOVERY - Puppet run on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:45:21] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:46:05] RECOVERY - Puppet run on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:46:07] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:46:20] RECOVERY - Puppet run on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:47:37] RECOVERY - Puppet run on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:47:43] RECOVERY - Puppet run on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:47:59] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:48:10] RECOVERY - Puppet run on tools-exec-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:48:14] RECOVERY - Puppet run on tools-exec-1420 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:48:38] RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:49:02] RECOVERY - Puppet run on tools-exec-1415 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:49:28] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:49:40] RECOVERY - Puppet run on tools-exec-1416 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:49:42] RECOVERY - Puppet run on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:50:36] Cyberpower678: it looks like you are running over 60 jobs across the tools exec nodes for iabot - the grid is overwhelmed right now; can you scale that back to a smaller number?
[14:52:57] for context, that's 61 jobs for iabot on execs and 183 for /everything/ else
[14:54:07] Cyberpower678: if we can't get in touch we'll have to reduce load on the grid, fyi
[15:09:45] Labs, Tool-Labs, InternetArchiveBot: tools.iabot is overloading the grid by running too many workers in parallel - https://phabricator.wikimedia.org/T161951#3148214 (madhuvishy)
[15:13:49] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:27:08] Labs, Tool-Labs, InternetArchiveBot: tools.iabot is overloading the grid by running too many workers in parallel - https://phabricator.wikimedia.org/T161951#3148257 (chasemp) I think the best approach is some kind of locking mechanism to prevent new workers from starting if an existing one of the sam...
[15:27:27] madhuvishy: by the time I killed it there were 63 running, so it was escalating quickly
[15:28:39] !log tools added five new exec nodes, tools-exec-1425 through 1429
[15:28:40] chasemp: right
[15:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:29:06] Wouldn't a better way be to not start new ones at all, and just have enough running continuously?
[15:29:50] yeryry: that's the rub - they are not running continuously, they are written as runnable crons every minute
[15:30:15] as soon as a few start to drag out it escalates
[15:30:21] ok, I have to hop off madhuvishy, but I have my stuff
[15:30:24] Labs, Tool-Labs, InternetArchiveBot: tools.iabot is overloading the grid by running too many workers in parallel - https://phabricator.wikimedia.org/T161951#3148262 (Cyberpower678) I should note the once flag is set so it shouldn't be submitting more if the worker is already running. This sounds lik...
[15:30:29] Yeah, that's what I meant. Do away with that setup.
[15:31:02] chasemp: alright! cya
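chasemp's T161951 comment proposes a locking mechanism so the every-minute cron cannot pile new workers onto an already dragging grid. A minimal sketch of that idea as a cron wrapper using flock(1); the lock path and worker command are hypothetical placeholders:

    #!/bin/bash
    # Skip this cron cycle entirely if the previous worker still holds
    # the lock, instead of submitting yet another grid job.
    LOCK="$HOME/iabot-worker.lock"
    exec 9>"$LOCK"
    if ! flock -n 9; then
        echo "previous worker still running; skipping this cycle" >&2
        exit 0
    fi
    ./run-worker.sh   # the real worker runs while fd 9 holds the lock

jsub's -once flag is meant to give the same guarantee at submit time, which is why the behavior madhuvishy describes below looks like a bug rather than expected operation.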
[16:18:44] Labs, Tool-Labs, InternetArchiveBot: tools.iabot is overloading the grid by running too many workers in parallel - https://phabricator.wikimedia.org/T161951#3148343 (madhuvishy) Hmmm, it does look like jsub -once should not start more workers with the same name; there may be some funkiness going on there,...
[16:28:31] kamsuri
[17:32:26] (CR) Jean-Frédéric: Extract method normalize_identifier and add unit tests (2 comments) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008 (owner: Jean-Frédéric)
[17:39:23] (CR) Jean-Frédéric: "> maybe also add a test for stripping underscore" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008 (owner: Jean-Frédéric)
[17:39:59] (PS3) Jean-Frédéric: Add unittest to populate_image_table.processSource [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338007
[17:40:01] (PS3) Jean-Frédéric: Extract method normalize_identifier and add unit tests [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008
[17:40:03] (PS3) Jean-Frédéric: Track number of tracked images (on top of found images) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338009
[17:51:52] PROBLEM - High iowait on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: tools.tools-webgrid-lighttpd-1415.cpu.total.iowait (>20.00%)
[18:01:51] RECOVERY - High iowait on tools-webgrid-lighttpd-1415 is OK: OK: All targets OK
[18:03:30] (PS4) Jean-Frédéric: Extract method normalize_identifier and add unit tests [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008
[18:03:32] (PS4) Jean-Frédéric: Track number of tracked images (on top of found images) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338009
[18:04:39] (CR) Jean-Frédéric: "> Would it be possible to add a test where tracked!=found?" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338009 (owner: Jean-Frédéric)
[18:06:30] (CR) Jean-Frédéric: "Added a test for what happens when an Exception is raised. That caught a pretty awful bug!" (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008 (owner: Jean-Frédéric)
[19:09:54] Labs, Labs-Vagrant, MediaWiki-Vagrant: labs-vagrant vagrant up fails with 404 at Dropbox - https://phabricator.wikimedia.org/T161891#3148513 (Nemo_bis) Is there some workaround in the meanwhile?
[20:40:20] chasemp: ping
[20:40:26] madhuvishy: ping
[21:07:50] Labs, Labs-Vagrant, MediaWiki-Vagrant: labs-vagrant vagrant up fails with 404 at Dropbox - https://phabricator.wikimedia.org/T161891#3148595 (Reedy) >>! In T161891#3148513, @Nemo_bis wrote: > Is there some workaround in the meanwhile? Should be able to take a copy of the box image from someone else...
[21:08:47] Nikerabbit: do you have a copy from your course machines?
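Reedy's suggested workaround for T161891 amounts to importing a box image copied from another machine rather than downloading from the dead Dropbox link. A minimal sketch with the standard Vagrant CLI; the box name and file path are hypothetical placeholders:

    # Register a locally copied box file under a name, then bring the
    # machine up against it instead of fetching the 404ing URL.
    vagrant box add --name labs-vagrant-trusty /path/to/copied/image.box
    vagrant up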