[01:10:04] PROBLEM - Puppet run on tools-exec-1421 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[01:45:04] RECOVERY - Puppet run on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:27:51] Labs, Tool-Labs: Python environment weirdness on labs - https://phabricator.wikimedia.org/T161915#3147928 (mahmoud) Ah, yes, I expected that the issue was related to the Ubuntu upgrade, I just didn't expect it to manifest as `datetime` being unimportable. I seem to recall the email saying a service resta...
[10:19:42] PROBLEM - Puppet run on tools-exec-gift-trusty-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[10:54:42] RECOVERY - Puppet run on tools-exec-gift-trusty-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:07:56] PROBLEM - SSH on tools-worker-1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:08:44] PROBLEM - SSH on tools-prometheus-02 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:08:48] PROBLEM - SSH on tools-grid-master is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:09:12] PROBLEM - High iowait on tools-grid-master is CRITICAL: CRITICAL: tools.tools-grid-master.cpu.total.iowait (>14.29%)
[13:09:25] PROBLEM - Host tools-worker-1009 is DOWN: PING CRITICAL - Packet loss = 100%
[13:09:42] PROBLEM - SSH on tools-exec-1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:09:50] PROBLEM - SSH on tools-exec-1417 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:10:08] PROBLEM - SSH on tools-worker-1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:12:48] RECOVERY - SSH on tools-worker-1022 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[13:13:36] RECOVERY - SSH on tools-prometheus-02 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[13:13:38] RECOVERY - SSH on tools-grid-master is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[13:14:32] RECOVERY - SSH on tools-exec-1412 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[13:14:42] RECOVERY - SSH on tools-exec-1417 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[13:14:56] RECOVERY - SSH on tools-worker-1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[13:21:20] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0]
[13:24:12] RECOVERY - High iowait on tools-grid-master is OK: OK: All targets OK
[13:31:41] !log tools reboot tools-exec-1420
[13:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:40:48] !log tools restart nscd and nslcd on tools-grid-master
[13:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:51:17] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:52:44] !log tools tools-grid-master tc-setup clean
[13:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:00:23] Labs, Tool-Labs: Tool Labs grid foobar - https://phabricator.wikimedia.org/T161950#3148193 (Magnus)
[14:00:56] !log tools disable puppet on tools-grid-master
[14:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:26:21] !log tools raise nfs thresholds https://gerrit.wikimedia.org/r/#/c/345975/
[14:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
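The nscd/nslcd restart logged at 13:40 is likely aimed at wedged name-service lookups, which would explain the cascade of SSH socket timeouts above. A minimal sketch of those steps, assuming upstart/sysvinit-style service management on the Trusty grid master:

    # Restart the name-service cache daemon and the LDAP name-service
    # daemon so stale or hung lookups are dropped and reopened.
    sudo service nscd restart
    sudo service nslcd restart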
[14:36:01] PROBLEM - Puppet run on tools-exec-1421 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:36:07] PROBLEM - Puppet run on tools-exec-1407 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[14:36:19] PROBLEM - Puppet run on tools-exec-1412 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:37:37] PROBLEM - Puppet run on tools-exec-1406 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:37:43] PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:38:00] PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:38:09] PROBLEM - Puppet run on tools-exec-1413 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:38:13] PROBLEM - Puppet run on tools-exec-1420 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:38:40] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:38:50] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[14:39:02] PROBLEM - Puppet run on tools-exec-1415 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:39:28] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:39:29] ^ andrewbogott real?
[14:39:40] PROBLEM - Puppet run on tools-exec-1416 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[14:39:43] I'm looking; I'm not sure what it's about
[14:39:44] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:40:08] PROBLEM - Puppet run on tools-exec-1401 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[14:40:20] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:40:27] andrewbogott: I aborted a first-try puppet run to reduce fanout; maybe the canceled runs are reporting failure
[14:40:33] but spot-checking
[14:40:43] that's probably it, things look fine to me
[14:44:55] Labs, Tool-Labs: Tool Labs grid foobar - https://phabricator.wikimedia.org/T161950#3148208 (chasemp) p: Triage>High thanks @magnus, @andrew and I are actively fighting something overwhelming the grid. We are still not sure what the deal is but had to restart the master process. https://wikitech....
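The spot-checking mentioned at 14:40 would mean forcing a fresh agent run on one of the alerting nodes and reading the output directly; a clean run confirms the CRITICALs were just the aborted runs reporting failure. A minimal sketch, with the hostname merely an example taken from the alerts above:

    # Trigger an immediate, verbose, one-shot puppet run on one node.
    ssh tools-exec-1401.eqiad.wmflabs 'sudo puppet agent --test'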
[14:45:07] RECOVERY - Puppet run on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:45:21] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:46:05] RECOVERY - Puppet run on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:46:07] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:46:20] RECOVERY - Puppet run on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:47:37] RECOVERY - Puppet run on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:47:43] RECOVERY - Puppet run on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:47:59] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:48:10] RECOVERY - Puppet run on tools-exec-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:48:14] RECOVERY - Puppet run on tools-exec-1420 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:48:38] RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:49:02] RECOVERY - Puppet run on tools-exec-1415 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:49:28] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:49:40] RECOVERY - Puppet run on tools-exec-1416 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:49:42] RECOVERY - Puppet run on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:50:36] Cyberpower678: it looks like you are running over 60 jobs across the tools exec nodes for iabot - the grid is overwhelmed right now; can you scale that back to a smaller number?
[14:52:57] for context, that's 61 jobs for iabot on execs and 183 for /everything/ else
[14:54:07] Cyberpower678: if we can't get in touch we'll have to reduce load on the grid, fyi
[15:09:45] Labs, Tool-Labs, InternetArchiveBot: tools.iabot is overloading the grid by running too many workers in parallel - https://phabricator.wikimedia.org/T161951#3148214 (madhuvishy)
[15:13:49] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:27:08] Labs, Tool-Labs, InternetArchiveBot: tools.iabot is overloading the grid by running too many workers in parallel - https://phabricator.wikimedia.org/T161951#3148257 (chasemp) I think the best approach is some kind of locking mechanism to prevent new workers from starting if an existing one of the sam...
[15:27:27] madhuvishy: by the time I killed it there were 63 running, so it was escalating quickly
[15:28:39] !log tools added five new exec nodes, tools-exec-1425 through 1429
[15:28:40] chasemp: right
[15:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:29:06] Wouldn't a better way be to not start new ones at all, and just have enough running continuously?
[15:29:50] yeryry: that's the rub - they are not running continuously, they are written as runnable crons every minute
[15:30:15] as soon as a few start to drag out it escalates
[15:30:21] ok, I have to hop off madhuvishy, but I have my stuff
[15:30:24] Labs, Tool-Labs, InternetArchiveBot: tools.iabot is overloading the grid by running too many workers in parallel - https://phabricator.wikimedia.org/T161951#3148262 (Cyberpower678) I should note the once flag is set so it shouldn't be submitting more if the worker is already running. This sounds lik...
[15:30:29] Yeah, that's what I meant. Do away with that setup.
[15:31:02] chasemp: alright! cya
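chasemp's T161951 comment proposes a locking mechanism so the every-minute cron cannot pile new workers onto an already dragging grid. A minimal sketch of that idea as a cron wrapper using flock(1); the lock path and worker command are hypothetical placeholders:

    #!/bin/bash
    # Skip this cron cycle entirely if the previous worker still holds
    # the lock, instead of submitting yet another grid job.
    LOCK="$HOME/iabot-worker.lock"
    exec 9>"$LOCK"
    if ! flock -n 9; then
        echo "previous worker still running; skipping this cycle" >&2
        exit 0
    fi
    ./run-worker.sh   # the real worker runs while fd 9 holds the lock

jsub's -once flag is meant to give the same guarantee at submit time, which is why the behavior madhuvishy describes below looks like a bug rather than expected operation.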
[16:18:44] Labs, Tool-Labs, InternetArchiveBot: tools.iabot is overloading the grid by running too many workers in parallel - https://phabricator.wikimedia.org/T161951#3148343 (madhuvishy) Hmmm, it does look like jsub -once should not start more workers with the same name; there may be some funkiness going on there,...
[16:28:31] kamsuri
[17:32:26] (CR) Jean-Frédéric: Extract method normalize_identifier and add unit tests (2 comments) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008 (owner: Jean-Frédéric)
[17:39:23] (CR) Jean-Frédéric: "> maybe also add a test for stripping underscore" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008 (owner: Jean-Frédéric)
[17:39:59] (PS3) Jean-Frédéric: Add unittest to populate_image_table.processSource [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338007
[17:40:01] (PS3) Jean-Frédéric: Extract method normalize_identifier and add unit tests [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008
[17:40:03] (PS3) Jean-Frédéric: Track number of tracked images (on top of found images) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338009
[17:51:52] PROBLEM - High iowait on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: tools.tools-webgrid-lighttpd-1415.cpu.total.iowait (>20.00%)
[18:01:51] RECOVERY - High iowait on tools-webgrid-lighttpd-1415 is OK: OK: All targets OK
[18:03:30] (PS4) Jean-Frédéric: Extract method normalize_identifier and add unit tests [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008
[18:03:32] (PS4) Jean-Frédéric: Track number of tracked images (on top of found images) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338009
[18:04:39] (CR) Jean-Frédéric: "> Would it be possible to add a test where tracked!=found?" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338009 (owner: Jean-Frédéric)
[18:06:30] (CR) Jean-Frédéric: "Added a test for what happens when an Exception is raised. That caught a pretty awful bug!" (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/338008 (owner: Jean-Frédéric)
[19:09:54] Labs, Labs-Vagrant, MediaWiki-Vagrant: labs-vagrant vagrant up fails with 404 at Dropbox - https://phabricator.wikimedia.org/T161891#3148513 (Nemo_bis) Is there some workaround in the meanwhile?
[20:40:20] chasemp: ping
[20:40:26] madhuvishy: ping
[21:07:50] Labs, Labs-Vagrant, MediaWiki-Vagrant: labs-vagrant vagrant up fails with 404 at Dropbox - https://phabricator.wikimedia.org/T161891#3148595 (Reedy) >>! In T161891#3148513, @Nemo_bis wrote: > Is there some workaround in the meanwhile? Should be able to take a copy of the box image from someone else...
[21:08:47] Nikerabbit: do you have a copy from your course machines?
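Reedy's suggested workaround for T161891 amounts to importing a box image copied from another machine rather than downloading from the dead Dropbox link. A minimal sketch with the standard Vagrant CLI; the box name and file path are hypothetical placeholders:

    # Register a locally copied box file under a name, then bring the
    # machine up against it instead of fetching the 404ing URL.
    vagrant box add --name labs-vagrant-trusty /path/to/copied/image.box
    vagrant up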