[00:06:14] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:27:21] PROBLEM - Puppet failure on tools-exec-09 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [00:57:23] RECOVERY - Puppet failure on tools-exec-09 is OK: OK: Less than 1.00% above the threshold [0.0] [01:59:36] PROBLEM - Puppet failure on tools-webproxy-jessie is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [02:04:48] PROBLEM - Puppet failure on tools-redis is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [02:34:45] RECOVERY - Puppet failure on tools-redis is OK: OK: Less than 1.00% above the threshold [0.0] [03:45:02] RECOVERY - Puppet failure on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [04:02:50] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [04:27:50] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0] [04:48:17] PROBLEM - Puppet failure on tools-exec-09 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [05:03:53] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [05:13:16] RECOVERY - Puppet failure on tools-exec-09 is OK: OK: Less than 1.00% above the threshold [0.0] [05:28:48] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0] [07:21:43] Tool-Labs: Open Grid Engine Job dumps core (node) - https://phabricator.wikimedia.org/T86905#993583 (scfc) Usually, when something works interactively, but not as a grid job, this is due to the job running out of memory as the default limit (256 MByte) is often insufficient and some programs have a very confu... [07:34:43] Tool-Labs: tools-trusty uses local crontab instead of remote (tools-cron?) 
crontab - https://phabricator.wikimedia.org/T86445#993588 (scfc) a:scfc Puppet installs `/usr/local/bin/xcrontab`. There are apparently manually installed symlinks on `tools-dev` and `tools-login` from `/usr/local/bin/crontab`. I'... [07:34:54] Tool-Labs: tools-trusty uses local crontab instead of remote (tools-cron?) crontab - https://phabricator.wikimedia.org/T86445#993590 (scfc) p:Triage>Normal [08:38:14] Tool-Labs: Tools crontab replacement must check whether run as root - https://phabricator.wikimedia.org/T87527#993597 (scfc) NEW a:scfc [08:38:35] Tool-Labs: tools-trusty uses local crontab instead of remote (tools-cron?) crontab - https://phabricator.wikimedia.org/T86445#993606 (scfc) [08:38:37] Tool-Labs: Tools crontab replacement must check whether run as root - https://phabricator.wikimedia.org/T87527#993605 (scfc) [08:39:05] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [09:04:06] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [09:29:19] PROBLEM - Puppet failure on tools-exec-03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [09:58:47] PROBLEM - Puppet failure on tools-uwsgi-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [09:59:15] RECOVERY - Puppet failure on tools-exec-03 is OK: OK: Less than 1.00% above the threshold [0.0] [10:02:33] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [10:22:32] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0] [10:23:57] RECOVERY - Puppet failure on tools-uwsgi-01 is OK: OK: Less than 1.00% above the threshold [0.0] [10:54:03] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [11:00:01] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 30.00% of data above 
the critical threshold [0.0] [11:05:44] PROBLEM - Puppet failure on tools-exec-05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [11:30:00] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [11:35:42] RECOVERY - Puppet failure on tools-exec-05 is OK: OK: Less than 1.00% above the threshold [0.0] [11:44:02] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [11:55:07] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [12:10:46] PROBLEM - Puppet failure on tools-redis is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [12:25:07] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [12:40:46] RECOVERY - Puppet failure on tools-redis is OK: OK: Less than 1.00% above the threshold [0.0] [12:50:17] PROBLEM - Puppet failure on tools-exec-03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [13:14:27] PROBLEM - Puppet failure on tools-exec-06 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [13:14:48] PROBLEM - Puppet failure on tools-uwsgi-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [13:22:17] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [13:23:33] PROBLEM - Puppet failure on tools-exec-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [13:35:13] RECOVERY - Puppet failure on tools-exec-03 is OK: OK: Less than 1.00% above the threshold [0.0] [13:35:57] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [13:39:26] RECOVERY - Puppet failure on tools-exec-06 is OK: OK: Less than 1.00% above the threshold [0.0] [13:44:49] RECOVERY - Puppet failure on tools-uwsgi-01 is OK: OK: Less than 1.00% above the 
threshold [0.0] [13:47:14] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [13:55:01] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [13:57:05] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [14:05:54] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [14:08:34] RECOVERY - Puppet failure on tools-exec-02 is OK: OK: Less than 1.00% above the threshold [0.0] [14:25:02] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [14:27:06] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [14:32:38] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:34:55] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [14:37:34] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [14:41:43] PROBLEM - Puppet failure on tools-webgrid-04 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [14:49:08] Hi. When I run "python /shared/pywikipedia/core/scripts/commonscat.py" I keep getting an error like this "pywikibot.exceptions.UnknownSite: Language 'mai' does not exist in family wikipedia". Are new wikis not added automatically? 
[14:59:55] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [15:02:37] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [15:05:53] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:07:36] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0] [15:11:42] RECOVERY - Puppet failure on tools-webgrid-04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:18:12] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:30:54] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0] [15:36:56] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:43:15] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:06:54] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [16:08:44] PROBLEM - Puppet failure on tools-exec-05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:39:31] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:41:49] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:42:39] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:57:53] PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:04:34] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0] [17:11:50] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the 
threshold [0.0] [17:12:43] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [17:22:57] RECOVERY - Puppet failure on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [0.0] [17:35:48] PROBLEM - Puppet failure on tools-uwsgi-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [17:38:06] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:43:40] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [18:03:06] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [18:03:10] any known network problem? my bot is unable to connect via api to wmf wiki [18:03:41] yeah, I said that too. [18:04:03] check mailing list too. I said that there too. [18:04:44] lookups fail, connects fail. but randomly. [18:05:09] i have had this problem for about 10 minutes with my bot on labs [18:05:37] haven't you had the problem before? [18:05:46] RECOVERY - Puppet failure on tools-uwsgi-01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:06:37] i only checked the logs of the last hour. during the 50 minutes before it was ok [18:07:24] keep it running in console, you'll be able to see other errors. [18:07:48] on tools-dev: "wget http://de.wikipedia.org/wiki/api.php" -> Connecting to de.wikipedia.org (de.wikipedia.org)|2620:0:861:ed1a::1|:80... failed: Network is unreachable. [18:07:59] yes. [18:08:14] I said it to Vasiliev. [18:09:21] the bot keeps waiting 5 to 120 seconds, I think, before each retry, then gives up. [18:11:27] huh? 
i haven't got any mails about that [18:11:40] but the dns lookup is correct [18:11:56] but i have been experiencing network failures for days now [18:12:36] gifti: https://lists.wikimedia.org/pipermail/labs-l/2015-January/003267.html [18:13:21] ah [18:13:39] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [18:13:50] ok network is working again on tools-dev [18:14:23] the DNS server might be failing? [18:14:25] it will go down after a while, I'm telling you. [18:14:27] I just restarted it [18:14:31] to see if that helps fix it [18:14:44] all of ops have been in meetings and other things like that all of last week, and for the next two days too :( [18:15:48] god [18:19:22] I ran a script just now and I got "ServerNotFoundError: Unable to find the server at mdf.wikipedia.org" [18:21:21] YuviPanda: but the dns response was correct, see my wget response above [18:21:53] Merlissimo: yeah, am unsure what that’s about, but it’s possible that our openstack network service is dying / being DoS'd [18:21:56] or just not robust enough [18:22:16] andrewbogott_afk: would know more about it but this is the worst time for any of this to happen, because ops aren’t being opsy atm. [18:22:21] I’ll poke around? [18:22:31] I did put some measures in place.. [18:22:41] mostly just restarting dnsmasq and raising some kernel params [18:22:47] hopefully that mitigates it for a little bit? [18:25:02] are those dns servers accessible from the outside? do we need protection on those? or can we put labs IPs on wmf projects' whitelist? [18:25:13] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [18:31:51] we’ve separate dns servers for ‘outside’ and ‘inside’, and I’m not sure what you mean by wmf projects’ whitelist [18:31:56] but I’ve to go, sadly :( [18:32:34] I’ve been running ‘dig’ and ‘curl’ in a loop and have been getting no failures atm [18:33:30] don't run the loop on the same wiki. 
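Editorial sketch: the 'dig' and 'curl' loop mentioned above, combined with the advice not to hit the same wiki every time, could look roughly like the Python below. This is an illustration only, not the script anyone in the channel actually ran. The `probe` function, its parameters, and the `HOSTS` list are hypothetical names; the two hostnames are the only ones that appear in this log. It uses the standard `socket.getaddrinfo` call for lookups and counts failures per host.

```python
import itertools
import socket
import time
from collections import Counter

# Assumption: an illustrative host list. The log only names these two wikis.
HOSTS = ["de.wikipedia.org", "mdf.wikipedia.org"]


def probe(hosts, rounds, resolve=socket.getaddrinfo, delay=0.0):
    """Resolve each host in turn for `rounds` passes and count failures.

    `resolve` is injectable so the loop can be exercised without a network.
    """
    failures = Counter()
    # Rotate through the hosts instead of hammering a single wiki.
    for host in itertools.islice(itertools.cycle(hosts), rounds * len(hosts)):
        try:
            resolve(host, 80)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures[host] += 1
            print(f"{time.strftime('%H:%M:%S')} {host}: {exc}")
        time.sleep(delay)
    return failures
```

Calling `probe(HOSTS, rounds=5, delay=1.0)` would make five passes over the list with a one-second pause between lookups and return a per-host failure count. Because the failures reported in the channel were intermittent, a short clean run does not rule the problem out.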
[22:35:44] Tool-Labs: Open Grid Engine Job dumps core (node) - https://phabricator.wikimedia.org/T86905#993932 (edsu) Yes, this seemed to work: jstart -mem 1g -N anon -mem 1g /usr/bin/node anon/anon.js --config /data/project/anon/config.json --verbose honestly, I had no idea that it was using that much memory. Is... [23:53:48] Tool-Labs: Open Grid Engine Job dumps core (node) - https://phabricator.wikimedia.org/T86905#993954 (scfc) In general, that is not a problem. If your jobs were continuously using hundreds of gigabytes, there would need to be a discussion if your tool/bot could be optimized or new hardware resources must be a...