[00:20:03] PROBLEM - Free space - all mounts on tools-exec-09 is CRITICAL: CRITICAL: tools.tools-exec-09.diskspace._var.byte_percentfree.value (<37.50%)
[01:07:26] Tool-Labs: Migrate tools to trusty - https://phabricator.wikimedia.org/T88228#1006562 (yuvipanda) NEW
[02:17:06] !log tools test
[02:17:14] no? hmpf
[02:21:44] labs-morebots: hi?
[02:21:44] I am a logbot running on tools-exec-13.
[02:21:44] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[02:21:45] To log a message, type !log .
[02:54:33] PROBLEM - Free space - all mounts on tools-webgrid-02 is CRITICAL: CRITICAL: tools.tools-webgrid-02.diskspace._var.byte_percentfree.value (<100.00%)
[02:55:28] goddam
[02:55:39] PROBLEM - Free space - all mounts on tools-exec-03 is CRITICAL: CRITICAL: tools.tools-exec-03.diskspace._var.byte_percentfree.value (<55.56%)
[03:01:21] !log tools ran salt -G 'instanceproject:tools' cmd.run 'sudo rm -rf /var/tmp/core' because disks were getting full.
[03:01:26] Logged the message, Master
[03:09:34] RECOVERY - Free space - all mounts on tools-webgrid-02 is OK: OK: All targets OK
[03:10:04] RECOVERY - Free space - all mounts on tools-exec-09 is OK: OK: All targets OK
[03:10:38] RECOVERY - Free space - all mounts on tools-exec-03 is OK: OK: All targets OK
[03:11:42] RECOVERY - Free space - all mounts on tools-exec-06 is OK: OK: All targets OK
[03:12:40] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[03:12:48] RECOVERY - Free space - all mounts on tools-exec-08 is OK: OK: All targets OK
[03:17:47] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[03:20:23] PROBLEM - Puppet failure on tools-exec-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[03:26:18] RECOVERY - Puppet staleness on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [3600.0]
[03:40:51] Wikimedia-Labs-Infrastructure: Internal DNS look-ups fail every once in a while - https://phabricator.wikimedia.org/T72076#1006644 (yuvipanda) The hand tuning has made things fairly stable. Still need to puppetize, though. @akosiaris can you help? Need to set notrack on the DNS ports on a machine without ferm...
[03:41:34] Wikimedia-Labs-Infrastructure: Internal DNS look-ups fail every once in a while - https://phabricator.wikimedia.org/T72076#1006646 (yuvipanda) p:Unbreak!>Normal
[03:42:39] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0]
[03:45:25] RECOVERY - Puppet failure on tools-exec-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:57:45] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0]
[04:09:58] !log tools widar moved to trusty
[04:10:03] Logged the message, Master
[04:24:02] Labs: Puppetize & fix tools-db - https://phabricator.wikimedia.org/T88234#1006649 (yuvipanda) NEW a:coren
[04:52:40] !log tools migrating all of magnus’ tools, after consultation with him (https://etherpad.wikimedia.org/p/tools-trusty-move for status)
[04:52:41] !ping
[04:52:41] !pong
[04:52:43] Logged the message, Master
[05:01:34] Labs: Create butterfly project for testing - https://phabricator.wikimedia.org/T88235#1006657 (MZMcBride) NEW
[05:03:34] Labs: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1006667 (yuvipanda)
[05:03:35] Labs: Create butterfly project for testing - https://phabricator.wikimedia.org/T88235#1006664 (yuvipanda) Open>Resolved a:yuvipanda Done
[05:30:14] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[05:37:40] !log tools added tools-webgrid-06 as trusty webnode, operational now
[05:37:43] Logged the message, Master
[05:40:07] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:47:02] !log tools completed migrating magnus' tools to trusty, more details at https://etherpad.wikimedia.org/p/tools-trusty-move
[05:47:05] Logged the message, Master
[06:27:34] Tool-Labs: Setup *.labsdb as DNS entries instead of *manually* set /etc/hosts entries - https://phabricator.wikimedia.org/T88236#1006672 (yuvipanda) NEW a:coren
[06:32:11] Tool-Labs: Setup *.labsdb as DNS entries instead of *manually* set /etc/hosts entries - https://phabricator.wikimedia.org/T88236#1006680 (yuvipanda) There are some 800+ entries in that /etc/hosts file.
[06:32:36] Tool-Labs: Setup *.labsdb as DNS entries instead of *manually* set /etc/hosts entries - https://phabricator.wikimedia.org/T88236#1006681 (yuvipanda)
[06:41:49] !log tools set chmod +xw manually on /var/run/lighttpd on webgrid-05, need to investigate why it was necessary
[06:41:53] Logged the message, Master
[06:52:38] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[06:54:42] hi GerardM-
[06:54:45] did something break?
[07:14:59] Hoi ... I just woke up
[07:17:44] GerardM-: heh, so nothing broke then.
[07:17:45] not ba
[07:17:46] d
[07:22:39] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:28:30] I have not done much yet ... breakfast coffee for the wife
[07:28:43] but so far I am happy
[07:29:14] PS did you read the reply that I wrote ?
[07:30:09] GerardM-: which one?
[07:31:30] a few minutes ago
[07:32:52] it is not meant as an attack but to point out that expectation and perspective are so starkly different
[07:48:54] YuviPanda: how is the load balancing working out ?
[07:49:58] GerardM-: re: email, it was a misunderstanding about what he called a minor ailment (wrong status being displayed, not instance being down)
[07:50:03] GerardM-: which load balancing? wdq?
[07:51:01] yes
[07:51:07] !log tools cleared error state of stuck queues
[07:51:11] Logged the message, Master
[07:51:16] GerardM-: seems ok. No outages reported in a while, afaik
[07:51:30] (outside of the great restart, that is)
[07:51:36] so I can blog about it ?
[07:52:21] Tool-Labs: Track and alert based on gridengine error states - https://phabricator.wikimedia.org/T88237#1006683 (yuvipanda) NEW
[07:53:02] GerardM-: sure.
[07:54:10] If I may I will have you proof read it
[07:54:16] GerardM-: sure!
[07:54:48] GerardM-: I can also add technical details somewhere if you wish
[07:54:56] in fact, let me do that now
[07:56:12] yes please
[07:56:35] It is important to write about positive things
[07:56:40] the more public the better
[07:57:46] GerardM-: yeah, am writing up now. just technical details. moment
[08:00:39] GerardM-: https://wikitech.wikimedia.org/wiki/Nova_Resource:Wdq-mm/Documentation
[08:01:24] thank you
[08:01:32] I will write about it in my afternoon
[08:01:36] GerardM-: yw. talk to magnus as well, see how he found the process :)
[08:20:27] Tool-Labs: Track and alert based on gridengine error states - https://phabricator.wikimedia.org/T88237#1006699 (Bgwhite) A thousand years ago when I ran SGE, there was an administrator email setting. This would send emails if there were problems. Nagios also had SGE plugins. You whippersnappers run some ne...
[09:20:44] Tool-Labs: Setup *.labsdb as DNS entries instead of *manually* set /etc/hosts entries - https://phabricator.wikimedia.org/T88236#1006708 (scfc)
[09:20:46] Wikimedia-Labs-Infrastructure: Move LabsDB aliases and NAT to DNS and LabsDB servers - https://phabricator.wikimedia.org/T63897#1006709 (scfc)
[09:21:45] Wikimedia-Labs-Infrastructure: Move LabsDB aliases and NAT to DNS and LabsDB servers - https://phabricator.wikimedia.org/T63897#653806 (scfc) The NAT part of this task is probably obsolete now.
[09:38:16] Hoi, I posted about diplomats and ambassadors ..
[09:38:36] the point is very much in how serious we take the understanding of structure
[09:41:55] PROBLEM - Host tools-webgrid-06 is DOWN: CRITICAL - Host Unreachable (10.68.17.163)
[09:41:59] wat
[09:42:00] sigh
[09:42:26] andrewbogott_afk: ^ this seems to happen every time I set up a new toollabs host
[09:42:36] am I just getting on the wrong boxes by pure bad luck?
[09:46:34] RECOVERY - Host tools-webgrid-06 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms
[09:50:02] Tool-Labs: Track and alert based on gridengine error states - https://phabricator.wikimedia.org/T88237#1006717 (scfc) Oh, mails are sent indeed :-). (My) problem however was (and is) that both a job that failed because the output file could not be created due to the user's choice of name and a failed job tha...
[09:51:06] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[09:53:27] Tool-Labs: Track and alert based on gridengine error states - https://phabricator.wikimedia.org/T88237#1006721 (yuvipanda) I guess easiest(?) thing to do is write a diamond collector that tracks gridengine queue stats :)
[10:16:11] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:13:18] hi
[11:14:34] it's been more than 3 days my tools are down, is there any indication of when "everything will be up again" ?
[15:17:17] PROBLEM - Puppet staleness on tools-exec-07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0]
[16:25:57] Wikimedia-Labs-wikistats: Add 300 000 wikia wikis to stats table - https://phabricator.wikimedia.org/T38291#1006858 (Nemo_bis) I don't see any duplicates in the WikiTeam repo list (updated in 72d43634d566d4a2cc1601324b24b5eeac79eb13 ). I can update that list again if you want to use it. I don't see duplicates...
[17:21:19] Is there a backup of toolserver svn and fisheye?
[17:45:10] escalier: yes
[17:45:34] where?
[17:46:02] in nosy's home dir
[17:46:18] http://thread.gmane.org/gmane.org.wikimedia.toolserver/6630/focus=6636
[17:49:14] escalier: which repository are you looking for?
[17:54:01] Are you talking about /home/nosy/svn-ts.tgz ?
[17:54:10] valhallasw`cloud: this one http://web.archive.org/web/20140118095059/https://fisheye.toolserver.org/browse/erwin85
[17:54:19] escalier: yes
[18:03:37] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<22.22%)
[18:23:40] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[18:30:47] Wikimedia-Labs-Infrastructure: Move LabsDB aliases and NAT to DNS and LabsDB servers - https://phabricator.wikimedia.org/T63897#1007006 (scfc) IIRC the last time I thought about that there were basically two alternatives: # Put the aliases in `operations/dns:templates/wmnet` under `labsdb.svc.eqiad.wmnet` an...