[00:29:50] <matt_flaschen>	 Anyone know why I'm getting a 502 on http://logstash-beta.wmflabs.org/ ?
[02:36:23] <wikibugs>	 Labs: Officetools instance (running sugarCRM) is in suspended state and won't reboot). - https://phabricator.wikimedia.org/T99339#1290135 (Jalexander) NEW
[02:57:07] <shinken-wm>	 PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[03:16:17] <shinken-wm>	 RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 771088 bytes in 2.268 second response time
[03:22:26] <shinken-wm>	 PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[03:37:19] <shinken-wm>	 RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 771088 bytes in 2.567 second response time
[04:08:51] <andrewbogott>	 !log wikidata-dev stopping instance wikidata-wdq-mm to give its host a bit of breathing room
[04:08:59] <labs-morebots>	 Logged the message, dummy
[04:18:22] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290156 (Andrew) NEW a:Cmjohnson
[04:19:40] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290168 (Andrew) There's a fair amount of other ugliness in dmesg, e.g.  [1843134.114144] INFO: task gmond:61831 blocked for more than 120 seconds. [1843134.145729]       Not tainted 3.13.0-49-generic #83-Ub...
[04:20:34] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290169 (Andrew) I'm going to leave the system up for now, since we might as well minimize the labs outage.  I can't imagine this isn't going to require a dc visit though :(
[04:20:53] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290170 (Andrew) p:Triage>Unbreak!
[04:31:20] <sDrewth>	 !log bots changed admins config from admin to root for self
[04:31:25] <labs-morebots>	 Logged the message, Master
[04:36:51] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290172 (Andrew) Oh, btw, sshd and ganglia-monitor are comatose on that system for reasons that are unclear to me.  The mgmt console is working fine.
[06:05:13] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290209 (Joe) @andrew why leaving this up would have "minimized the labs outage" is not clear to me. You've basically left a completely broken system (and an UBN!) ticket open to be consumed over the weekend...
[06:06:12] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290210 (Joe) The only reason why I'm not rebooting this machine is that Andrew implied it would mean having downtime for labs, but I don't really see an alternative to an hard powercycle for now.
[06:13:14] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290226 (mark) According to the ILO sensors both the fans and the temp sensors indicate OK/good health, so I doubt it's actually a matter of overheating.
[07:02:41] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290273 (ArielGlenn) The labs instances on the box seem to be working fine fwiw.
[08:47:36] <shinken-wm>	 PROBLEM - Host tools-exec-wmt is DOWN: CRITICAL - Host Unreachable (10.68.16.41)
[08:47:40] <shinken-wm>	 PROBLEM - Host tools-checker-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.97)
[08:47:46] <shinken-wm>	 PROBLEM - Host tools-webgrid-lighttpd-1207 is DOWN: CRITICAL - Host Unreachable (10.68.16.215)
[08:48:09] <shinken-wm>	 PROBLEM - Host tools-exec-07 is DOWN: CRITICAL - Host Unreachable (10.68.16.36)
[08:48:37] <shinken-wm>	 PROBLEM - Host tools-webgrid-generic-1401 is DOWN: CRITICAL - Host Unreachable (10.68.18.51)
[08:49:23] <shinken-wm>	 PROBLEM - Host tools-exec-1219 is DOWN: CRITICAL - Host Unreachable (10.68.18.40)
[08:49:24] <shinken-wm>	 PROBLEM - Host tools-trusty is DOWN: CRITICAL - Host Unreachable (10.68.16.63)
[08:49:56] <shinken-wm>	 PROBLEM - Host tools-exec-1216 is DOWN: CRITICAL - Host Unreachable (10.68.17.255)
[08:50:00] <shinken-wm>	 PROBLEM - Host tools-webgrid-lighttpd-1402 is DOWN: CRITICAL - Host Unreachable (10.68.16.35)
[08:50:09] <shinken-wm>	 PROBLEM - Host tools-mail is DOWN: CRITICAL - Host Unreachable (10.68.16.27)
[08:50:20] <shinken-wm>	 PROBLEM - Host tools-webgrid-lighttpd-1210 is DOWN: CRITICAL - Host Unreachable (10.68.17.163)
[08:50:20] <shinken-wm>	 PROBLEM - Host tools-exec-1407 is DOWN: CRITICAL - Host Unreachable (10.68.18.16)
[08:50:32] <shinken-wm>	 PROBLEM - Host tools-checker-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.17)
[08:50:38] <shinken-wm>	 PROBLEM - Host tools-webgrid-lighttpd-1407 is DOWN: CRITICAL - Host Unreachable (10.68.17.251)
[08:50:56] <shinken-wm>	 PROBLEM - Host tools-services-02 is DOWN: CRITICAL - Host Unreachable (10.68.18.36)
[08:51:00] <shinken-wm>	 PROBLEM - Host tools-webgrid-lighttpd-1202 is DOWN: CRITICAL - Host Unreachable (10.68.18.46)
[09:26:44] <shinken-wm>	 RECOVERY - Host tools-checker-01 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[09:26:54] <shinken-wm>	 RECOVERY - Host tools-checker-02 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[09:27:46] <shinken-wm>	 RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[09:28:06] <shinken-wm>	 RECOVERY - Host tools-exec-1407 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[09:28:14] <shinken-wm>	 RECOVERY - Host tools-mail is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[09:28:18] <shinken-wm>	 RECOVERY - Host tools-exec-1219 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[09:28:50] <shinken-wm>	 RECOVERY - Host tools-trusty is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[09:29:24] <shinken-wm>	 RECOVERY - Host tools-webgrid-lighttpd-1202 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[09:29:25] <shinken-wm>	 RECOVERY - Host tools-webgrid-lighttpd-1210 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[09:29:26] <shinken-wm>	 RECOVERY - Host tools-webgrid-lighttpd-1207 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms
[09:29:39] <shinken-wm>	 RECOVERY - Host tools-exec-wmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[09:29:43] <shinken-wm>	 RECOVERY - Host tools-webgrid-lighttpd-1402 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[09:29:58] <shinken-wm>	 RECOVERY - Host tools-exec-1216 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
[09:30:21] <shinken-wm>	 RECOVERY - Host tools-services-02 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[09:30:33] <shinken-wm>	 RECOVERY - Host tools-webgrid-lighttpd-1407 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[09:31:43] <shinken-wm>	 RECOVERY - Host tools-webgrid-generic-1401 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[09:33:01] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 60.00% of data above the critical threshold [0.0]
[09:39:13] <wikibugs>	 Labs, Tool-Labs, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1290305 (yuvipanda) So this bit us again today. I guess we should write a small script that identifies failover instances on the same virt* ho...
[09:41:37] <wikibugs>	 Labs, Tool-Labs, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1290306 (mark) p:Low>High
[09:42:30] <wikibugs>	 Labs, Tool-Labs, ToolLabs-Goals-Q4: Migrate tools-checker-02 away from labvirt1003 - https://phabricator.wikimedia.org/T99347#1290308 (coren) NEW
[09:43:01] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0]
[09:52:32] <wikibugs>	 Labs, Tool-Labs, ToolLabs-Goals-Q4: Migrate tools-checker-02 away from labvirt1003 - https://phabricator.wikimedia.org/T99347#1290314 (yuvipanda) We should just delete checker-01 now and rebuild it - and hope that shows up on another host :)
[10:19:30] <valhallasw>	 Coren: are those SGE panic mails related to the virt1003 outage?
[10:19:55] <Coren>	 valhallasw: They are, though everything should be back to full happy by now.
[10:21:04] <valhallasw>	 Coren: this was just shutting down some hosts to offload virt1003?
[10:21:28] <Coren>	 valhallasw: No, we had to emergency reboot virt1003 entirely
[10:21:34] <valhallasw>	 ah, ok
[13:30:43] <wikibugs>	 Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290542 (Andrew) Open>Resolved Detailed report is here:  https://wikitech.wikimedia.org/wiki/Incident_documentation/20150515-LabsOutage
[17:00:02] <github>	 [intuition] siebrand pushed 1 new commit to master: https://github.com/Krinkle/intuition/commit/6df6774f083fbeb2e4406792b897ba56b10300ad
[17:00:03] <github>	 intuition/master 6df6774 Siebrand Mazeland: Localisation updates from https://translatewiki.net.
[21:46:34] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL 50.00% of data above the critical threshold [0.0]
[22:01:29] <wikibugs>	 Tool-Labs: kmlexport perl script memory usage - https://phabricator.wikimedia.org/T99236#1290997 (valhallasw) Happening again:  ``` 19202 tools.kmlexport     20   0  474656 423336   1648 S   0.0  5.2   3:11.94 /usr/sbin/lighttpd -f /var/run/lighttpd/kmlexport.conf -D 22130 tools.kmlexport     20   0  487712...
[22:12:24] <valhallasw>	 Coren: ^ I can't find anything about lighttpd being able to kill cgi processes :/ Apache apparently can do this, which explains why it wasn't an issue on the TS
[22:36:33] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK Less than 1.00% above the threshold [0.0]
[23:12:08] <wm-bot>	 Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Tullis was created, changed by Tullis link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Tullis edit summary: Created page with "{{Tools Access Request |Justification=I'm not 100% sure at the moment. I don't have a specific use in mind in the short-term, but I would like to become more familiar with the..."
[23:12:57] <wm-bot>	 Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Tullis was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=159570 edit summary: 
[23:22:33] <shinken-wm>	 PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL 20.00% of data above the critical threshold [0.0]
[23:52:34] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK Less than 1.00% above the threshold [0.0]