[00:29:50] Anyone know why I'm getting a 502 on http://logstash-beta.wmflabs.org/ ?
[02:36:23] Labs: Officetools instance (running sugarCRM) is in suspended state and won't reboot). - https://phabricator.wikimedia.org/T99339#1290135 (Jalexander) NEW
[02:57:07] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[03:16:17] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 771088 bytes in 2.268 second response time
[03:22:26] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[03:37:19] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 771088 bytes in 2.567 second response time
[04:08:51] !log wikidata-dev stopping instance wikidata-wdq-mm to give its host a bit of breathing room
[04:08:59] Logged the message, dummy
[04:18:22] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290156 (Andrew) NEW a:Cmjohnson
[04:19:40] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290168 (Andrew) There's a fair amount of other ugliness in dmesg, e.g. [1843134.114144] INFO: task gmond:61831 blocked for more than 120 seconds. [1843134.145729] Not tainted 3.13.0-49-generic #83-Ub...
[04:20:34] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290169 (Andrew) I'm going to leave the system up for now, since we might as well minimize the labs outage. I can't imagine this isn't going to require a dc visit though :(
[04:20:53] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290170 (Andrew) p:Triage>Unbreak!
[04:31:20] !log bots changed admins config from admin to root for self
[04:31:25] Logged the message, Master
[04:36:51] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290172 (Andrew) Oh, btw, sshd and ganglia-monitor are comatose on that system for reasons that are unclear to me. The mgmt console is working fine.
[06:05:13] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290209 (Joe) @andrew why leaving this up would have "minimized the labs outage" is not clear to me. You've basically left a completely broken system (and an UBN!) ticket open to be consumed over the weekend...
[06:06:12] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290210 (Joe) The only reason why I'm not rebooting this machine is that Andrew implied it would mean having downtime for labs, but I don't really see an alternative to an hard powercycle for now.
[06:13:14] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290226 (mark) According to the ILO sensors both the fans and the temp sensors indicate OK/good health, so I doubt it's actually a matter of overheating.
[07:02:41] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290273 (ArielGlenn) The labs instances on the box seem to be working fine fwiw.
[08:47:36] PROBLEM - Host tools-exec-wmt is DOWN: CRITICAL - Host Unreachable (10.68.16.41)
[08:47:40] PROBLEM - Host tools-checker-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.97)
[08:47:46] PROBLEM - Host tools-webgrid-lighttpd-1207 is DOWN: CRITICAL - Host Unreachable (10.68.16.215)
[08:48:09] PROBLEM - Host tools-exec-07 is DOWN: CRITICAL - Host Unreachable (10.68.16.36)
[08:48:37] PROBLEM - Host tools-webgrid-generic-1401 is DOWN: CRITICAL - Host Unreachable (10.68.18.51)
[08:49:23] PROBLEM - Host tools-exec-1219 is DOWN: CRITICAL - Host Unreachable (10.68.18.40)
[08:49:24] PROBLEM - Host tools-trusty is DOWN: CRITICAL - Host Unreachable (10.68.16.63)
[08:49:56] PROBLEM - Host tools-exec-1216 is DOWN: CRITICAL - Host Unreachable (10.68.17.255)
[08:50:00] PROBLEM - Host tools-webgrid-lighttpd-1402 is DOWN: CRITICAL - Host Unreachable (10.68.16.35)
[08:50:09] PROBLEM - Host tools-mail is DOWN: CRITICAL - Host Unreachable (10.68.16.27)
[08:50:20] PROBLEM - Host tools-webgrid-lighttpd-1210 is DOWN: CRITICAL - Host Unreachable (10.68.17.163)
[08:50:20] PROBLEM - Host tools-exec-1407 is DOWN: CRITICAL - Host Unreachable (10.68.18.16)
[08:50:32] PROBLEM - Host tools-checker-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.17)
[08:50:38] PROBLEM - Host tools-webgrid-lighttpd-1407 is DOWN: CRITICAL - Host Unreachable (10.68.17.251)
[08:50:56] PROBLEM - Host tools-services-02 is DOWN: CRITICAL - Host Unreachable (10.68.18.36)
[08:51:00] PROBLEM - Host tools-webgrid-lighttpd-1202 is DOWN: CRITICAL - Host Unreachable (10.68.18.46)
[09:26:44] RECOVERY - Host tools-checker-01 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[09:26:54] RECOVERY - Host tools-checker-02 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[09:27:46] RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[09:28:06] RECOVERY - Host tools-exec-1407 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[09:28:14] RECOVERY - Host tools-mail is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[09:28:18] RECOVERY - Host tools-exec-1219 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[09:28:50] RECOVERY - Host tools-trusty is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[09:29:24] RECOVERY - Host tools-webgrid-lighttpd-1202 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[09:29:25] RECOVERY - Host tools-webgrid-lighttpd-1210 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[09:29:26] RECOVERY - Host tools-webgrid-lighttpd-1207 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms
[09:29:39] RECOVERY - Host tools-exec-wmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[09:29:43] RECOVERY - Host tools-webgrid-lighttpd-1402 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[09:29:58] RECOVERY - Host tools-exec-1216 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
[09:30:21] RECOVERY - Host tools-services-02 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[09:30:33] RECOVERY - Host tools-webgrid-lighttpd-1407 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[09:31:43] RECOVERY - Host tools-webgrid-generic-1401 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[09:33:01] PROBLEM - Puppet failure on tools-exec-1407 is CRITICAL 60.00% of data above the critical threshold [0.0]
[09:39:13] Labs, Tool-Labs, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1290305 (yuvipanda) So this bit us again today. I guess we should write a small script that identifies failover instances on the same virt* ho...
[09:41:37] Labs, Tool-Labs, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1290306 (mark) p:Low>High
[09:42:30] Labs, Tool-Labs, ToolLabs-Goals-Q4: Migrate tools-checker-02 away from labvirt1003 - https://phabricator.wikimedia.org/T99347#1290308 (coren) NEW
[09:43:01] RECOVERY - Puppet failure on tools-exec-1407 is OK Less than 1.00% above the threshold [0.0]
[09:52:32] Labs, Tool-Labs, ToolLabs-Goals-Q4: Migrate tools-checker-02 away from labvirt1003 - https://phabricator.wikimedia.org/T99347#1290314 (yuvipanda) We should just delete checker-01 now and rebuild it - and hope that shows up on another host :)
[10:19:30] Coren: are those SGE panic mails related to the virt1003 outage?
[10:19:55] valhallasw: They are, though everything should be back to full happy by now.
[10:21:04] Coren: this was just shutting down some hosts to offload virt1003?
[10:21:28] valhallasw: No, we had to emergency reboot virt1003 entirely
[10:21:34] ah, ok
[13:30:43] Labs, operations, ops-eqiad: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290542 (Andrew) Open>Resolved Detailed report is here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150515-LabsOutage
[17:00:02] [intuition] siebrand pushed 1 new commit to master: https://github.com/Krinkle/intuition/commit/6df6774f083fbeb2e4406792b897ba56b10300ad
[17:00:03] intuition/master 6df6774 Siebrand Mazeland: Localisation updates from https://translatewiki.net.
[21:46:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL 50.00% of data above the critical threshold [0.0]
[22:01:29] Tool-Labs: kmlexport perl script memory usage - https://phabricator.wikimedia.org/T99236#1290997 (valhallasw) Happening again: ``` 19202 tools.kmlexport 20 0 474656 423336 1648 S 0.0 5.2 3:11.94 /usr/sbin/lighttpd -f /var/run/lighttpd/kmlexport.conf -D 22130 tools.kmlexport 20 0 487712...
[22:12:24] Coren: ^ I can't find anything about lighttpd being able to kill cgi processes :/ Apache apparently can do this, which explains why it wasn't an issue on the TS
[22:36:33] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK Less than 1.00% above the threshold [0.0]
[23:12:08] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Tullis was created, changed by Tullis link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Tullis edit summary: Created page with "{{Tools Access Request |Justification=I'm not 100% sure at the moment. I don't have a specific use in mind in the short-term, but I would like to become more familiar with the..."
[23:12:57] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Tullis was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=159570 edit summary:
[23:22:33] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL 20.00% of data above the critical threshold [0.0]
[23:52:34] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK Less than 1.00% above the threshold [0.0]