[01:32:14] !log tools killed tools-checker-01 instance, recreating
[01:32:20] Logged the message, Master
[01:33:37] PROBLEM - Host tools-checker-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.97)
[01:33:57] andrewbogott_afk: don't worry, shinken is me.
[01:34:54] yuvipanda: good to know :)
[01:35:41] andrewbogott: I'll set up accounts soon so I can start marking downtime and stuff
[01:35:47] andrewbogott: and not give you scares :)
[01:36:45] RECOVERY - Host tools-checker-01 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms
[01:36:55] !log tools created new tools-checker-01, applying role and provisioning
[01:37:01] Logged the message, Master
[01:42:21] Labs, Tool-Labs, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1292546 (yuvipanda)
[01:42:29] Labs, Tool-Labs, ToolLabs-Goals-Q4: Migrate tools-checker-02 away from labvirt1003 - https://phabricator.wikimedia.org/T99347#1292543 (yuvipanda) Open>Resolved a: yuvipanda Recreated tools-checker-01, is on labvirt1009 now.
[01:45:58] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL: 42.86% of data above the critical threshold [0.0]
[01:55:59] RECOVERY - Puppet failure on tools-checker-01 is OK: Less than 1.00% above the threshold [0.0]
[06:46:44] Tool-Labs: apt fighting over /var/cache/apt/pkgcache et al - https://phabricator.wikimedia.org/T99495#1292741 (Legoktm)
[07:04:51] Tool-Labs: apt fighting over /var/cache/apt/pkgcache et al - https://phabricator.wikimedia.org/T99495#1292748 (yuvipanda) Same as T92491?
[07:06:35] Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1292751 (valhallasw) ``` /etc/cron.daily/apt: Traceback (most recent call last): File "/usr/bin/unattended-upgrade", line 906, in main(options) File "/usr/bin/unattended-upgrade", line 695...
[07:06:44] Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1292753 (valhallasw)
[07:06:45] Tool-Labs: apt fighting over /var/cache/apt/pkgcache et al - https://phabricator.wikimedia.org/T99495#1292752 (valhallasw)
[07:07:12] Tool-Labs: apt fighting over /var/cache/apt/pkgcache et al - https://phabricator.wikimedia.org/T99495#1292735 (valhallasw) Yes, but couldn't find that one by searching for 'tool-labs apt' :/
[09:38:46] !log bots removed all backups for year 2014 in order to save some space (freed ~ 20GB) on /data/project storage
[09:38:51] Logged the message, Master
[10:52:50] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Jakob was modified, changed by Jakob link https://wikitech.wikimedia.org/w/index.php?diff=159627 edit summary:
[11:12:17] Tool-Labs, Wikimedia-Hackathon-2015: Tool-labs meeting agenda for Lyon Hackathon - https://phabricator.wikimedia.org/T98912#1293089 (Qgil) It is time to promote #Wikimedia-Hackathon-2015 activities in the [[ https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2015/Program | program ]] (training sessions a...
[11:12:31] Tool-Labs, Wikimedia-Hackathon-2015: Conduct a Tool Labs workshop at Lyon Hackathon - https://phabricator.wikimedia.org/T91058#1293106 (Qgil) It is time to promote #Wikimedia-Hackathon-2015 activities in the [[ https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2015/Program | program ]] (training session...
[11:12:50] Tool-Labs, Wikimedia-Hackathon-2015, Documentation: Re-organize Tool Labs documentation - https://phabricator.wikimedia.org/T91509#1293125 (Qgil) It is time to promote #Wikimedia-Hackathon-2015 activities in the [[ https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2015/Program | program ]] (training...
[11:13:02] Labs, Wikimedia-Hackathon-2015: iPython for Labs: call for an interactive coding plattform - https://phabricator.wikimedia.org/T92506#1293135 (Qgil) It is time to promote #Wikimedia-Hackathon-2015 activities in the [[ https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2015/Program | program ]] (training s...
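For the apt lock contention tracked in T99495/T92491 above, a quick first diagnostic is to ask the kernel which processes are holding the contested files. An illustrative command, not taken from the tickets themselves:

```
# run as root on the affected instance; shows the PIDs (e.g. a manual
# apt-get run and the unattended-upgrades cron job) using the same files
fuser -v /var/lib/dpkg/lock /var/cache/apt/pkgcache.bin
```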
[12:45:03] Anyone around? I can't ssh from a new host after copying the id_rsa.pub to authorized_keys...
[13:01:52] Labs, Beta-Cluster: No DNS entry for deployment-logstash1.eqiad.wmflabs ? - https://phabricator.wikimedia.org/T99521#1293401 (mobrovac) NEW
[13:03:35] Hi, please restart webservice https://tools.wmflabs.org/wp-world/earth.php
[13:08:50] Labs, Beta-Cluster: No DNS entry for deployment-logstash1.eqiad.wmflabs ? - https://phabricator.wikimedia.org/T99521#1293422 (mobrovac)
[13:09:16] Labs, Beta-Cluster: No DNS entry for deployment-logstash1.eqiad.wmflabs ? - https://phabricator.wikimedia.org/T99521#1293401 (mobrovac)
[13:20:59] Is here anybody who is able to restart webservice https://tools.wmflabs.org/wp-world/earth.php ?
[13:22:43] Labs, Beta-Cluster, Release-Engineering: No DNS entry for deployment-logstash1.eqiad.wmflabs ? - https://phabricator.wikimedia.org/T99521#1293441 (Aklapper)
[13:35:09] Labs, Beta-Cluster, Release-Engineering: No DNS entry for deployment-logstash1.eqiad.wmflabs ? - https://phabricator.wikimedia.org/T99521#1293479 (Joe) We narrowed down the problem to be local to the ldap-backed dns, @Coren is looking into it at the moment.
[13:37:46] Labs, WMF-Legal: Request to review privacy policy and rules - https://phabricator.wikimedia.org/T97844#1293485 (d33tah) >>! In T97844#1266008, @jeremyb wrote: >>>! In T97844#1265952, @d33tah wrote: >> Yes, the last version of the website got 10k unique views within two days - I assumed that this is okay. >...
[13:38:17] Labs, WMF-Legal: Discussion: can I park WikiSpy under a separate, simpler domain? - https://phabricator.wikimedia.org/T97846#1293490 (d33tah) >>! In T97846#1265220, @coren wrote: > I agree there is nothing in the TOS that prevents it (provided Legal has no objection), but I should point out that we have no...
[13:44:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1403 is CRITICAL: 20.00% of data above the critical threshold [0.0]
[13:57:54] Is here anybody who is able to restart webservice https://tools.wmflabs.org/wp-world/earth.php ?
[14:06:46] Labs, Beta-Cluster, Release-Engineering: No DNS entry for deployment-logstash1.eqiad.wmflabs ? - https://phabricator.wikimedia.org/T99521#1293546 (coren) The issue may actually lie within dnsmasq after all; the instance is properly listed in the list it should be serving the name of, yet gives a SRVFAI...
[14:10:46] Coren: See Gerrytrieuo Comment
[14:14:34] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1403 is OK: Less than 1.00% above the threshold [0.0]
[14:15:32] Labs, operations: Andrew needs to get paged for labs cluster icinga alerts. - https://phabricator.wikimedia.org/T99524#1293563 (Andrew) NEW a: Andrew
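On the repeated wp-world restart requests above: the usual remedy on Tool Labs is for a maintainer (or a Tool Labs admin) to bounce the tool's webservice from a bastion. A minimal sketch, assuming the 2015-era `become` and `webservice` helpers:

```
# on tools-login, switch to the tool account and restart its webserver
$ become wp-world
$ webservice restart
# qstat shows whether the tool's grid job is running at all
$ qstat
```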
[14:19:43] Betacommand: cf https://phabricator.wikimedia.org/T99064 I emailed Kolossos with no news
[14:20:35] I'm a little hesitant to just truncate logs without his okay
[14:33:52] ah
[14:34:17] Coren how quickly did it fill the 1TB?
[14:44:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL: 44.44% of data above the critical threshold [0.0]
[14:47:33] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL: 60.00% of data above the critical threshold [0.0]
[15:01:40] Betacommand: Small number of weeks; but it looks like basically every hit on the web service causes a pile of error messages so it's very traffic-bound.
[15:02:26] Betacommand: If the tool is on the critical path of some workflow, I suppose I can just truncate the log and restart it for a while. I'd rather Kolossos take a look at it first, but in a pinch...
[15:03:37] No problem, just curious
[15:09:23] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK: Less than 1.00% above the threshold [0.0]
[15:12:31] RECOVERY - Puppet failure on tools-webproxy-02 is OK: Less than 1.00% above the threshold [0.0]
[15:17:48] !log deployment-prep rebooting deployment-logstash1
[15:17:57] Logged the message, dummy
[15:26:25] Labs, Beta-Cluster, Release-Engineering: No DNS entry for deployment-logstash1.eqiad.wmflabs ? - https://phabricator.wikimedia.org/T99521#1293748 (Andrew) Open>Resolved a: Andrew No idea why this broke, but a reboot fixed it.
[16:11:56] PROBLEM - Host tools-exec-1207 is DOWN: CRITICAL - Host Unreachable (10.68.17.113)
[16:14:24] ^ works for me
[16:16:56] RECOVERY - Host tools-exec-1207 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[16:22:35] Is the topic status correct?
[16:25:56] Coren: I'm getting timeouts on tool labs
[16:26:13] Timeouts on what? Web?
[16:26:24] Web
[16:27:03] sounds performance related which is why I asked about the "possible performance issues due to labvirt1003 outage / NFS backup"
[16:27:28] Hm. Anecdotally, I see nothing wrong. The landing page is snappy for me and so are the 3-4 tools I just randomly checked. What tool are you getting the timeouts with?
[16:27:56] Coren: supercount and meta
[16:28:36] I see supercount as ill too. Lemme see if I can find some obvious cause.
[16:29:21] supercount worked after a few tries but meta is still timing out
[16:39:33] Labs, Patch-For-Review: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1293877 (jcrespo)
[16:41:47] Negative24: I see nothing systemic. Lemme see if I can kick the tools themselves.
[16:42:43] Negative24: They seem to be feeling better.
[16:43:37] Hey guys, is there any reason why particular tools can't be loaded. For example: https://tools.wmflabs.org/catscan2/catscan2.php
[16:44:08] PROBLEM - Host tools-exec-1206 is DOWN: CRITICAL - Host Unreachable (10.68.17.105)
[16:44:15] Labs, Beta-Cluster, Release-Engineering: No DNS entry for deployment-logstash1.eqiad.wmflabs ? - https://phabricator.wikimedia.org/T99521#1293890 (coren) As a further note (so that we can recognize the bug if it recurs), dnsmasq /did/ have a valid lease for that IP and had the correct name associated w...
[16:45:06] nls: Each tool has its own webserver, sometimes it's just a matter of them being overloaded or busy.
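The deployment-logstash1 symptom described in T99521 above (a valid dnsmasq lease, yet SRVFAIL answers) is the kind of thing that can be reproduced from any Labs instance with a direct query. An illustrative check, with the resolver address left as a placeholder since it is simply whatever /etc/resolv.conf on an instance points at:

```
# ask the labs resolver directly; while the bug was present the reply
# header would show status: SERVFAIL instead of NOERROR
dig deployment-logstash1.eqiad.wmflabs @<resolver-ip>
```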
[16:46:41] PROBLEM - Host tools-exec-1202 is DOWN: CRITICAL - Host Unreachable (10.68.16.57)
[16:47:09] PROBLEM - Host tools-mailrelay-01 is DOWN: PING CRITICAL - Packet loss = 100%
[16:47:12] PROBLEM - Host tools-redis-slave is DOWN: PING CRITICAL - Packet loss = 100%
[16:47:13] Okey, just wait and watch
[16:47:20] PROBLEM - Host tools-exec-1213 is DOWN: CRITICAL - Host Unreachable (10.68.17.252)
[16:47:28] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[16:47:38] PROBLEM - Host tools-exec-1209 is DOWN: PING CRITICAL - Packet loss = 100%
[16:48:00] PROBLEM - Host tools-exec-1201 is DOWN: CRITICAL - Host Unreachable (10.68.17.49)
[16:48:22] thanks :-*
[16:48:36] PROBLEM - Host tools-exec-1408 is DOWN: PING CRITICAL - Packet loss = 100%
[16:48:36] PROBLEM - Host tools-bastion-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.44)
[16:48:37] PROBLEM - Host tools-exec-1217 is DOWN: CRITICAL - Host Unreachable (10.68.18.20)
[16:48:39] labs bastion not responding :/
[16:49:32] PROBLEM - Host tools-exec-1218 is DOWN: CRITICAL - Host Unreachable (10.68.18.19)
[16:49:34] PROBLEM - Host tools-exec-1204 is DOWN: CRITICAL - Host Unreachable (10.68.17.88)
[16:49:42] PROBLEM - Host tools-webgrid-generic-1404 is DOWN: PING CRITICAL - Packet loss = 100%
[16:49:44] PROBLEM - Host tools-webgrid-lighttpd-1410 is DOWN: CRITICAL - Host Unreachable (10.68.18.44)
[16:49:53] ummm
[16:49:58] andrewbogott ^^ seems a labs-wide issue
[16:50:01] Coren: ?
[16:50:15] legoktm: I'm looking at it now.
[16:50:17] [09:46:29] PROBLEM - dhclient process on labvirt1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:50:17] [09:46:49] PROBLEM - salt-minion processes on labvirt1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:50:19] PROBLEM - Host tools-static-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.216)
[16:50:19] PROBLEM - Host tools-webgrid-lighttpd-1409 is DOWN: PING CRITICAL - Packet loss = 100%
[16:50:20] okay
[16:50:39] PROBLEM - Host tools-exec-cyberbot is DOWN: CRITICAL - Host Unreachable (10.68.16.39)
[16:52:56] It looks like we just lost one of the virt hosts
[16:53:31] Tools lost a couple exec nodes, but should otherwise survive this.
[16:57:20] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 773822 bytes in 2.324 second response time
[17:04:16] Coren: waiting for labvirt1001 to post and then I'll restart things
[17:06:49] andrewbogott_afk: Oh, ah, you're the one I kicked off the console then? :-)
[17:06:51] andrewbogott_afk: I asked on the other channel if anyone was on it.
[17:09:47] deployment- hosts are coming back...
[17:09:51] Coren: sorry, probably missed it due to associated name change
[17:10:12] Coren: anyway… what the heck was that?
[17:10:30] andrewbogott: I have no idea. afaict, labvirt1001 just asplode.
[17:10:54] andrewbogott: I'm going to dig in logs now that it's back up.
[17:11:03] I was running another scripted suspend/resume. So once again I suspect that as a cause… although it was fine until just before croaking.
[17:11:27] Eventually we'll restart all of these instances by having to reboot every single compute node :(
[17:11:28] ... "fine until it broke" is rarely a useful metric. :-)
[17:15:39] RECOVERY - Host tools-bastion-02 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[17:16:15] RECOVERY - Host tools-exec-1202 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[17:16:23] RECOVERY - Host tools-exec-1213 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[17:16:31] RECOVERY - Host tools-exec-1201 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[17:16:53] RECOVERY - Host tools-exec-1218 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[17:17:11] RECOVERY - Host tools-exec-1209 is UP: PING OK - Packet loss = 0%, RTA = 3.48 ms
[17:17:15] RECOVERY - Host tools-exec-1204 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[17:17:15] RECOVERY - Host tools-exec-1217 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[17:17:29] RECOVERY - Host tools-exec-cyberbot is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[17:17:39] RECOVERY - Host tools-exec-1206 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[17:17:51] RECOVERY - Host tools-redis-slave is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms
[17:18:13] RECOVERY - Host tools-webgrid-lighttpd-1410 is UP: PING OK - Packet loss = 0%, RTA = 424.62 ms
[17:18:56] andrewbogott: Inspection of the logs is not very illuminating. Nothing out of the ordinary until the kernel loses it after the VM system gets apparently corrupted (two Bad RIP values, followed by pretty much all CPUs softlocked)
[17:19:09] RECOVERY - Host tools-webgrid-generic-1404 is UP: PING OK - Packet loss = 0%, RTA = 2.75 ms
[17:19:23] RECOVERY - Host tools-mailrelay-01 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[17:19:28] RECOVERY - Host tools-static-02 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms
[17:19:46] RECOVERY - Host tools-webgrid-lighttpd-1409 is UP: PING OK - Packet loss = 0%, RTA = 3.03 ms
[17:26:22] Coren: can you add what you've learned (even if it isn't much) to https://wikitech.wikimedia.org/wiki/Incident_documentation/20150518-LabsOutage ?
[17:29:18] andrewbogott: Sure.
[17:29:21] andrewbogott: wow, 1001 too?
[17:29:26] also hello everyone
[17:29:36] yuvipanda: you in Europe now, or still in CA?
[17:29:46] I'm in New York but apparently on a CA timezone...
[17:30:00] yuvipanda: when do you fly next?
[17:30:05] andrewbogott: thursday
[17:30:08] to Lyon
[17:30:10] cool
[17:30:14] then we won't all be traveling at once.
[17:30:18] ah right
[17:30:36] I'm flying on Wednesday, will be doing touristy stuff on Thursday but will hopefully be textable.
[17:31:02] yuvipanda: I believe Coren's phrase was 'lost its shit'
[17:31:08] ah, heh
[17:31:20] so it's been evacuated?
[17:31:23] or just a restart?
[17:31:28] just a reboot
[17:31:33] hmm sigh
[17:31:37] And I upgrade the kernel just for the hell of it
[17:31:40] *upgraded
[17:31:58] right
[17:32:10] same symptoms as 1003?
[17:32:17] not really
[17:32:28] on 1003 the VMs kept running just fine while the host freaked out
[17:32:37] on 1001 I believe VMs were the first to go.
[17:32:37] ah, here everything freaked out?
[17:32:53] hmm sigh, ok
[17:33:29] andrewbogott: so everything's ok now?
[17:33:42] yuvipanda: back up at least
[17:33:50] andrewbogott: 1001's learned diagnosis from me: "Kernel spontaneously asplode".
[17:34:10] yay
[17:34:46] It's admittedly neither informative nor reassuring - but absolutely everything in the logs looks perfectly ordinary until the kernel loses all of its bolts at once.
[17:35:10] I can't imagine why the suspend/resumes would trigger this.
[17:35:22] But it correlates
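The "scripted suspend/resume" mentioned at 17:11 was not shown in channel, and may have driven nova rather than libvirt directly; a purely hypothetical sketch of what such a test loop looks like at the libvirt level on a compute host (domain selection and settle time are made up):

```
#!/bin/sh
# suspend and resume every running domain on this labvirt host, one at a time
for dom in $(virsh list --name); do
    virsh suspend "$dom"
    sleep 10          # arbitrary settle time
    virsh resume "$dom"
done
```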
[17:35:56] I suppose we can't blame this on the cable cut :(
[17:36:16] there was a cable cut?
[17:36:26] hmm, living without backscroll has interesting consequences :D
[17:36:39] I'll wait for the incident report, I guess
[17:37:09] yuvipanda: the cable cut isn't WMF specific, just causing some effects US-internet-wide.
[17:37:20] ah, right
[17:37:21] ok
[17:42:16] ok, I need lunch before the meeting
[17:46:41] Need smoke before same meeting.
[17:47:59] Can someone help? I can't ssh from a new host after copying the id_rsa.pub to authorized_keys
[17:53:19] jem: hey! is this on labs? authorized_keys doesn't usually work - you've to use wikitech.wikimedia.org preferences to manage your keys
[18:35:25] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Jakob was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=159779 edit summary:
[18:36:54] Labs, hardware-requests, operations: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1294310 (Andrew) Hold off on this; Mark thinks we can do this in place on the existing server.
[18:47:21] Labs, hardware-requests, operations: labnet1002 - https://phabricator.wikimedia.org/T98740#1294364 (RobH) a: RobH
[19:03:54] Solved thanks to the indication of yuvipanda (I had completely forgotten)
[19:26:05] boo
[19:26:07] Coren: looks good from here. Thanks
[19:26:08] IRC problems >_>
[19:27:43] yuvipanda-wmf: do you use a bouncer?
[19:28:00] I've been experimenting. I used to use a bouncer until yesterday
[19:28:11] and now am experimenting to see if my productivity varies if I stop using a bouncer
[19:28:15] and stop reading scrollback
[19:28:20] heh
[19:28:32] andrewbogott: are there any examples I can use to see how to script nova commands?
[19:28:41] I opted for a bouncer so I could spy on what everyone does during the day :P
[19:28:44] I want to basically get a list of instances on a project (tools) and hosts they are on...
[19:29:01] Negative24: yeah, I did that, but I also tracked how much time I spent on IRC every week and it's about 20 hours...
[19:29:11] so I've an IRC addiction problem :)
[19:29:42] :)
[19:36:00] yuvipanda-wmf: you could look at ~root/novastats/whatever
[19:36:09] looking
[19:36:30] in theory that's just $ nova --os-tenant-id list --host
[19:36:43] But as I recall that particular combination is buggy in icehouse :)
[19:38:26] heh
[19:40:37] andrewbogott: so I'm going to do two things: write a script that figures out which instances should move to a different host (for redundancy), and then actually move them (cold migrate)
[19:40:47] and then I'll make that into an icinga check script so we know if it's fucking up.
[19:41:33] Labs, Tool-Labs, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1294654 (yuvipanda) a: yuvipanda
[19:42:45] yuvipanda-wmf: you want a script that automatically moves instances?
[19:42:59] yuvipanda-wmf: can't we just create instances on other nodes?
[19:43:47] brb
[19:44:07] yuvipanda-wmf: because basically it's limited to {tools-master, tools-secondary-master}, {checker-1, checker-2} on different hosts (maybe mailrelay-1 and -2 in the future), then spread the grid load between multiple hosts, right?
[19:44:55] (if you'd rather migrate, that's fine with me, but it sounds a bit like overkill to me)
[19:51:57] yuvipanda-wmf: or do you want to auto-migrate nodes so they are spread out to virt*s with the least lode?
[19:52:00] load*
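The one-liner quoted at 19:36 expands to roughly the sketch below, with flag spellings per the icehouse-era novaclient (the tenant id value is a placeholder, and as noted above the project-plus-host combination was buggy in icehouse):

```
# every instance in one project (admin credentials already in the environment)
nova --os-tenant-id <tools-tenant-id> list

# which virt host a given instance runs on (an admin-only attribute)
nova show tools-checker-01 | grep OS-EXT-SRV-ATTR:host

# the other direction: all instances on one compute node
nova list --all-tenants --host labvirt1001
```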
[20:08:13] yuvipanda-wmf: Thanks for your indication, that was it, I had completely forgotten about that
[20:26:43] Tool-Labs, Tracking: Missing Toolserver features in Tools (tracking) - https://phabricator.wikimedia.org/T60791#1294806 (Yurik)
[20:43:23] yuvipanda-wmf: help, shinken.
[20:43:30] slash SGE
[20:43:33] * valhallasw prods shinken-wm
[20:43:42] oh no, those were earlier
[20:46:17] Tool-Labs: SGE job failed for wp-world webservice: can't open output file "/data/project/wp-world/error.log" - https://phabricator.wikimedia.org/T99576#1294865 (valhallasw) NEW
[20:46:49] Tool-Labs: SGE job failed for wp-world webservice: can't open output file "/data/project/wp-world/error.log" - https://phabricator.wikimedia.org/T99576#1294872 (valhallasw) Open>Resolved a: valhallasw /data/project/wp-world/error.log was owned by kolossos:tools.wp-world. Chowned to tools.wp-world:too...
[20:48:06] Tool-Labs: SGE job failed for wp-world webservice: can't open output file "/data/project/wp-world/error.log" - https://phabricator.wikimedia.org/T99576#1294875 (valhallasw) Resolved>Open Reopening, actually, because we should provide a clearer error when this happens. @Yuvipanda, could you add a check...
[20:56:13] Coren: I suspect that NFS is… stopped.
[20:58:12] hm, back. Traffic spike
[21:10:09] valhallasw: ah, no - I mostly wanted to see if I *could* live migrate :)
[21:10:16] err
[21:10:17] :D
[21:10:17] migrate at all
[21:10:27] valhallasw: but the core thing would be the script itself.
[21:10:43] andrewbogott: is NFS dead?
[21:10:55] yeah, checking for virt*-spreading = woot woot!
[21:11:09] although I'm wondering whether we should have extra exec nodes stand by on those hosts then
[21:11:13] yuvipanda-wmf: it's been hesitating off and on due to traffic spikes
[21:11:34] seems ok now
[21:12:25] andrewbogott: ok
[21:25:37] yuvipanda|maybe: https://gerrit.wikimedia.org/r/#/c/211519/ can haz review?
[21:26:04] valhallasw: do you know where it comes from? A package?
[21:26:15] yuvipanda|maybe: that file? yeah, apt installs it
[21:26:42] I also realized maybe I should do it differently
[21:27:31] something like /usr/sbin/logrotate -v /etc/logrotate.conf > /var/log/logrotate.log 2>&1 || cat /var/log/logrotate.log >&2
[21:28:14] but this should work already, I think
[21:28:53] we'll see when the first logrotate mail enters our mailbox :-p
[21:28:55] valhallasw: hmm, so two things - if it's sourced off a package, can you note the package version it came from? And also why we are overriding it?
[21:29:01] valhallasw: this is so we avoid stuff like: https://phabricator.wikimedia.org/T85910
[21:29:24] valhallasw: and actually, if it comes from a package, I'd rather have us apt-source it and rebuild it with a patch if all we want to do is add logging but this is ok too if it's a patch
[21:29:36] it's temporary
[21:29:39] but adding an .orig is a good idea
[21:29:50] yeah. I think just noting the version number is good enough
[21:30:02] pam config: paravoid had to manually check a few before he found one
[21:30:37] I'm not sure what good a version number is; we should just replace with the most recent one from apt I think?
[21:32:09] valhallasw: you based that off an existing logrotate cron, right?
[21:32:19] yeah, it's the default logrotate cron file
[21:32:24] on precise or trusty?
[21:32:28] apart from the > /...
[21:32:29] trusty
[21:32:35] :/
[21:32:35] right, so then it's Version: 3.8.7-1ubuntu1
[21:32:43] just put that in a comment and we're good I think
[21:32:56] let me check what the version on precise is
[21:33:00] ok
[21:33:03] and whether the config is compatible :P
[21:33:07] right :)
[21:33:16] yeah, that might be another thing - this might break on precise :)
[21:33:31] (probably will not)
[21:33:42] exact same file :-p
[21:34:00] heh :D still, put in a version number?
[21:34:20] also /usr/sbin/logrotate -v /etc/logrotate.conf > /var/log/logrotate.log
[21:34:23] not >>?
[21:34:38] no, that's on purpose
[21:34:43] so we don't need to logrotate the logrotate log ;-)
[21:34:48] haha :P
[21:34:48] fine
[21:34:52] it just gets overwritten nightly
[21:34:55] which is fine, I think
[21:34:57] you should probably comment that too
[21:35:04] yeah
[21:35:10] I'm not sure what to do with stderr
[21:35:17] if I merge them, cron won't mail anymore, I think
[21:36:33] valhallasw: hmm, right.
[21:36:40] yuvipanda|maybe: let's try this first
[21:36:43] eventually-logstash, but that's a month away at least
[21:36:48] yeah
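Assembled from the review discussion above, the patched /etc/cron.daily/logrotate would look roughly like this. A sketch: the guard line is assumed from the stock trusty file shipped by logrotate 3.8.7-1ubuntu1 (the version named at 21:32), and the redirection is the local change under review:

```
#!/bin/sh
# local patch on top of the cron file from logrotate 3.8.7-1ubuntu1
test -x /usr/sbin/logrotate || exit 0

# verbose output is captured to a log that is simply overwritten each
# night, so the logrotate log never itself needs rotating; on failure
# the captured output is replayed to stderr so cron still sends mail
/usr/sbin/logrotate -v /etc/logrotate.conf > /var/log/logrotate.log 2>&1 \
    || cat /var/log/logrotate.log >&2
```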
[21:36:58] I found out that our current logstash setup uses trebuchet
[21:37:04] so my enthusiasm is a bit dampened :)
[21:37:11] oh, I actually added the >2 stuff here.
[21:37:12] even better :p
[21:37:12] hehehehehe
[21:37:39] bd808: I don't think I can even set up a standalone trebuchet deployment server, can I?
[21:37:47] that might have been partly my fault, but only partly...
[21:37:48] nope :/
[21:38:16] bd808: why not just use git clone + a deploy branch?
[21:38:16] I've been meaning to refactor that since I figured out how tangled it is with the scap role
[21:38:19] yuvipanda|maybe: i'm aligning my arrows as any sane pythonista would!
[21:38:43] valhallasw: yeah, we can start a revolution and not align arrows like crazy people but I don't know if that battle has already been lost
[21:38:44] yuvipanda|maybe: because why have yet another f'ing way to deploy?
[21:38:54] yuvipanda|maybe: we can also ditch puppet and start a revolution!
[21:38:57] bd808: I don't know, we are installing things, not really deploying
[21:39:01] no more DAGs!
[21:39:13] bd808: I mean, if Kibana was a debian package, would we be using trebuchet?
[21:39:18] s/was/had/
[21:39:19] nope
[21:39:22] same for the ES plugins
[21:39:28] so maybe I can just make a Kibana debian package :D
[21:39:35] heh
[21:39:42] don't look at kibana 4
[21:39:49] it will make your eyes bleed
[21:39:51] isn't grafana 2 based on that?
[21:39:59] it has a server side component now, I think...
[21:40:09] yeah for no good reason "yet"
[21:40:17] right
[21:40:23] but I can understand why they would want to
[21:40:43] 'omg pure client side applications!!!1' sounds a lot like 'what if everyone in the world spoke english!!!'
[21:41:07] all it does is act as a proxy though
[21:41:19] yeah, for now.
[21:41:25] assuming it doesn't stay that way forever...
[21:41:36] kibana has been rewritten from scratch 4 times. :/
[21:41:57] I contributed to the initial PHP based version
[21:42:07] * valhallasw reconsiders kibana
[21:42:19] every time they switch to adding ruby to the stack I run away
[21:42:39] php -> ruby -> nothing -> ruby
[21:42:41] I also like how the logstash page looks super web 2.0, but then it doesn't actually have a great web interface
[21:43:44] oh, and now that they are part of elastic, you have to 'register to watch' their 0-60 in 60
[21:43:48] We are running the kibana3 branch still. I haven't had the stomach to figure out how to make kibana4 deployable for WMF
[21:44:18] heh
[21:44:24] oh wow, such webscale
[21:44:42] oh yeah, and that's not 0 to 60 in 60 seconds, as one would guess, but in 60 minutes
[21:44:49] I wouldn't buy a car that does 0 to 60 in 60 minutes
[21:45:37] heh
[21:51:35] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: 20.00% of data above the critical threshold [0.0]
[21:51:35] PROBLEM - Puppet failure on tools-exec-1410 is CRITICAL: 20.00% of data above the critical threshold [0.0]
[21:51:36] PROBLEM - Puppet failure on tools-exec-1409 is CRITICAL: 22.22% of data above the critical threshold [0.0]
[21:51:59] I think I broke something
[21:52:07] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:52:37] yuvipanda|maybe: can you revert?
[21:52:37] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter requires at /etc/puppet/modules/toollabs/manifests/init.pp:217 on node i-00000bd6.eqiad.wmflabs
[21:52:39] >_<
[21:52:43] yuvipanda: can you revert?
[21:52:43] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL: 40.00% of data above the critical threshold [0.0]
[21:52:44] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:52:46] ok
[21:52:49] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:53:12] yuvipanda|maybe: https://gerrit.wikimedia.org/r/#/c/211890/
[21:53:49] because it's require and not requires
[21:53:53] valhallasw: gaaah
[21:53:53] * valhallasw groans
[21:54:44] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL: 40.00% of data above the critical threshold [0.0]
[21:54:45] PROBLEM - Puppet failure on tools-trusty is CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:54:46] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:55:09] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: 66.67% of data above the critical threshold [0.0]
[21:55:35] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:55:51] PROBLEM - Puppet failure on tools-mail is CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:55:59] and now shinken is going to stalk us for the next half-hour or so? :/
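The 21:52 catalog failure comes down to a one-letter slip: Puppet's dependency metaparameter is `require`, while `requires` is treated as an (invalid) resource parameter, and, as noted below, puppet-lint cannot flag it because a defined type may legitimately declare a parameter with that name. A hypothetical illustration, not the actual toollabs manifest code:

```puppet
# broken: 'requires' is not a metaparameter, so catalog compilation
# fails with "Invalid parameter requires"
file { '/var/log/logrotate.log':
  ensure   => file,
  requires => Package['logrotate'],
}

# fixed: the metaparameter is spelled 'require'
file { '/var/log/logrotate.log':
  ensure  => file,
  require => Package['logrotate'],
}
```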
[21:56:49] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1203 is CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:56:58] PROBLEM - Puppet failure on tools-checker-01 is CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:57:06] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1404 is CRITICAL: 44.44% of data above the critical threshold [0.0]
[21:57:30] PROBLEM - Puppet failure on tools-static-01 is CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:57:30] PROBLEM - Puppet failure on tools-exec-1215 is CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:57:46] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:57:54] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:57:58] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL: 50.00% of data above the critical threshold [0.0]
[21:58:16] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL: 55.56% of data above the critical threshold [0.0]
[22:00:13] also, why didn't puppetlint figure this out
[22:00:24] probably because a parameter /can/ be called requires or something
[22:00:26] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: 66.67% of data above the critical threshold [0.0]
[22:01:58] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: 60.00% of data above the critical threshold [0.0]
[22:04:16] yuvipanda|maybe: let's try again tomorrow, also maybe earlier so I don't have to go to bed :P
[22:11:38] yuvipanda|maybe: If I make a web proxy for a tool is it by default at .tools.wmflabs.org or just .wmflabs.org?
[22:13:14] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK: Less than 1.00% above the threshold [0.0]
[22:20:08] RECOVERY - Puppet failure on tools-exec-1408 is OK: Less than 1.00% above the threshold [0.0]
[22:20:36] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK: Less than 1.00% above the threshold [0.0]
[22:20:52] RECOVERY - Puppet failure on tools-mail is OK: Less than 1.00% above the threshold [0.0]
[22:21:34] RECOVERY - Puppet failure on tools-bastion-01 is OK: Less than 1.00% above the threshold [0.0]
[22:21:36] RECOVERY - Puppet failure on tools-exec-1410 is OK: Less than 1.00% above the threshold [0.0]
[22:21:37] RECOVERY - Puppet failure on tools-exec-1409 is OK: Less than 1.00% above the threshold [0.0]
[22:21:56] RECOVERY - Puppet failure on tools-checker-01 is OK: Less than 1.00% above the threshold [0.0]
[22:22:08] RECOVERY - Puppet failure on tools-exec-1210 is OK: Less than 1.00% above the threshold [0.0]
[22:22:28] RECOVERY - Puppet failure on tools-static-01 is OK: Less than 1.00% above the threshold [0.0]
[22:22:30] RECOVERY - Puppet failure on tools-exec-1215 is OK: Less than 1.00% above the threshold [0.0]
[22:22:43] RECOVERY - Puppet failure on tools-exec-1202 is OK: Less than 1.00% above the threshold [0.0]
[22:22:44] RECOVERY - Puppet failure on tools-bastion-02 is OK: Less than 1.00% above the threshold [0.0]
[22:22:50] RECOVERY - Puppet failure on tools-exec-1208 is OK: Less than 1.00% above the threshold [0.0]
[22:22:54] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK: Less than 1.00% above the threshold [0.0]
[22:24:40] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK: Less than 1.00% above the threshold [0.0]
[22:24:44] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK: Less than 1.00% above the threshold [0.0]
[22:24:48] RECOVERY - Puppet failure on tools-trusty is OK: Less than 1.00% above the threshold [0.0]
[22:26:52] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1203 is OK: Less than 1.00% above the threshold [0.0]
[22:26:58] RECOVERY - Puppet failure on tools-exec-catscan is OK: Less than 1.00% above the threshold [0.0]
[22:27:04] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1404 is OK: Less than 1.00% above the threshold [0.0]
[22:27:45] RECOVERY - Puppet failure on tools-exec-1207 is OK: Less than 1.00% above the threshold [0.0]
[22:28:01] RECOVERY - Puppet failure on tools-exec-1209 is OK: Less than 1.00% above the threshold [0.0]
[22:30:59] valhallasw: sorry, network issues
[22:31:15] andrewbogott: it's tools.wmflabs.org/
[22:31:26] yuvipanda|maybe: oh, that's easy then!
[22:32:22] andrewbogott: what's it for?
[22:32:34] yuvipanda|maybe: um… split horizon
[22:32:42] another different attempt
[22:33:02] andrewbogott: oh, hmm.
[22:33:28] andrewbogott: the recursor wasn't too bad
[22:33:41] yeah, requires a whole other server and component though...
[22:33:56] true
[22:34:02] using the pipe backend will work, although it will require proper name/ip pairs rather than just public/private ip pairs
[22:36:30] andrewbogott: we can map that from ldap, I guess?
[22:36:34] what's the 'pipe backend'?
[22:36:50] yuvipanda|maybe: hopefully it won't be in ldap, ultimately
[22:36:59] ah right
[22:37:31] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: 50.00% of data above the critical threshold [0.0]
[22:37:57] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: 20.00% of data above the critical threshold [0.0]
[22:38:19] PROBLEM - Puppet failure on tools-shadow is CRITICAL: 20.00% of data above the critical threshold [0.0]
[22:38:47] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL: 40.00% of data above the critical threshold [0.0]
[22:38:59] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL: 50.00% of data above the critical threshold [0.0]
[22:39:15] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL: 40.00% of data above the critical threshold [0.0]
[22:41:11] PROBLEM - Puppet failure on tools-checker-02 is CRITICAL: 55.56% of data above the critical threshold [0.0]
[22:41:19] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:41:23] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL: 30.00% of data above the critical threshold [0.0]
[22:41:24] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:43:01] PROBLEM - Puppet failure on tools-services-01 is CRITICAL: 40.00% of data above the critical threshold [0.0]
[22:43:11] PROBLEM - Puppet failure on tools-submit is CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:43:22] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: 44.44% of data above the critical threshold [0.0]
[22:44:00] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL: 50.00% of data above the critical threshold [0.0]
[22:44:04] PROBLEM - Puppet failure on tools-exec-1217 is CRITICAL: 55.56% of data above the critical threshold [0.0]
[22:44:22] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL: 66.67% of data above the critical threshold [0.0]
[22:44:40] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL: 60.00% of data above the critical threshold [0.0]
[22:52:30] RECOVERY - Puppet failure on tools-bastion-01 is OK: Less than 1.00% above the threshold [0.0]
[23:06:13] RECOVERY - Puppet failure on tools-checker-02 is OK: Less than 1.00% above the threshold [0.0]
[23:07:57] RECOVERY - Puppet failure on tools-exec-catscan is OK: Less than 1.00% above the threshold [0.0]
[23:08:21] RECOVERY - Puppet failure on tools-shadow is OK: Less than 1.00% above the threshold [0.0]
[23:08:47] RECOVERY - Puppet failure on tools-exec-1207 is OK: Less than 1.00% above the threshold [0.0]
[23:08:58] RECOVERY - Puppet failure on tools-exec-1209 is OK: Less than 1.00% above the threshold [0.0]
[23:09:05] RECOVERY - Puppet failure on tools-exec-1217 is OK: Less than 1.00% above the threshold [0.0]
[23:09:12] RECOVERY - Puppet failure on tools-exec-1204 is OK: Less than 1.00% above the threshold [0.0]
[23:09:20] RECOVERY - Puppet failure on tools-exec-1205 is OK: Less than 1.00% above the threshold [0.0]
[23:09:40] RECOVERY - Puppet failure on tools-exec-1402 is OK: Less than 1.00% above the threshold [0.0]
[23:11:20] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK: Less than 1.00% above the threshold [0.0]
[23:11:23] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK: Less than 1.00% above the threshold [0.0]
[23:11:23] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK: Less than 1.00% above the threshold [0.0]
[23:13:00] RECOVERY - Puppet failure on tools-services-01 is OK: Less than 1.00% above the threshold [0.0]
[23:13:10] RECOVERY - Puppet failure on tools-submit is OK: Less than 1.00% above the threshold [0.0]
[23:13:23] RECOVERY - Puppet failure on tools-exec-cyberbot is OK: Less than 1.00% above the threshold [0.0]
[23:13:59] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK: Less than 1.00% above the threshold [0.0]
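On the 'pipe backend' question from 22:34: PowerDNS's pipe backend hands each query to an external coprocess over stdin/stdout, which is what would let the name server answer split-horizon names from an arbitrary name-to-IP mapping. A minimal sketch of the protocol (ABI version 1), with a made-up record table standing in for whatever LDAP or another source of truth would eventually provide; this is illustrative, not a script that was deployed:

```python
#!/usr/bin/env python
# Toy PowerDNS pipe backend: answers A queries from a static map.
import sys

# hypothetical name->IP map; the IP here is invented for illustration
RECORDS = {'deployment-logstash1.eqiad.wmflabs': '10.68.16.134'}

def send(line):
    sys.stdout.write(line + '\n')
    sys.stdout.flush()

# handshake: pdns sends "HELO\t1", the backend must acknowledge with OK
if not sys.stdin.readline().startswith('HELO'):
    send('FAIL')
    sys.exit(1)
send('OK\ttoy pipe backend')

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    # queries look like: Q <tab> qname <tab> qclass <tab> qtype <tab> id <tab> remote-ip
    if fields[0] == 'Q' and len(fields) >= 6:
        qname, qclass, qtype, qid = fields[1], fields[2], fields[3], fields[4]
        ip = RECORDS.get(qname.lower())
        if ip and qtype in ('A', 'ANY'):
            # reply format: DATA qname qclass qtype ttl id content
            send('DATA\t%s\t%s\tA\t60\t%s\t%s' % (qname, qclass, qid, ip))
    send('END')
```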