[02:51:29] WHATS UP
[02:52:37] THE CHICKENS ARE COMING HOME TO ROOST
[02:54:02] ALL U HAD TO DO WAS LEAVE ME THE FUCK ALONE I TOLD U I KNEW U WERE THERE
[02:54:25] ... what.
[02:54:51] WHOEVER THE FUCK U ARE U COULD HAVE KEPT FOLLOWING NOW U GUYS ARE JUST PLAIN OLE DISRESPECTFUL
[02:54:55] THAT IS ALL
[02:56:36] That was odd.
[02:58:25] hah, yea
[05:16:17] (PS1) Tim Landscheidt: Import robots.txt [labs/toollabs] - https://gerrit.wikimedia.org/r/192761
[05:16:44] (CR) Ricordisamoa: "Are we waiting for Gerrit and .gitreview being phased out?" [labs/tools/wikicaptcha] - https://gerrit.wikimedia.org/r/180741 (owner: Ricordisamoa)
[05:20:55] (CR) Legoktm: [C: 1] "Sorry, this got lost in my review queue. I don't have +2 in this repository :(" [labs/tools/wikicaptcha] - https://gerrit.wikimedia.org/r/180741 (owner: Ricordisamoa)
[06:01:53] Guess I need some SQL help here :/ I have http://quarry.wmflabs.org/query/2236 but it's really slow, due to "distinct" I think. I'm trying to get a list of users who have used AFCH in February. Any idea on how to optimize it?
[06:02:28] It gets killed when I set a larger limit.
[06:30:20] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[06:37:59] (PS1) Tim Landscheidt: Import htmlpurifier 4.5.0 [labs/toollabs] - https://gerrit.wikimedia.org/r/192765
[06:58:14] Labs, Tool-Labs: Define expected service level agreement for tools - https://phabricator.wikimedia.org/T90535#1064911 (yuvipanda) Tools infrastructure. I'd say NFS, GridEngine, LabsDB, ToolsDB, Redis, Proxy.
[07:00:24] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:03:58] Labs, Tool-Labs, Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1064920 (yuvipanda) @scfc: We don't need full on redundancy like how we have for prod, but just enough that tools will limp along rather than just die. We...
[07:52:14] PROBLEM - Free space - all mounts on tools-redis is CRITICAL: CRITICAL: tools.tools-redis.diskspace.root.byte_percentfree.value (<25.00%)
[07:57:32] umm, YuviPanda|zzz ^ ? looks like we're running out of space again...
[08:00:53] (CR) Ricordisamoa: "You are not to blame :)" [labs/tools/wikicaptcha] - https://gerrit.wikimedia.org/r/180741 (owner: Ricordisamoa)
[08:11:59] (CR) Ricordisamoa: "I could merge this myself, but then I'd feel bad about it." [labs/tools/wikicaptcha] - https://gerrit.wikimedia.org/r/180741 (owner: Ricordisamoa)
[08:13:52] legoktm: on vacation today as well. Sorry
[08:14:05] YuviPanda|zzz: enjoy :)
[08:33:14] is there a tgr|away in the house?
[08:33:16] I guess not
[09:42:14] werdna: you can leave a message after the beep
[10:14:49] does "labs-redis" bot sound familiar to anybody?
[12:08:41] hi guys
[12:08:47] just a quick question...
[12:09:46] i was wondering: can I send tasks and execute code from a different terminal than my laptop (like my cellphone) or is it associated with the private key and there is no way to pass it there?
[12:22:57] RECOVERY - Puppet staleness on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [3600.0]
[12:31:24] tgr|away: ori suggested I ask you about the best way to add a vagrant role that executes Bower.
[12:59:57] Is something broken with Tool Labs redis?
[13:00:49] Coren: ^
[13:17:06] anomie: Sorry, no idea: I just sat down with my morning coffee. What symptoms do you see?
[13:18:16] Coren: AnomieBOT is stalled out with errors about redis, the grrrit bot is missing from IRC, and telnet tools-redis 6379 (which I think should work) from tools-login says unable to connect.
[13:18:51] Hm.
[13:19:01] * Coren looks into it.
[13:21:59] Aha. The local filesystem is OOM. I expect redis doesn't like that.
[13:23:41] yeah, sounds familiar
[13:24:02] Ew. The instance is genuinely filled to capacity it seems.
[13:25:14] Ah; the lvm space was never allocated; I can give that space to redis.
[13:29:48] anomie: redis now has an extra 100G of space. Try it?
[13:30:12] Coren: The telnet command works, although PING says it's loading
[13:30:50] anomie: Its database is some 15G atm so it may take a while to eat its index and be ready.
[13:31:04] PING now returns PONG
[13:32:37] And AnomieBOT is editing again. Looks good, Coren!
[13:34:15] Coren: I think the IRC bots need a kick, though.
[13:34:56] Do you know which bits need the kick?
[13:35:13] No idea, sorry. All I know is they're not even on IRC.
[13:35:59] Ah, whoever maintains them was nice enough to leave a readme. :-)
[13:41:43] anomie: I think that's all of 'em
[13:42:14] RECOVERY - Free space - all mounts on tools-redis is OK: OK: All targets OK
[13:43:12] Coren: I don't see that grrrit-wm or wikibugs are online yet, although I don't know if they need start-up time before joining.
[13:43:40] Ah, grrrit is a different beast not part of morebots. I'll have to look into them separately.
[13:45:03] Hm. No happy fun documentation for them.
[13:45:07] YuviPanda|zzz, ^demon|away, marktraceur, legoktm, aude, Krenair, and James_F|Away have access for grrrit-wm, according to https://wikitech.wikimedia.org/wiki/Grrrit-wm
[13:45:30] Yeah, most of 'em are off-timezone atm though.
[13:45:30] hm
[13:45:51] we had a page about kicking grrrit-wm somewhere
[13:45:53] I'm here, it needs a kick?
[13:46:04] legoktm and YuviPanda|zzz are the only ones here and listed at https://www.mediawiki.org/wiki/wikibugs
[13:46:35] marktraceur: Tool Labs redis had a full filesystem, and the bot is now MIA
[13:46:42] Your job 8426739 ("lolrrit-wm") has been submitted
[13:47:06] there it is
[13:47:39] now the question is whether the steam is running
[13:47:44] stream*
[13:47:58] oh, yep, it is (-dev)
[13:48:40] anomie: I gave wikibugs a blind reschedule; that may suffice.
[13:49:04] anomie: They are queued for restart now.
[13:50:25] anomie: They've restarted - but I have no idea how to make sure they are healthy.
[13:51:03] Coren: Thanks. I see them both talking in #wikimedia-dev, so I'd guess they're healthy.
[13:53:07] That sounds like a reasonable assumption. :-)
[14:00:37] Just accidentally deleted a file on a tool. Is there any way to restore it somehow? Maybe on a weekly backup or so?
[14:01:37] * anomie writes a README for AnomieBOT, just in case
[14:02:11] ebraminio: Not at this time. :-( Yesterday's maintenance was a necessary step towards restoring that functionality, but it's not yet available. :-(
[14:02:34] Coren: No problem, thank you
[14:04:33] * ebraminio wonders how bash completion tricked me bad for that accidental delete :(
[14:05:19] Speaking of...
[14:05:26] * Coren goes back to configuring that new hardware.
[14:57:16] <^demon|away> anomie: I have access but I don't even know how to login to it to do anything about it :p
[17:12:28] (CR) Ricordisamoa: "What about using an opt-in mechanism instead of opt-out? Would that work?" [labs/tools/faces] - https://gerrit.wikimedia.org/r/192096 (owner: Ricordisamoa)
[17:32:44] (CR) Merlijn van Deen: "GPL is not very useful for web apps; either use the AGPL if you care about copyleft, or something freer (e.g. Apache) if you don't." [labs/tools/faces] - https://gerrit.wikimedia.org/r/192096 (owner: Ricordisamoa)
[17:50:23] (CR) Ricordisamoa: "I will surely use flask-mwoauth, but the OAuth client isn't approved yet, so I don't know how to test it." [labs/tools/faces] - https://gerrit.wikimedia.org/r/192096 (owner: Ricordisamoa)
[17:56:01] (CR) Merlijn van Deen: "Even if the client is not yet approved, the user /who requested the client/ can log in already, so you should be able to test." [labs/tools/faces] - https://gerrit.wikimedia.org/r/192096 (owner: Ricordisamoa)
[18:58:57] werdna: not sure why, him, bd808 and marxarelli are the vagrant experts, I'm just fumbling around
[18:59:08] that said, happy to help if I can
[18:59:45] what do you mean by execute? install so the user can execute it from the shell / automatically execute on provision / something else?
[19:05:09] (CR) Yuvipanda: [C: 2] Import robots.txt [labs/toollabs] - https://gerrit.wikimedia.org/r/192761 (owner: Tim Landscheidt)
[20:25:53] Tool-Labs: Block xovibot user-agent - https://phabricator.wikimedia.org/T90636#1067315 (scfc) p:Triage>Normal a:scfc
[20:26:45] Any labs admins around? andrewbogott?
[20:26:47] Anyone about who can give deployment-bastion a proper kick to restart it?
[20:27:14] Krenair: It's my timezone more than Andrew's or Yuvi. :-) What's up?
[20:27:14] Krenair: I’m here.
[20:27:20] Reedy: by ‘a proper kick’ what do you mean?
[20:27:36] andrewbogott: I used restart via web interface, and it's done nothing
[20:27:44] Reedy: ok, lemme try
[20:27:46] Reedy is too fast :)
[20:27:55] Needs a hard restart type of thing
[20:28:14] Pro tip: keystone tokens still suck - I've had restarts fail until I logged back into wikitech
[20:28:24] heh
[20:28:33] I'd only just logged into wikitech (was logged out already)
[20:28:36] Especially if I've been logged in for many days
[20:28:46] deployment-bastion shows as rebooting — how long have you waited?
[20:29:07] 40 minutes?
[20:29:16] the console output is still showing the same thing
[20:29:27] Feb 25 19:48:18 * Reedy reboots it
[20:29:35] it's 20:29 now
[20:30:07] Coren, wait, instance restart is tied to the restarter's session on wikitech?
[20:30:26] Any action should be
[20:30:29] Because it's gotta auth you
[20:30:34] "can you do this"
[20:30:44] Krenair: Andrew can confirm, but it's your keystone creds that allow you to act on openstack
[20:30:51] It does show as restarting
[20:31:00] okay, so if auth fails it just silently does nothing?
[20:31:13] Krenair: That's what I've seen in the past
[20:31:26] that sucks..
[20:31:31] Does it show as it's restarting?
[20:31:49] I think that if Reedy can see the instance in the list to reboot it, his token is surely working
[20:32:03] "you don't have credentials, so i'm gonna make it look as if i'm rebooting but then just stop" sounds very odd
[20:32:31] "SHUTOFF (powering-on)" is the status now
[20:32:37] I think when it happened to me, the instance page had been there open for many hours; the token expired in the interval
[20:34:11] Krenair: yeah, that’s my doing. It doesn’t look like it’s coming back though
[20:34:35] you ran some command on the host to try to force it to stop?
[20:34:58] most vm environments have control stuff that acts from the outside
[20:35:06] simulating power button presses essentially
[20:35:22] Krenair: yes
[20:38:52] ok, you didn't just press the restart button again from wikitech then :)
[20:38:58] so now it's just "ERROR"
[20:41:06] http://i1.ytimg.com/vi/dXbp6y_7eac/hqdefault.jpg
[20:45:14] Reedy, Krenair: how’s that, better?
[20:45:34] andrewbogott: it's up and ssh-able at least ;)
[20:45:35] Thanks
[20:45:43] ok
[20:45:48] andrewbogott: What did you inflict on it?
[20:46:05] Reedy, Krenair, you were trying all the right things. I think a service crashed on virt1002
[20:46:30] Coren: the nova-compute and qemu logs were very quiet, so I restarted nova-compute on virt1002 and everything perked up.
[20:46:53] andrewbogott: Hm. I wish we could actually /monitor/ that
[20:46:57] (I also tried a long series of “nova reset-state; nova stop; nova start” but that didn’t do anything until I restarted the service.)
[20:47:27] It should be possible, in theory -compute calls home with periodic status updates.
[20:50:30] Labs: Buy at least one more virt server for eqiad - https://phabricator.wikimedia.org/T90783#1067394 (Andrew) NEW a:RobH
[20:50:54] * Reedy removes some old kernels from deployment-bastion
[20:51:32] Labs, Monitoring: Monitor nova services - https://phabricator.wikimedia.org/T90784#1067402 (Andrew) NEW a:Andrew
[20:52:04] Reedy: if there are old kernels does that mean you’ve been doing kernel upgrades on that box? I’m surprised that that works!
[20:52:25] Well, it's on 3.2.0-59... It's got up to 3.2.0-77 installed
[20:52:49] hm
[20:52:52] it complains /boot/grub/menu.lst is modified
[20:53:09] That’s probably on purpose, we hack the grub menu to make boot faster
[20:53:28] yeah
[20:53:41] I didn't think it was the best idea to just overwrite it
[20:54:34] maybe the auto security updates feature updates the kernel
[20:54:50] or attempts to :)
[20:58:03] │ -kernel /boot/vmlinuz-3.2.0-59-virtual root=UUID=3cbf0c17-65ab-4d22-a2f1-69d1697cf8d0 ro quiet splash console=ttyS0
[20:58:03] │ +kernel /boot/vmlinuz-3.2.0-77-virtual root=UUID=3cbf0c17-65ab-4d22-a2f1-69d1697cf8d0 ro elevator=deadline splash
[20:58:16] oh, because the unused kernels are newer than the one it’s actually running?
[20:58:43] yup
[20:59:03] and there was a load of them
[20:59:55] I wonder if it'd be worth duplicating the entries, changing for 77 and rebooting...
[20:59:58] then remove the old ones
[21:00:26] Better yet just build a new instance and throw the old one away
[21:00:34] or do nothing :)
[21:00:52] heh
[21:01:02] building a new one probably isn't worth the effort :)
[21:01:32] If it’s effort then your design is broken :p
[21:03:10] Stuff not documented etc
[21:03:10] Labs, Wikimedia-Labs-wikitech-interface, operations: 404 on http://wmflabs.org (wikitech migration) - https://phabricator.wikimedia.org/T90787#1067457 (Dzahn)
[21:03:17] It's pretty entangled
[21:03:43] I think it's on the todo list for people to rebuild it all from scratch etc
[21:09:13] The /bastion/ is entangled? Or are you speaking of beta in general?
[21:14:34] Tool-Labs: Job labs-toollabs-debian-glue fails - https://phabricator.wikimedia.org/T90790#1067527 (scfc) NEW a:scfc
[21:42:49] Labs, Wikimedia-Labs-wikitech-interface, operations: 404 on http://wmflabs.org (wikitech migration) - https://phabricator.wikimedia.org/T90787#1067708 (Andrew) thanks for diagnosing this! I've updated the dns rec (it was in LDAP) so this should sort itself out soon.
[21:50:26] Labs, Wikimedia-Labs-wikitech-interface, operations: 404 on http://wmflabs.org (wikitech migration) - https://phabricator.wikimedia.org/T90787#1067732 (Andrew) Open>Resolved a:Andrew
[22:04:05] hi, the webservice for erwin is giving 504 errors again, I can of course restart it myself, but is that something bigbrother can monitor?
[22:04:55] akoopal: mind if i just copy/paste that line in a bug? that is probably most effective
[22:05:10] mutante: not at all
[22:05:42] so I assume it is not possible at the moment
[22:06:18] what's erwin? i'm not familiar with it
[22:06:38] https://tools.wmflabs.org/erwin85/
[22:07:06] tools that a user 'erwin' wrote, and were migrated, some are quite popular
[22:07:09] akoopal: i think it is, but it needs somebody to work on it and many things are in the queue
[22:07:41] ok, so not a simple line in .bigbrotherrc
[22:08:03] i don't know, but i'm making the ticket to find out
[22:09:03] thanks
[22:09:34] i see your project already has a phabricator project, i'll use that
[22:10:31] Labs, Tool-Labs, Tool-Labs-tools-Erwin's-tools, Monitoring: monitor webservice / 504 errors for erwin - https://phabricator.wikimedia.org/T90800#1067807 (Dzahn) NEW
[22:10:41] will the toolserver admins see that?
[22:10:57] toolserver is dead
[22:11:12] he means tool labs i'm sure
[22:11:17] :)
[22:11:17] and yea, i added the tool labs project
[22:11:29] Labs, Tool-Labs, Tool-Labs-tools-Erwin's-tools, Monitoring: monitor webservice / 504 errors for erwin - https://phabricator.wikimedia.org/T90800#1067815 (Dzahn)
[22:11:33] ah :-)
[22:11:34] akoopal: https://phabricator.wikimedia.org/T90800
[22:11:39] heh, yea
[22:11:54] saw it, thanks to wikibugs :-)
[22:29:06] RECOVERY - Host tools-webproxy-jessie is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms
[22:33:24] Labs, Tool-Labs, Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1067877 (Pine) Just some comments about why this is important: * Tool Labs is supposed to be more stable than Beta Labs. * Tool Labs is supposed to be stable enough that bot operators can consi...
[22:36:04] PROBLEM - Host tools-webproxy-jessie is DOWN: CRITICAL - Host Unreachable (10.68.17.147)
[22:39:17] Labs, Tool-Labs, Tool-Labs-tools-Erwin's-tools, Monitoring: monitor webservice / 504 errors for erwin - https://phabricator.wikimedia.org/T90800#1067913 (Dzahn) akoopal asked if this is something that can be monitored by bigbrother. can it? is this something that is part of T90569 then?
[22:39:58] Labs, Tool-Labs: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1062122 (Dzahn) would T90800 be part of this or does that need shinken monitoring or other?
[22:42:38] Labs, Tool-Labs, Tool-Labs-tools-Erwin's-tools, Monitoring: monitor webservice / 504 errors for erwin - https://phabricator.wikimedia.org/T90800#1067935 (valhallasw)
[23:16:09] Labs, Tool-Labs, Tool-Labs-tools-Erwin's-tools, Monitoring: monitor webservice / 504 errors for erwin - https://phabricator.wikimedia.org/T90800#1068068 (Akoopal) The webservice itself is monitored via the .bigbrotherrc, but it looks to be only looking for the process. If the webservice is running b...
[23:29:20] Labs, Tool-Labs, Tool-Labs-tools-Erwin's-tools, Monitoring: monitor webservice / 504 errors for erwin - https://phabricator.wikimedia.org/T90800#1068135 (coren) That is correct, bigbrother checks that the //job// is running (and therefore the process is alive), but doesn't know what it is supposed...