[00:01:58] Thanks Coren, seems to have worked..
[01:06:20] [bz] (NEW - created by: Pietrodn, priority: Unprioritized - normal) [Bug 52370] Replicated DB fawiki_p is missing revision table - https://bugzilla.wikimedia.org/show_bug.cgi?id=52370
[01:41:39] tools-login is slow again
[01:41:55] can't get past "If you are having access problems, please see: https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances"
[04:11:48] (CR) Tim Landscheidt: "I meant "dialog --menu"; for playing around: "dialog --menu 'Select an snapshot to mount' 0 40 10 $(sed -e p /data/project/.snaplist) 2> a" [labs/toollabs] - https://gerrit.wikimedia.org/r/76313 (owner: Platonides)
[06:30:55] ..
[06:30:57] meh
[06:31:02] &ping]
[06:31:04] &ping
[06:31:04] Pinging all local filesystems, hold on
[06:31:05] Written and deleted 4 bytes on /tmp in 00:00:00.0005390
[06:31:22] Written and deleted 4 bytes on /data/project in 00:00:17.9862600
[06:47:43] @notify killiondude
[06:47:43] I'll let you know when I see killiondude around here
[09:22:22] http://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin&returnto=Main+Page
[09:22:34] PHP fatal error in /data/project/apache/common-local/php-master/includes/AutoLoader.php line 1158:
[09:22:34] require() [function.require]: Failed opening required '/data/project/apache/common-local/php-master/ext ...
[09:47:12] !ping
[09:47:12] !pong
[09:48:51] &ping
[09:48:51] Pinging all local filesystems, hold on
[09:48:52] Written and deleted 4 bytes on /tmp in 00:00:00.0006640
[09:48:53] Written and deleted 4 bytes on /data/project in 00:00:00.0069240
[11:20:16] (PS1) Yuvipanda: Updated requirements.txt [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/77495
[11:20:41] (CR) Yuvipanda: [C: 2 V: 2] Updated requirements.txt [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/77495 (owner: Yuvipanda)
[12:44:28] &ping
[12:44:29] Pinging all local filesystems, hold on
[12:44:29] Written and deleted 4 bytes on /tmp in 00:00:00.0005540
[12:44:34] blergh
[12:46:01] so painful
[12:46:51] Written and deleted 4 bytes on /data/project in 00:02:22.7323050
[12:49:19] [bz] (NEW - created by: Johannes Kroll (WMDE), priority: Unprioritized - critical) [Bug 52500] Broken Disk Controller is Broken - https://bugzilla.wikimedia.org/show_bug.cgi?id=52500
[12:58:28] o rly, disk controller?
[12:58:37] wasn't it NFS?
[13:03:19] MaxSem: whatever it is, it's rather annoying... :<
[15:59:51] &ping
[15:59:51] Pinging all local filesystems, hold on
[15:59:52] Written and deleted 4 bytes on /tmp in 00:00:00.0007450
[16:00:12] ...
[16:01:00] Written and deleted 4 bytes on /data/project in 00:01:09.5011990
[16:04:58] :<
[16:27:09] Coren: around? And in theory I can queue up 150 jobs? 15 will execute at a time? When one finishes, the next will start?
[16:31:10] and I won't block anyone else's jobs? :P
[16:32:44] or petan, do you know the answer? :>
[16:35:16] I think petan is away for a couple weeks.
[16:35:51] true!
[16:36:17] well, as far as I know it should work, so let's go for it!
[17:20:17] addshore: 150 jobs? same redis-based thing? :D
[17:21:10] The load on the tools webserver isn't dropping. :/
[17:22:23] There it goes.
[17:22:26] Finally.
[17:22:49] yuvipanda: no actually
[17:22:50] Around 03:13:23Z, exim "fork of queue-runner process failed: Cannot allocate memory"'d on tools-webserver-01, BTW, so we may need to look at how to limit the resources consumed by the webserver.
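
(The &ping output above comes from a bot that measures filesystem latency by timing a 4-byte write-and-delete on each mount. A minimal Python sketch of such a probe — the bot's real implementation is something else; the fsync and temp-file naming here are assumptions:)

    import os
    import time
    import uuid

    def ping_filesystem(mount_point):
        """Time writing and deleting 4 bytes under mount_point, as &ping reports."""
        path = os.path.join(mount_point, "ping-%s" % uuid.uuid4())
        start = time.time()
        with open(path, "wb") as f:
            f.write(b"ping")          # the 4 bytes from the bot's output
            f.flush()
            os.fsync(f.fileno())      # force the write to actually reach the filesystem
        os.unlink(path)
        return time.time() - start

    for fs in ("/tmp", "/data/project"):
        print("Written and deleted 4 bytes on %s in %.7f s" % (fs, ping_filesystem(fs)))
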
[17:23:16] scfc_de: that's not good :O
[17:24:03] Cyberpower678: the disk thing is just killing everything http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&m=load_one&s=by+name&c=tools&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[17:24:18] Setting up a new webserver and splitting the load would probably work for some time; maybe Coren can look into that in HKG.
[17:25:13] Yeah, the disk issue is getting worse as more people use the tool labs.
[17:25:25] This is why it's priority #1 once opsen arrive in HKG.
[17:25:41] mhhm, Coren would fewer jobs writing to disk help at all? :P
[17:26:21] or just writing to something that isn't nfs
[17:26:39] yuvipanda: Does redis have some interesting statistics for Ganglia? Number of keys? Number of connections? Etc.
[17:26:48] scfc_de: nope, nothing at all
[17:27:03] scfc_de: we've sort of disabled all the diagnostic functions, so no way to know either :P
[17:27:08] only thing we can monitor right now is memory usage
[17:27:12] addshore: It'd possibly reduce the visibility somewhat, but it's not going to get better on its own.
[17:27:19] however, it is possible to set all these up. just a bit of a hack
[17:28:15] Coren: what happened halfway through last month? :P
[17:28:38] *goes to read sal*
[17:30:12] yuvipanda: I was just thinking about simple statistics like we have for SGE or replag.
[17:30:21] what do you have in mind?
[17:30:29] memory usage would be a good baseline, methinks
[17:30:46] indeed
[17:30:49] and i/o
[17:30:51] :>
[17:30:56] redis should ideally do no IO
[17:31:16] I mean net io :)
[17:31:24] * Coren ponders.
[17:31:33] Hey, I gots an ideas for a stopgap!
[17:31:41] what is it? :p
[17:31:59] yuvipanda: *Meaningful* numbers! :-) Like "x keys are stored right now": starts at 0 after reboot, shows actual use by tools.
[17:32:13] scfc_de: agreed, yeah.
[17:32:14] As far as I can tell, the controller stall issue was nowhere /near/ as bad with the previous kernel, which (on the other hand) exploded reliably at 14-day intervals.
[17:32:39] scfc_de: so currently, we disable the commands that let us do that.
[17:32:39] when did you change kernel? :)
[17:32:55] yuvipanda: For security?
[17:32:58] scfc_de: yup
[17:33:05] Coren: You want to reboot the server weekly?
[17:33:23] addshore: ... a bit over two weeks ago. But then again, we switched when the previous problem was clearly heading towards a catastrophe, in response to what was /known/ to be a nasty kernel bug.
[17:33:36] scfc_de: I wrote http://yuvi.in/blog/attempting-to-secure-redis-in-a-multi-tenant-environment/ about that
[17:34:00] so Coren that explains the dip on ganglia halfway through last month! :)
[17:34:46] So I'm not sure how bad things would be with the older kernel in the meantime. It might buy us a bit of time.
[17:34:47] which also kind of looks like the point where the fluctuating load and responsiveness start
[17:34:49] scfc_de: so to get stats, we have two options. 1 is to parse the AOF (append-only file, the persistence mechanism) ourselves. This can give us stats as detailed as we want, is sort of 'clean', but also clearly stupid :P
[17:35:09] scfc_de: the other option is to find some way to enable those commands that we disabled, but accessible only from the host itself (tools-redis). Then we can get those stats trivially
[17:35:32] * Coren ponders.
[17:35:45] Coren: Whether NFS stalls for a few minutes due to an error or because you set up another kernel, I think for the users the effect is the same :-).
[17:36:04] yuvipanda: What's the command to get the number of keys?
[17:36:06] indeed ;p
[17:36:13] scfc_de: let me find, moment
[17:36:22] scfc_de: No, what I mean is the issue right now is clearly driven by the amount of traffic (look at the daily pattern, and the continually increasing baseline)
[17:36:43] yuvipanda: KEYS seems to give a *list* of all keys, we'd need a number.
[17:36:49] scfc_de: It's entirely possible that the previous problem was also driven by traffic and would be worse than it was if I roll back.
[17:36:57] http://redis.io/commands/info
[17:36:58] scfc_de: ^
[17:37:16] let me see if that provides it?
[17:37:19] hey pranavrc
[17:37:20] Coren: So your idea is ... ?
[17:37:24] !tools-help
[17:37:24] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[17:37:30] pranavrc: ^ is toollabs :)
[17:37:31] scfc_de: I dunno yet. I'm thinking.
[17:37:50] heya
[17:37:50] wouldn't it be great if you could pipe the output of all jobs to something that wasn't the nfs server?
[17:37:51] yuvipanda: But that's not disabled ATM, is it? redis-cli to test?
[17:37:55] Thanks! I'm gonna go through that
[17:38:04] scfc_de: true! i don't think that is disabled
[17:38:12] pranavrc: yeah, redis-cli to test on tools-redis?
[17:38:13] imo i imagine the jobs' .out files are the majority of the nfs reads/writes
[17:38:21] or tools-mc, rather, since nobody is using tools-redis yet
[17:38:24] except addshore, i think
[17:38:33] yuvipanda: indeed I am :)
[17:38:43] addshore: I doubt it is.
[17:39:02] Coren: any way to find out? ;p
[17:39:17] yuvipanda: "connected_clients:17", "db0:keys=5,expires=0", doesn't look too bad.
[17:39:23] heh :D
[17:39:34] scfc_de: do it on -redis now ;p
[17:39:36] scfc_de: can I get permissions to access tools-mc and tools-redis?
[17:39:39] addshore: Not trivially.
[17:40:14] yuvipanda: Would need a Puppet change, I believe; it's limited to the local-admin group.
[17:40:33] not worth it if it is too much work
[17:40:36] Coren: I imagine you have separate load statistics or graphs for the nfs server?
[17:40:39] so nevermind :)
[17:40:44] addshore: Part of the problem is that one of the features we use (snapshots for timetravel) requires a more recent kernel than is otherwise used in production.
[17:40:45] scfc_de: can you pastebin the output of the full INFO command?
[17:41:15] yuvipanda: You read minds :-). Moment.
[17:41:35] addshore: Oh, we have the separate graphs for the NFS server, but that doesn't have file-level granularity. :-)
[17:41:50] hehe :p
[17:42:05] where are the graphs? ;p
[17:42:16] oh wait, i think i have seen them, labstore right?
[17:42:28] yuvipanda: http://pastebin.com/z9Ey3eRw
[17:42:30] * Coren nods.
[17:42:44] addshore: Want to see the real-time pain? http://ganglia.wikimedia.org/latest/graph.php?r=1hr&z=xlarge&h=labstore3.pmtpa.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Labs+NFS+cluster+pmtpa
[17:43:02] sheesh. It's even worse than yesterday.
[17:43:30] Alright, I'm going to revert to a 3.5 kernel with the known long-term bug, see if that helps.
[17:43:32] Coren: how can I pause a grid job?
[17:43:41] Because right now it's clearly unusable.
[17:44:03] Coren: Before you do that: Is the traffic mainly coming from Tools?
[17:44:24] I think you had ntop installed last time to see that.
[17:44:31] addshore: qmod -sj -j
[17:44:50] scfc_de: yeah, that looks good. no sensitive info
[17:45:08] scfc_de: Yeah, I removed ntop because it's teh evil on resources, but I have something else. It was, last I checked; lemme fire it up again.
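
(On "how can I pause a grid job?": qmod -sj suspends a gridengine job and qmod -usj resumes it; the job id in Coren's answer above is elided, so the 1234 below is made up. A small wrapper sketch:)

    import subprocess

    def set_job_suspended(job_id, suspend=True):
        """Suspend or unsuspend a gridengine job via qmod (-sj / -usj)."""
        flag = "-sj" if suspend else "-usj"
        subprocess.check_call(["qmod", flag, str(job_id)])

    # e.g. set_job_suspended(1234)         # pause the job
    #      set_job_suspended(1234, False)  # resume it
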
[17:45:20] pranavrc: right now the majority of the tools are written in php / python, so it would be nice to get tools in other languages :D
[17:45:33] yuvipanda: What's "db0" and "db9"?
[17:45:34] I can get packages installed for you (puppet, yay!) and there is a good enough build environment
[17:45:44] scfc_de: there are 16 'db's that can be used by applications
[17:45:52] scfc_de: they are just different key namespaces, nothing else
[17:46:03] [bz] (NEW - created by: Steinsplitter, priority: Unprioritized - normal) [Bug 52502] deleting of service group - https://bugzilla.wikimedia.org/show_bug.cgi?id=52502
[17:46:20] yuvipanda: The app chooses which db to use?
[17:46:23] scfc_de: yeah
[17:46:35] yuvipanda: k
[17:46:51] yuvipanda: So number of keys = sum(number of keys dbx)?
[17:46:57] yeah
[17:47:07] Oh, yeah, and of course there's an actual stall just as I try. This is unusable as [bleep]
[17:47:14] :>
[17:47:33] :-)
[17:47:43] Coren: it's fine, as long as you don't have to do anything ;p
[17:48:00] yuvipanda: Oh, we even have keyspace_hits:20457 and keyspace_misses:22 as meaningful traffic statistics.
[17:48:09] yeah
[17:48:15] also memory!
[17:48:15] yuvipanda, great! I'm trying to think of something I could do. Do you guys have a roadmap/ideas that may need attention?
[17:48:43] yuvipanda: (Though probably accumulated, so harder to depict in Ganglia properly.)
[17:50:14] addshore: Yeah, the tools exec hosts are the overwhelming majority of the traffic.
[17:50:44] It wasn't so bad when it stalled occasionally, once every other hour or so, but right now it's [bleep].
[17:51:16] Coren: BTW, https://gerrit.wikimedia.org/r/#/c/77452/ might reduce disk IO a bit.
[17:51:20] scfc_de: yeah, but there's connected clients, memory, aof size, pubsub_channels, keys
[17:51:24] wonder if you will be able to see any noticeable change at all now that I've suspended all of my jobs :P
[17:51:48] pranavrc: hmm, nothing of that sort, actually - especially considering you'd rather write new tools than fix tools in php :D
[17:52:25] yuvipanda, :D Alright, I'll try to think of something I can start with. Thanks for the resources!
[17:52:32] pranavrc: :D
[17:53:16] * yuvipanda pokes Coren with https://gerrit.wikimedia.org/r/#/c/77454/
[17:53:58] pranavrc: btw, any idea what language/environment you are planning on using? if it is not installed I should get that done now, before a bunch of us fly out to hong kong
[17:54:03] yuvipanda: That's really not the time for me. :-)
[17:54:19] Coren: time or type? :) but yeah, agreed. I shall try to spam less :)
[17:54:50] yuvipanda: I'll set up a cron job on tools-redis to test some statistics in a few hours.
[17:54:57] scfc_de: \o/
[17:54:59] scfc_de: Just pushed that change.
[17:55:49] pranavrc: btw, if you want to do some angularjs/reactjs type stuff, perhaps you can replace http://tools.wmflabs.org/?status
[17:56:19] yuvipanda, you'll be returning on the 20th, right? That's fine, I have some stuff to complete before that, and I'll look up the workflow and try to brainstorm something out before then. Enjoy your Hong Kong trip ;)
[17:56:24] pranavrc: ah, okay :)
[17:56:26] yeah, 20th
[17:56:50] Coren: how are would it be to set up somewhere else for jobs to dump their output and errors to?
[17:56:53] *hard
[17:58:14] Coren: Thanks.
[17:58:46] addshore: put them in redis! :P
[17:58:53] (only half joking)
[17:59:07] hah!
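
(All the numbers discussed above — connected_clients, per-db key counts, keyspace_hits/misses, memory — come out of a single INFO call, so the cron job scfc_de mentions could look roughly like this sketch with the redis-py client. Running it on tools-redis itself and INFO staying enabled there are assumptions:)

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)  # run on tools-redis itself
    info = r.info()

    # number of keys = sum(number of keys in dbX) across the 16 key namespaces;
    # INFO only lists the dbs that are non-empty
    total_keys = sum(info[db]["keys"] for db in info if db.startswith("db"))

    print("connected_clients: %s" % info["connected_clients"])
    print("used_memory: %s" % info["used_memory"])
    print("keyspace_hits: %s" % info["keyspace_hits"])
    print("keyspace_misses: %s" % info["keyspace_misses"])
    print("total_keys: %s" % total_keys)
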
[17:59:15] but really, redis doesn't touch NFS at all
[17:59:16] so
[17:59:17] put them on gluster ;p
[17:59:31] !log deployment-prep looks like simplewiki's search index finally finished. party time.
[17:59:34] Logged the message, Master
[17:59:50] addshore: heh :P it writes its AOF file to its local disk, so no NFS issues
[18:02:42] yuvipanda, can you elaborate a bit on that status page? Are you aiming for an asynchronously updating status page?
[18:02:49] pranavrc: exactly, yeah
[18:03:02] the current one is static, and also eeky php calling out to shell
[18:04:11] Ah, I might mess around with that one. I'll see what I can do. Can you link me to the source?
[18:04:52] pranavrc: sure, moment
[18:05:37] pranavrc: https://github.com/wikimedia/labs-toollabs/tree/master/www
[18:05:57] but since that'll run in 'production' mode, I guess lisp/scheme would not be the best of ideas :P
[18:06:06] pranavrc: plus it'll have to interact with the commandline a fair bit too
[18:06:27] That sounds like a python job
[18:06:58] Anyway, doesn't matter, I'll feed on it until you get back. Python's fine if it's fine with you
[18:07:08] pranavrc: yeah, python's what I was thinking too
[18:07:15] pranavrc: have you used the 'sh' module? it's pretty kickass
[18:07:46] hmm no, I use subprocess for that kinda stuff, but I'm taking a look
[18:08:24] pranavrc: yeah, a lot better than subprocess, IMO
[18:09:51] Ok, that should do. Enough stuff on the platter to check out. Thanks!
[18:10:06] pranavrc: :)
[19:09:07] scfc_de: around?
[19:11:44] @search apply
[19:11:44] No results were found; remember, the bot is searching through the content of keys and their names
[19:11:48] @search tools
[19:11:48] Results (Found 10): morebots, labs-morebots, stats, add, tools-bug, status, tools-web, pastebin, awstats, filemoves.py,
[19:15:51] !toolsrequests
[19:15:55] !toolrequests
[19:15:58] @search request
[19:15:58] Results (Found 1): project-access,
[19:16:02] !project-access
[19:16:02] To request access to a project, use a project's discussion page; see !project-discuss
[19:16:11] hey nerus
[19:16:14] !tools-doc
[19:16:19] !tools-help
[19:16:19] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[19:16:21] hey yuvipanda! :-)
[19:16:28] nerus: that link has info on the toollabs environment :)
[19:17:06] got it, thanks!
[19:17:21] scfc_de: Ryan_Lane can you grant nerus access to the tools project?
[19:17:22] Coren: ^
[19:17:39] he made a request, I think
[19:18:21] @search shell
[19:18:21] Results (Found 1): account-questions,
[19:18:25] !account-questions
[19:18:26] I need the following info from you: 1. Your preferred wiki user name. This will also be your git username, so if you'd prefer this to be your real name, then provide your real name. 2. Your preferred email address. 3. Your SVN account name, or your preferred shell account name, if you do not have SVN access.
[19:18:31] hmm, old!
[19:19:21] nerus: I granted you shell access
[19:19:43] nerus: read up on the docs, someone should be around soon to grant you access to the project itself :)
[19:20:01] :O
[19:21:30] nerus, yuvipanda: Done.
[19:21:38] scfc_de: \o/
[19:21:48] scfc_de: nerus is a good friend of mine who I have conned into replacing tools.wmflabs.org/?status :D
[19:22:10] scfc_de: Thanks :-)
[19:22:18] * nerus in general going :O
[19:22:34] nerus: try sshing to tools-login.wmflabs.org?
[19:22:37] and of course yuvipanda thanks :D
[19:23:30] Up for 5 days :)
[19:23:49] yuvipanda: i am in!
[19:23:56] nerus: \o/
[19:24:08] yuvipanda: You have shellmanager (?) rights? Then in #wikimedia-labs-requests you can "!rq $USERID" for the shell request pages (and "!tr $USERID" for the Tools project ones).
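
(The 'sh' module yuvipanda recommends above wraps external commands as Python functions, which is what makes it pleasant for shell-heavy glue like the status page. A quick comparison sketch; the qstat invocation and the user name are just illustrative:)

    import subprocess
    import sh  # https://github.com/amoffat/sh

    # subprocess: build argv lists by hand
    out1 = subprocess.check_output(["qstat", "-u", "tools.someuser"])

    # sh: commands become functions, arguments become Python arguments,
    # and the return value stringifies to the command's stdout
    out2 = sh.qstat("-u", "tools.someuser")

    # keyword arguments map to options, e.g. sh.ls("/tmp", l=True) for ls -l /tmp
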
[19:24:09] nerus: yeah, we've some NFS issues that kinda kill our uptime. hardware issues, I'm told
[19:24:23] scfc_de: I've shellmanager, but I can only grant shell, not access to tools
[19:24:52] !tools-help
[19:24:52] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[19:25:05] nerus: ^ is fairly complete, and should help you understand the underlying infrastructure we have :)
[19:25:23] scfc_de: tools.wmflabs.org/?status doesn't really have any info that's not accessible to all tools, does it?
[19:25:27] like, regular tools
[19:27:39] yuvipanda: I think it doesn't have any privileges, yes.
[19:28:44] nerus: Spammed your talk page with some helpful links for reference.
[19:29:18] scfc_de: thanks!
[19:29:34] * nerus is getting too much <3 ? :)
[19:51:05] %ping
[19:51:31] pfft :P
[19:51:37] meh
[19:51:42] it started working just as I typed it :D
[19:52:16] &ping
[19:52:16] Pinging all local filesystems, hold on
[19:52:18] Written and deleted 4 bytes on /tmp in 00:00:00.0005340
[19:52:19] Written and deleted 4 bytes on /data/project in 00:00:00.0080820
[19:55:27] What would people think if I changed some of the filesystem caching behaviour in a way that made hangs less frequent but slightly longer to recover?
[19:55:44] depends on how slightly longer :P
[19:56:24] Hard to predict in advance; my guess is that it'd trigger much less often but probably get bursts of 3-4 back-to-back stalls when it does.
[19:56:45] is it just 1 server out of several identical ones that is malfunctioning?
[19:57:56] OuKB: There is just the one server; the redundancy is at the raid level. There is a second one we can bring up (and that's part of what we'll be doing in a few days), but it implies downtime I'd rather avoid.
[19:58:21] mmm, spof
[19:58:46] Coren: it feels like 3-4 back-to-back bursts already ;p
[19:59:18] "spof"?
[19:59:28] I say go for it Coren! what's the worst that could happen, right ;p
[19:59:32] addshore: Yeah, it got rather sucky.
[19:59:37] addshore: +1
[19:59:42] * Coren turned on caching to the max.
[19:59:47] turns*
[19:59:51] :>
[19:59:54] https://en.wikipedia.org/wiki/SPOF
[20:00:11] I'm not the, I'm just Max
[20:01:12] ah. Yeah, to a point; although having a ~3h switchover time is well within our expected service level in case of hardware failure. Cold swap was entirely sufficient for our needs -- provided that the hardware didn't have issues to begin with.
[20:01:49] The problem is that it's not entirely clear /which/ component has failed, so just switching things around blindly is not a smart move.
[20:02:27] On Toolserver, the "automatic" switchover between two boxes which had physical access to the disks caused much more pain than here.
[20:02:55] the solution would be a distributed FS
[20:03:02] like Gluster
[20:03:03] * yuvipanda puts OuKB on GlusterFS
[20:03:07] oh shi,,,
[20:03:24] now you won't respond for a while and someone will need to get you unstuck :P
[20:03:31] * Coren giggles.
[20:04:20] OuKB: glusterFS is the nightmare we ran away /from/ -- the stalls are a minor amusing inconvenience compared to the data loss and complete outages caused by gluster. :-)
[20:04:32] I know
[20:04:52] I advocated to KILLKILLKILLGLUSTER myself ;)
[20:05:00] xD
[20:05:09] At any rate, with the caching I just turned on, chances are that the problem will be much less salient until we find a true fix.
[20:05:43] Coren: how exactly does 'the caching' work?
[20:06:09] ie. if nfs has an issue and stuff is unreachable, will the cached stuff still be reachable? :O
[20:06:10] addshore: It's software raid. I just cranked up dirty_ratio and dirty_writeback
[20:06:22] ahh :P
[20:06:40] Coren: addshore has wm-bot's password; he could pm you for next time.
[20:07:43] And given the amount of ram that box has, actual flushes to disk are going to be well spaced out, which means that the stalls are also going to be infrequent. But there'll be a LOT of stuff to flush, hence the probability of back-to-back stalls.
[20:08:12] haha :)
[20:08:27] bleh, on Windows, SMB just works :P
[20:10:05] yeah, I've been advocating SMB for a while now!
[20:10:17] WHY WON'T ANYONE LISTEN TO MY OBVIOUSLY SUPERIOR SUGGESTIONS!!!1
[20:10:24] Coren: how about one massive SSD and save everything to that?
[20:10:52] :.
[20:10:53] :.
[20:10:58] <:
[20:11:44] addshore: Sure! When will you be donating the SSDs to the Foundation? I'll make sure we have someone at hand to install them. :-)
[20:11:55] :P
[20:12:07] TIL addshore is literally Samsung!
[20:13:50] If the problem lies *between* the server and the disks, switching to SSDs wouldn't necessarily change anything :-).
[20:14:08] Well, besides the speed ...
[20:14:29] scfc_de: That too. I don't mind the SSDs anyways though. :-)
[20:15:59] Oh, no, if someone (Samsung?) wants to donate a bunch, keep 'em coming. If there's a surplus, I'd take one, too :-).
[20:16:27] * Technical_13 still wonders if SMB doesn't simply stand for Some Magic Bean...
[20:17:50] SMB is http://img.gawkerassets.com/img/17mckpbjndajupng/original.png
[20:18:18] * Coren politely points out that nothing has used SMB in the 20th century.
[20:18:24] You probably me CIFS.
[20:18:26] mean*
[20:18:41] 21st*
[20:18:46] * Coren gives up.
[20:19:22] Coren: You switched at 20:00Z? http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=labstore3.pmtpa.wmnet&m=cpu_report&s=by+name&mc=2&g=load_report&c=Labs+NFS+cluster+pmtpa
[20:21:23] Coren, lol, [[CIFS]] redirects to [[SMB]]
[20:23:33] I didn't think anyone had planted any magic beans since the mid 14th century...
[20:40:08] [bz] (NEW - created by: Tim Landscheidt, priority: Unprioritized - normal) [Bug 52236] rotate_logs.sh throws exceptions - https://bugzilla.wikimedia.org/show_bug.cgi?id=52236
[20:44:25] * addshore likes being able to queue jobs :)
[20:44:34] how many do you have now? :P
[20:44:57] [bz] (RESOLVED - created by: silke.meyer, priority: Unprioritized - major) [Bug 51890] Bots component in bugzilla is obsolete - https://bugzilla.wikimedia.org/show_bug.cgi?id=51890
[20:45:08] just under 150
[20:45:25] what jobs, exactly?
[20:45:42] so everyone should probably just ignore the fact there are currently over 100 jobs waiting, nothing is broken! it's just me queueing them up so I don't have to think about them until monday :)
[20:45:51] yuvipanda: importing coordinates to wikidata
[20:46:00] aah
[20:46:00] :)
[20:46:01] split up into smaller chunks
[20:46:07] no redis?
[20:46:23] nah :P don't think redis would speed this one up :P
[20:46:47] the painful thing about the last one was that my database was on tools-db and I needed to read and write to it continuously
[20:47:04] what were you reading and writing continuously?
[20:47:06] * addshore hopes tools-db doesn't use nfs xD
[20:47:14] yuvipanda: rows :P
[20:47:21] pfft :P what was in the rows?
[20:47:41] data? :P
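
(Coren's "cranked up dirty_ratio and dirty_writeback" at 20:06 refers to the kernel writeback knobs under /proc/sys/vm: letting more dirty pages sit in RAM spaces flushes further apart, which means fewer stalls but bigger flushes — the bursts-of-back-to-back-stalls tradeoff predicted at 19:56. A sketch of that tuning, with invented values since the log doesn't record what was actually set; needs root:)

    # Illustrative values only; the log does not say what Coren actually chose.
    settings = {
        "/proc/sys/vm/dirty_ratio": "40",                  # % of RAM dirty before writers block
        "/proc/sys/vm/dirty_background_ratio": "20",       # % dirty before background flushing starts
        "/proc/sys/vm/dirty_writeback_centisecs": "1500",  # flusher wakeup interval (15 s)
    }

    for path, value in settings.items():
        with open(path, "w") as f:   # requires root
            f.write(value)
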
[20:47:47] was expecting that :P
[20:47:47] it is a database after all ;p
[20:47:52] addshore: btw, see http://tools-webproxy.proxy.wmflabs.org/ :D
[20:48:10] O_o
[20:48:12] which is...?
[20:48:24] it's a replacement for instanceproxy
[20:48:35] eventually this will enable us to run webservers on the grid :D
[20:48:37] or any type of servers
[20:48:48] :>
[20:48:53] you can do <port>.<instance>.proxy.wmflabs.org
[20:49:09] that will be nice indeed
[20:49:12] so if you have something running on tools-exec05
[20:49:16] on port 9999
[20:49:27] 9999.tools-exec05.proxy.wmflabs.org
[20:49:38] addshore: even better, you can just dynamically set up different domains to alias to this :D
[20:49:44] so it'll be quite nice
[20:49:45] when done :)
[20:50:20] yuvipanda: Port 9999 via an HTTP proxy, or a 1:1 connection?
[20:50:38] scfc_de: proxy.wmflabs.org is running a nginx + lua thing
[20:50:43] no real IPs are passed through :)
[20:51:28] scfc_de: no 1:1 connections :)
[20:51:43] scfc_de: and this will all be behind keystone augh of some sort
[20:51:45] *auth
[21:14:18] (PS1) Yuvipanda: Add github-reciever script to receive webhooks from github [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/77553
[21:14:52] (CR) Yuvipanda: [C: 2 V: 2] Add github-reciever script to receive webhooks from github [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/77553 (owner: Yuvipanda)
[21:39:22] scfc_de you pinged me yesterday
[21:39:25] what did you need?
[21:39:30] my bouncer dced
[21:44:11] petan: I have *no* idea :-); probably just mentioned your name.
[21:46:44] scfc_de: any way at all, whatsoever, to do tool renaming?
[21:48:16] scfc_de: also can you grep the error logs for what error http://tools.wmflabs.org/gerrit-to-redis/cgi-bin/github-receiver.py is experiencing?
[21:51:11] yuvipanda: I don't think tools can be renamed. Re error log, moment.
[21:54:18] "[Sat Aug 03 21:46:57 2013] [error] [client 10.4.1.89] File "/data/project/gerrit-to-redis/cgi-bin/github-receiver.py", line 15, in "
[21:54:23] "[Sat Aug 03 21:46:57 2013] [error] [client 10.4.1.89] [Errno 2] No such file or directory: '/data/project/gerrit-to-redis/config.yaml'"
[21:54:40] That's all of $(grep gerrit-to-redis)
[21:55:41] alright
[21:55:51] yuvipanda: In between that, in error.log:
[21:55:58] [Sat Aug 03 21:46:57 2013] [error] [client 10.4.1.89] with open(CONFIG_FILE) as f:
[21:56:03] [Sat Aug 03 21:46:57 2013] [error] [client 10.4.1.89] IOError
[21:56:03]
[21:56:12] right
[21:56:17] let me find it
[21:56:21] ty scfc_de
[21:58:12] scfc_de: can you look again?
[21:59:47] yuvipanda: Still: File "/data/project/gerrit-to-redis/cgi-bin/github-receiver.py", line 15, in
[21:59:51] with open(CONFIG_FILE) as f:
[21:59:54] IOError
[21:59:56] :
[22:00:00] hmm
[22:00:02] [Errno 2] No such file or directory: '/data/project/gerrit-to-redis/cgi-bin/../config.yaml'
[22:00:06] Premature end of script headers: github-receiver.py
[22:01:28] hmmm
[22:01:32] Ah, you have doubled gerrit-to-redis in the path to config.yaml.
[22:01:49] aaaaah, no that's not it
[22:01:52] it's using a relative path
[22:02:00] and i've got the order wrong
[22:02:13] &ping
[22:02:13] Pinging all local filesystems, hold on
[22:02:14] Written and deleted 4 bytes on /tmp in 00:00:00.0006110
[22:02:15] Written and deleted 4 bytes on /data/project in 00:00:00.0066740
[22:02:22] hmm, just my network then
[22:03:53] scfc_de: grep again?
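
(For the <port>.<instance>.proxy.wmflabs.org scheme described above, the real implementation is the nginx + lua thing yuvipanda mentions; purely to illustrate the routing rule, the hostname parsing might look like this:)

    def backend_for(host):
        """Map '9999.tools-exec05.proxy.wmflabs.org' -> ('tools-exec05', 9999)."""
        labels = host.split(".")
        if len(labels) != 5 or labels[-3:] != ["proxy", "wmflabs", "org"]:
            raise ValueError("not a proxy hostname: %r" % host)
        port, instance = labels[0], labels[1]
        return instance, int(port)

    print(backend_for("9999.tools-exec05.proxy.wmflabs.org"))  # ('tools-exec05', 9999)
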
[22:07:25] [Sat Aug 03 22:03:37 2013] [error] [client 10.4.1.89] File "/data/project/gerrit-to-redis/cgi-bin/github-receiver.py", line 20, in
[22:07:31] PREFIX = config['github_receiver']['redis_prefix']
[22:07:34] KeyError
[22:07:39] 'github_receiver'
[22:08:46] ah, nice :)
[22:08:48] let me fix that
[22:08:59] scfc_de: so the bug was with how I was resolving symlinks
[22:09:22] I was doing join(resolve(dirname)) when it should've been join(dirname(resolve))
[22:09:30] since cgi-bin is not a symlink
[22:10:05] !tools-help
[22:10:05] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help
[22:15:07] (PS1) Yuvipanda: Fix symlink resolution bugs with finding BASE_PATH [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/77556
[22:15:34] (CR) Yuvipanda: [C: 2 V: 2] Fix symlink resolution bugs with finding BASE_PATH [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/77556 (owner: Yuvipanda)
[22:16:41] Looking at http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=labstore3.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1375561002&g=load_report&z=large&c=Labs%20NFS%20cluster%20pmtpa the NFS stalls seem to have become less frequent, while *not* increasing in length.
[22:17:40] &ping
[22:17:40] Pinging all local filesystems, hold on
[22:17:41] Written and deleted 4 bytes on /tmp in 00:00:00.0007950
[22:17:42] Written and deleted 4 bytes on /data/project in 00:00:00.0068620
[22:17:53] hmm, but I do get intermittent hangs (like right now)
[22:17:58] but without any change in /data/project
[22:18:04] my git pull origin master is still hung
[22:18:56] also not my network, since I can ctrl-c
[22:19:13] * yuvipanda does an strace
[22:19:20] aaaah, this is gerrit being slowasfuck :P
[22:19:20] ok
[22:36:00] :>
[22:36:13] gawd yuvipanda, don't automatically presume it's labs' fault! xD
[22:36:20] :P
[22:41:56] scfc_de: That's actually a useful datapoint; it does seem to indicate that whatever goes wrong does so during /bursts/ of traffic, but not during relatively sustained low-level traffic.
[22:43:27] scfc_de: At any rate, it means the stalls have returned to "bearable annoyance" level rather than "omg kill me now".
[22:43:45] <:
[22:44:05] But that won't last as more people join the labs.
[22:47:24] Coren: Yep, you still have some work to do :-).
[22:48:22] Right now, the work I have to do is to pack for a ~17h flight to HKG
[22:48:40] No Wifi on the plane? :-)
[22:48:50] Mine is like 4 hours :D
[22:48:52] bearable
[22:48:52] woohoo :P
[22:48:56] I doubt there is Wiki in the middle of the pacific. :-)
[22:49:00] Wifi*
[22:49:02] or 5, not sure.
[22:49:31] * Coren is _so_ glad he isn't doing that in coach.
[22:49:40] ah, I guess you win there :P
[22:50:29] Don't they use satellites for uplink? BTW, 17 h non-stop?
[22:51:29] scfc_de: No, I saved some $400 for each of us by flying out of Toronto instead; so it's 1.2h "almost-a-bus" to YYZ followed by 15.8h to HKG
[22:53:08] 15.8h is still a pretty long time :-).
[22:53:25] * yuvipanda 's return flight from SF was totally ~32 hours last time
[22:53:28] I think the word you were looking for is "damn" long. :-)
[23:30:16] NOOO | wiki | zh | 0 | 2005年12月逝世人物列表 | 15 | 2013-07-25 16:03:37 | NULL |
[23:30:16] | wiki | zh | 0 | 2005年1月逝世人物列表 | 16 | 2013-07-25 16:03:38 | NULL |
[23:30:16] | wiki | zh | 0 | 2005年2月逝世人物列表 | 16 | 2013-07-25 16:03:38 | NULL |
[23:30:16] | wiki | zh | 0 | 2005年3月逝世人物列表 | 16 | 2013-07-25 16:03:38 | NULL |
[23:30:16] :<
[23:30:22] wait, that appears fine here!
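
(The BASE_PATH bug yuvipanda describes at 22:09, concretely: resolving symlinks on the directory name is not the same as resolving the script's own path and then taking its directory, and with the wrong order the relative '../config.yaml' was resolved against the wrong tree. A sketch of the two orderings — the file layout is taken from the error messages above, the rest is illustration:)

    import os

    # buggy order: realpath() of the directory, then join; this misses the
    # case where the script path itself is reached through a symlink
    bad_base = os.path.realpath(os.path.dirname(__file__))

    # fixed order: resolve the script's full path first, then take its directory
    BASE_PATH = os.path.dirname(os.path.realpath(__file__))
    CONFIG_FILE = os.path.join(BASE_PATH, "..", "config.yaml")
    # with the bad order, CONFIG_FILE pointed where config.yaml did not exist,
    # producing the IOError / "No such file or directory" in the error log
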
[23:31:46] urgh, okay, apparently the bot struggles with zh...
[23:31:58] yuvipanda: I almost thought redis was broken then ;p
[23:32:08] heh :P
[23:32:09] ok
[23:46:09] :< i don't know what's happened!
[23:46:29] (PS1) Yuvipanda: Add support for registering for GitHub rsubscriptions too [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/77559
[23:52:29] (PS2) Yuvipanda: Add support for registering for GitHub rsubscriptions too [labs/tools/gerrit-to-redis] - https://gerrit.wikimedia.org/r/77559
[23:59:02] scfc_de: we should set up https://github.com/nicolasff/webdis
[23:59:08] since redis-cli isn't availab
[23:59:09] le
[23:59:10] eithe
[23:59:12] r
[23:59:19] from tools-login
[23:59:23] brb
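
(webdis, suggested above, exposes redis over HTTP: the command is the URL path and the reply comes back as JSON, so basic stats would be reachable from tools-login with nothing but an HTTP client. A sketch against webdis' default port 7379 — running it on tools-redis, and that hostname being reachable, are assumptions:)

    import json
    import urllib2  # Python 2 of the era; urllib.request on Python 3

    def webdis(*command):
        """Send a redis command to a webdis endpoint, e.g. webdis('DBSIZE')."""
        url = "http://tools-redis:7379/" + "/".join(command)
        return json.load(urllib2.urlopen(url))

    print(webdis("PING"))    # {"PING": [true, "PONG"]}
    print(webdis("DBSIZE"))  # {"DBSIZE": <number of keys in db 0>}
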