[02:24:18] halfak: haha was at the office and having dinner with people. Turned into a bitchfest about how all language specific package managers suck for production things [13:00:20] YuviPanda, yeah. I figured that pip wasn't alone. [14:32:09] o/ harej, around? [14:32:27] I'm trying to figure out when we'll hear back about WCUSA submissions so I can plan travel. [14:32:49] halfak: hi [14:32:55] I just got your email [14:32:57] Hey YuviPanda [14:32:57] Cool. [14:33:07] Very early in the day for me atm [14:33:16] I'll read again i n a few hours [14:34:57] kk [14:35:17] I think we have some simple tests that we can run. :) [14:39:37] halfak: ya if we have a reproducible use case... [14:39:51] We can worst case strace it to figure out wtf happens [14:40:06] Gotcha. You'l like me to see if I can reproduce it in the ORES VM? [14:40:26] I don't know when the program will be published. [14:40:30] halfak: ya [14:40:37] halfak: well or even in prod [14:40:38] Is OK [14:40:40] I think that we might need several VMs because the problem doesn't seem to happen on staging -- where redis, ores, celery workers are on the same machione. [14:40:44] Prod is best maybe [14:41:05] I can turn off my ghetto restart cron and we can reproduce :\ [14:41:35] But not intermittently but exactly? [14:45:47] Oh. Yeah. [14:45:57] Wait... you want to *cause the problem in prod? [14:46:12] YuviPanda, ^ [14:46:39] Yes [14:46:55] So I can attach a debugger to things [14:47:02] And see what is happening [14:48:27] What do you think of that TCP keep-alive setting? [14:52:14] We could try it but without actually figuring out what exactly was going wrong we won't have peace of mind [14:52:30] But yeah feel free to try it! It might just work [14:59:28] * halfak feels weird about experimenting in prod [14:59:46] We need some monitoring too. [15:00:08] Yes [15:00:17] Brb [15:04:21] Either way, I think it is best if I follow your lead for solving this issue. I don't have much experience in the "prod system goes down intermittently" stuff. [15:04:30] What step would *you* do first? [15:05:35] halfak: I'd find a way to repro it with a revision / URL [15:05:50] That's not it. I'm pretty sure. [15:05:59] It only happens in the prod environment. [15:06:12] I have a precached running against staging too. [15:06:21] I mean [15:06:35] Find a particular URL or revision that reproduced it in the prod environment? [15:06:41] Basically I want to catch it red handed [15:06:54] Yeah. I don't think that it is processing a particular revision that is causing it. [15:07:05] But I guess I can try. [15:07:21] There's no error that suggests that this is the case. [15:07:21] Ok [15:07:35] Didn't we have thrown exceptions before them stopping earlier? [15:07:52] Yeah, throwing exceptions is a perfectly normal thing for a job to do. [15:10:12] I've got to bike towards the University for a bit. I'll have a think and come back with a more coherent discussion of what I think we know. [15:10:28] I'll be AFK for ~ an hour. [15:10:29] o/ [15:10:29] halfak: ok [15:10:32] halfak: me too [17:17:31] o/ harej, I hear that wikidata for citations was a big topic at the Wikipedia Science conference. [17:17:41] were people talking about librarybase? [17:19:35] I don't know. The conference ended the same day you showed that to me/us. [17:19:45] DarTar is talking about this at me right now. [17:19:50] So you might want to ping him :) [17:19:56] Can you give me a link [17:20:01] ? [17:20:03] I want to show him [17:20:11] DarTar is familiar with it. [17:20:15] Nevermind. he know what's uyp [17:20:15] http://librarybase.wmflabs.org [17:20:19] :D [17:20:29] I'm glad that it's not parallel initiatives [17:24:57] We were discussing it on the tweeter. [17:25:04] People seem to think that we do not need items for URL-based citations. [17:25:06] They're wrooooong! [17:25:12] They are wrong. [17:32:30] Welp.. ORES is down again. [17:32:41] YuviPanda, just noticed while I was researching [17:35:20] Workers are online [17:35:35] Looks like the uwsgi nodes are reporting that they are out of sockets [17:35:41] * halfak restarts uwsgi [17:36:28] ores-staging is online :S [17:36:33] I'm going to redirect the proxy. [17:36:40] hey [17:36:51] well that's probably because it doesn't get the same level of traffic [17:36:51] Redirecting the proxy a good idea? [17:36:57] redirecting traffic to it would kill that [17:37:01] did you restart uwsgi? [17:37:03] looking [17:37:17] especially if it is because it is out of sockets [17:37:20] Yu[. [17:37:22] That got it. [17:37:41] ok, so the restart made it work? [17:37:52] Yeah [17:38:39] ok [17:38:45] we need to add another web node in that case [17:38:52] I think that the problem is that the workers stop accepting tasks and the precached doesn't rate-limit. [17:38:52] and also increase per-process socket limits [17:39:02] I don't think that is it. [17:39:04] ah, 'back pressure' [17:39:07] Yeah. [17:39:35] But still, it's not like we're overloaded. Out pipe stops draining and fills up fast. [17:39:59] I guess we should stop filling in tasks when celery isn't accepting tasks [17:40:21] Yeah... The workers still accept tasks though. I think that's the problem. [17:40:27] They just don't do anything with them. [17:40:38] I think that the workers are still online, but they can't read tasks from redis. [17:41:34] Is there anything substantially different from how ORES uses redis/celery and quarry [17:41:38] ? [17:41:45] quarry gets far less traffic [17:42:13] and is probably on ancient versions [17:43:08] Still.. Is it using redis as a backend and broker? [17:43:16] yeah [17:43:16] I can't seem to find the worker application file [17:43:19] https://github.com/wikimedia/analytics-quarry-web [17:43:23] Where is it configured? [17:43:37] halfak: https://github.com/wikimedia/analytics-quarry-web/blob/master/quarry/web/worker.py [17:43:46] halfak: and https://github.com/wikimedia/analytics-quarry-web/blob/master/quarry/default_config.yaml [17:44:13] CELERY_RESULT_BACKEND: 'redis://' [17:44:21] That means celery is running on localhost, right? [17:44:34] ah so that's the default config [17:44:42] the actual config is 'secret' since it has the OAUTH keys [17:44:54] and in that it is redis://quarry-main-01.eqiad.wmflabs [17:45:11] Can you censor and send me the config [17:45:11] ? [17:45:17] yeah am digging for it now [17:45:21] (don't have it locally) [17:45:27] Thanks [17:46:31] * halfak is very unnerved by how fragile our service setup is. :\ [17:48:07] I'm even more unnerved by the fact that restarting the workers on an hourly rotation doesn't solve these types of issues. [17:48:40] https://www.irccloud.com/pastebin/0PMp3h0F/ [17:48:56] YuviPanda, do you have time to look into this right now? [17:49:08] I just realized I'm hammering you with pings and you've got other stuff to do [17:49:20] halfak: yeah, I'm going to look at this deeply for about 30mins and then head to office [17:49:35] well, at least the celery issue [17:49:43] halfak: but to be clear, ORES is up now, right? [17:49:53] Yes [17:49:58] ok [17:50:36] And FWIW, the recent downtime was happening even though the workers reported "online" in flower,. [17:50:48] The workers were *not* processing events at the time of downtime. [17:50:56] right [17:51:04] wait and restarting *uwsgi* fixed those? [17:51:08] yeah [17:51:12] hmm [17:51:13] right? [17:51:30] I dunno WTF that was. It ruins all my theories. [17:51:52] BTW, worker-01 restarts on the 00th minute of every hour [17:51:59] worker-02 on the 10th minute [17:52:06] worker-03 on the 20th minute, etc. [17:52:17] nothing suspicious in the redis logs [17:52:39] Some people were reporting a lost connection with redis in old bugs causing this. [17:55:04] I saw nothing about that in the worker logs. [17:55:23] halfak: http://tools.wmflabs.org/nagf/?project=ores has network spikes just now for all the things [17:55:26] The worker logs only report errors, so I saw an error that's perfectly acceptable at the end of the log every time. [17:55:58] Weird. [17:56:06] we should turn up logging a lot then [17:56:30] Could that be the restart I just did? [17:56:53] yeah that's possible [17:56:57] esp. since redis was flat before that [17:59:11] doing that now [17:59:16] (increasing loglevels [17:59:17] ) [18:00:39] Great. [18:00:41] ValueError: Pool restarts not enabled [18:00:51] I got that on the end of worker-02's log [18:04:49] I wonder if that was because I tried a restart from flower. [18:06:17] that's possible [18:09:16] halfak: so I"m seeing a bunch of [18:09:17] Sep 10 18:09:05 ores-worker-01 celery[7521]: AttributeError: 'NoneType' object has no attribute 'json' [18:09:31] hmm [18:09:37] i'm cranking the loglevel up on -worker-01 [18:09:41] Yeah. That's normal. [18:09:58] unhandled exceptions shouldn't be 'normal' IMO [18:10:01] makes stuff hard to debug [18:10:02] Regretfully, it's a shitty message. [18:10:08] It is handled. [18:10:11] It is reported to the user. [18:10:16] But I need the task to throw it. [18:10:24] Which is what celery does for me :) [18:10:33] And why my Exceptions need to be picklable. [18:10:53] (if you remember my frustrated rant from the other day) [18:11:16] https://wikitech.wikimedia.org/wiki/Hiera:Ores/host/ores-worker-01 is how to tweak loglevel on per-host basis btw [18:11:34] halfak: hmm, then they shouldn't show up in the exception log somehow, since it does making stuff hard to debug [18:11:38] ooo [18:11:40] What. [18:11:41] I've no idea which exceptionsa re 'expected' [18:11:43] and which aren't [18:11:44] Can anyone else this? [18:11:49] hmm? [18:11:54] see what? [18:11:55] *edit [18:11:57] no [18:11:59] k [18:11:59] only project admins [18:12:20] halfak: err, project admins for the ores project [18:13:15] halfak: need to find a solution to that long term. Quarry just puts an 'error' result with appropriate info there [18:13:19] YuviPanda, expected errors at that NoneType and CaughtDependencyError [18:13:38] YuviPanda, it's using celery in the intended fashion. [18:13:46] I don't see what problem there is to *solve* [18:13:49] well [18:13:52] problem is I'm looking at error logs [18:13:54] Oh! I can tell celery what kind of errors I expect. [18:13:56] and without talking to you [18:14:00] I've no idea which are 'expected' [18:14:03] and which are causing problems [18:14:08] so you are atm a spof [18:14:19] Sure. OK. Fair enough. Any non-startup error should not take down a worker. [18:14:36] i agree, but I don't know what class of error that is [18:14:37] And any worker that is down should automatically be restarted [18:14:49] You can tell from the logs that it is not startup. [18:14:49] I don't know which are thrown by celery's internal processes and which are used by you for reporting back to the web [18:14:58] well, it's crashing in the middle... [18:15:00] Still I'll get these error types fixed and registered. [18:15:06] yup, please do [18:15:07] In the middle? [18:15:10] open a task somewhere as well? [18:15:16] Middle of what? [18:15:17] like, in normal operations [18:15:23] in the middle of normal operations [18:15:34] basically, any exception in the logs should be treated as 'shit' [18:15:41] if it shouldn't be, it shouldn't be in the logs [18:16:17] YuviPanda, sure. [18:16:28] I agree. I just don't know why celery would log these at all. [18:16:34] s'fine though. I'll get 'em out. [18:16:40] thanks [18:21:51] So many pings from gerrit! [18:22:00] * halfak adds gerrit to an ignore list [18:22:01] heh [18:22:02] from 'ores' [18:22:07] yeah [18:22:20] Good thing we don't have many geologists in my IRC channels [18:23:23] https://github.com/wiki-ai/ores/issues/84 [18:23:29] " [18:23:29] Celery logs full of execptions." [18:24:50] halfak: ok, so worker-01 has incredibly more detailed logs now [18:24:57] halfak: where's the cronjob? I'd like to get rid of it in worker-01 [18:29:12] halfak: have disabled the cronjob on -01 [18:32:34] Found it in my personal cron? [18:32:50] halfak: yup [18:33:03] halfak: *lots* of very detailed logs [18:33:09] ha. [18:33:12] including the individual feature extractions that are being run [18:33:18] Yeah [18:33:23] GOod debugging for revscoring :) [18:33:29] indeed [18:33:38] halfak: now I guess we need to wait for it to 'lock up' [18:33:43] Yeah. [18:33:46] 01 is a good one. [18:33:55] Until recently, 04 never went down [18:34:05] for some reason. [18:34:31] hmm [18:34:41] Oh! I have an idea. [18:34:51] So, until recently, we were not doing IO in the workers [18:34:55] So, that might be related. [18:35:02] Our IO is an API call. [18:35:11] Maybe that API call gets into a bad state. [18:35:16] ooooh [18:35:20] does the API call have a timeout? [18:36:04] Ha! [18:36:15] Wait. no [18:36:21] it does. [18:36:27] what's the timeout? [18:36:33] Oh wait. Maybe not. [18:36:36] :D [18:36:45] are you using requests? afaik that doesn't have a default timeout [18:37:05] Yeah. That could be it. [18:37:20] This is a good time to switch to using mwapi. [18:37:33] I'll (1) add a timeout param to mwapi [18:37:57] (2) Update revscoring to use mwapi and throw a better error when the API returns an error (that's the NoneType) [18:38:09] (3) Update ores to reference the new version [18:38:16] (4) move this all to staging [18:38:25] (5) Ping YuviPanda [18:38:34] Sound good? [18:38:47] halfak: how long do you estimate that to take? would a shorter experiment be just adding the timeout= parameter to requests and see what happens? [18:39:17] There's not a good way to do that right now. [18:39:25] I'm using the api module of mediawiki-utilities. [18:39:33] But we could hack it. [18:40:00] * halfak thinks. [18:40:16] halfak: ah, ok. your call on that. if we diceide to do the switch to mwapi now, I"ll go get some food / lunch / move to office etc. if not I'll continue debugging now [18:40:53] So, if I were going to add the param to requests, I'd need a custom revscoring install. That shouldn't be too hard. [18:41:02] Should I do it on staging? [18:42:55] halfak: hmm, should we do it on -worker-01? :) [18:42:57] hm [18:42:59] I could also just add a default timeout to mediawiki-utilities [18:43:10] We could do it on 01 [18:43:21] if you think it's too messy we can just switch to mwapi [18:43:49] Not that messy. I'm more worried about playing in prod [18:44:04] But I'll follow your advice here. [18:44:16] I think I can do it by editing two lines in a simple way. [18:44:35] so I suggest we do that in worker-01 [18:44:38] and wait for it to die (or not) [18:44:43] well, it or some other one to die [18:44:54] OK. I'll get a version of mediawiki-utils ready. [18:45:00] ok [18:45:00] To install that is. [18:45:10] halfak: hmm, do you want to just hack it in /srv/ on worker-01? :) [18:45:16] I definitely don't want to touch pip [18:45:20] with a 10foot pole [18:45:24] after yesterda [18:45:24] y [18:45:34] na. I'll get the tar.gz and then push it up. [18:45:52] ok. just be careful :) [18:46:04] halfak: mind if I pop off for 10mins to get some nutrition? [18:46:32] Sure :) [18:46:33] hanks [19:01:06] * halfak runs the upgrade of mediawiki-utilities on -01 [19:02:02] * halfak restarts celery on -01 [19:03:05] halfak: am here now [19:03:21] OK. Change is implemented. The request now has a 20 second timeout. [19:03:27] ok [19:03:31] Only -01 has the upgrade. [19:03:47] It seems to be processing jobs nicely. [19:05:07] So, we should expect to see timeout errors appear in the logs every now and then now. [19:05:17] yah [19:05:22] I'll get back to work making ORES hide its errors better. [19:05:30] so I guess we just wait now? [19:06:15] Do you want to knock log-level back down on -01? [19:07:19] halfak: yeah, maybe to INFO [19:07:57] halfak: yeah, did so. [19:08:23] kk. [19:08:32] * halfak hacks. [19:08:55] halfak: do you know what Sep 10 19:08:30 ores-worker-01 celery[19314]: [2015-09-10 19:08:30,559: INFO/Worker-12] Resetting dropped connection: en.wikipedia.org is [19:09:10] Negative [19:09:21] http://stackoverflow.com/questions/25239650/python-requests-speed-up-using-keep-alive [19:09:23] fascinating [19:09:47] it's probalby ok [19:09:50] we do support keepalive [19:10:06] Yup [19:10:11] Using requests.Session() [19:10:17] that should've been in DEBUG tho [19:10:21] but requeset puts it in INFO [19:10:22] oh well [19:12:31] Requests puts a lot of stuff in info [19:12:34] -._o_.o [19:12:41] Woops [19:12:46] That guy is holding an apple [19:15:48] > Sep 10 19:10:26 ores-worker-01 celery[19314]: [2015-09-10 19:10:26,693: INFO/MainProcess] Task ores.score_processors.celery._score_task[enwiki:reverted:680417968:0.3.0] raised expected: RevisionNotFound('Could not locate revision. It may have been deleted.',) [19:15:52] halfak: ^ is nice :) [19:17:02] :D Much better than before [19:19:34] YuviPanda, you know a good way to simulate "takes forever to respond"? [19:20:19] halfak: we can drop the packets coming back from a place with iptables [19:20:53] not sure if that'll work tho [19:21:11] Do you know off-hand what I'd need to do to try? [19:21:18] I've never worked with iptables before [19:21:32] Oh wait! ConnectionTimeout is a type of timeout [19:21:40] So I could just use hosts to point to nothing [19:22:10] halfak: iptables -A OUTPUT -d -j DROP [19:24:03] I was able to get a ConnectionTimeout by just throwing a random IP in place of the mediawiki host [19:24:15] ok [19:24:20] And I can show how it hangs when I don't set a timeout. [19:24:21] Cool. [19:24:26] PR for mwapi soon. [19:24:30] ok [19:25:44] for mwapi, do you think our library user should need to catch requests errors [19:25:53] or should we handle them and throw our own errors? [19:27:38] halfak: hmm, good quesetion. [19:27:48] halfak: I think we should wrap them in a connection error of sorts [19:28:45] OK. [19:50:14] halfak: going to go afk for a bit now [19:51:10] kk. Thanks for your help today YuviPanda [19:51:16] I should have you a set of pull requests by EOD [22:00:34] YuviPanda, I think that the PRs are good to review. [22:00:46] I saw your comments here: https://github.com/mediawiki-utilities/python-mwapi/pull/16 [22:01:04] See also https://github.com/wiki-ai/revscoring/pull/184 [22:02:01] I ran some tests that show the current setup works with ores, but we'll want to make some minor modifications (e.g. we don't specify the full URL for the API anymore.) [22:02:25] With a little luck, we could get this in staging tonight. :) [22:02:34] And have better error logs to look at tomorrow. [22:03:23] halfak: I responded [22:06:31] Yeah... that doesn't really solve any of the problems except for how to print the wrapped error. [22:06:44] FWIW, requests seems to be wrapping urllib3 errors [22:06:48] :S [22:07:00] Probably for the same reasons I listed. [22:07:08] what do you mean by 'wrapping' [22:07:13] we might be referring to different things [22:10:20] Maybe. Requests catches urllib3 errors and raises its own in a similar way that my PR does. [22:14:27] halfak: hmm, actually - the more I think about it the more I feel like I should leave it in your hands - you've definitely been working with both python and mwapi way more than I have in the recent past [22:15:02] Hmm. I still like your pushback. I should have to justify these types of decisions. [22:15:18] I would have just merged if it was only the timeout that I needed to do. [22:16:08] the str(e) was based off of massive pain [22:16:11] from JS libraries [22:16:13] that'll just do that [22:16:21] and lose the stacktrace etc from the underlying setup [22:16:33] and so I've come to strongly believe that we should use the language's exception wrapping semantics [22:16:35] as much as possible [22:16:39] Ahh yeah. We won't lose the stack. [22:16:52] and python explicitly has exception wrapping semantics [22:16:53] Well... we will in python 2.7 [22:16:58] Only in python 3 [22:17:01] right [22:17:04] well [22:17:05] no [22:17:17] How about i look into the exception wrapping semantics before we merge. [22:17:49] halfak: http://www.ianbicking.org/blog/2007/09/re-raising-exceptions.html [22:17:50] ok [22:17:52] yes [22:17:54] But note that both the wrapped and wrapper will be reported on error. [22:18:22] right. my ultimate concern is that we don't try to hide the requests exception - it probably has useful info that we want callers to be able to see [22:18:48] Yeah. It won't get hidden. Really I should pass the args along. [22:19:00] I could actually just re-host the errors. [22:19:14] E.g. HTTPError = requests.exceptions.HTTPError [22:19:43] I suppose the class redefinition does that already. [22:20:18] Let me review the semantics and come back. [22:20:21] I want to see how it behaves. [22:20:35] ok [22:20:42] ultimately, I consider you mwapi maintainer now :) [22:23:12] :D [22:23:18] awight: think you'll have time this week to help move from pip to debs for everything? if not I can do most of it myself... [22:23:26] halfak: ^ you too, although I would like you to be around :) [22:23:29] Just so long as you keep pushing back on my crazier PRs [22:23:41] I'm also going to need to think about how we deploy revscoring as a packag.e [22:23:43] Totally [22:23:53]