[07:43:59] akosiaris: thanks for the log, is it possible to widen the time span? specially log for May 3 and 4
[07:44:29] Amir1: oh I thought you needed only apr 30 based on the task subject
[07:44:33] lemme see what I can do
[07:44:40] thanks :)
[07:45:07] we did a deploy in April 30, so the key error should not happen afterwards
[07:46:02] Amir1: https://phabricator.wikimedia.org/P3000#13554
[07:46:39] awesome
[07:46:41] thanks
[07:46:50] are you doing subtransactions by any chance ?
[07:46:59] all those there is already a transaction in progress make me think that
[07:47:10] postgres does not support that
[07:47:10] I'm not sure
[07:47:26] per what I saw we don't
[07:47:30] but let me check
[07:48:20] akosiaris: we fixed the upsert issue but still we face the intermittent performance hiccups
[07:50:02] did you by any chance fix it on May 2nd ?
[07:50:15] cause I am looking at https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&h=labsdb1004.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=MySQL+eqiad
[07:50:25] we did it in April 30
[07:50:31] and since May 2nd things are better than what they used to be
[07:51:22] I think that's because the load is lower on the server
[07:52:17] well, there aren't any real users on that server except you, so it might be related
[07:53:12] technically it's a shared server, but from what I gather from the databases and connections on it the main user is wikilabels
[07:53:18] I mean there wasn't anyone using our server in those days
[07:54:03] it was me labeling at 30 April, and we had some users labeling at 2 May
[07:59:21] any reason why your server has multiple connections state as idle in transaction since Apr 30 ?
[08:00:26] many are for some reason idle in the COMMIT command
[08:01:24] there is even one stuck doing SELECT 1; since 2016-04-30 04:21:05.884077+00
[08:03:00] akosiaris: https://github.com/wiki-ai/wikilabels/blob/master/wikilabels/database/db.py#L43
[08:03:26] that's the one which does "SELECT 1"
[08:04:58] well, it's somehow stuck since Apr 30
[08:05:14] somehow it's shooting itself in the foot...
[08:05:31] hmm
[08:05:43] let me see how this method is being called
[08:05:49] you can do a select * from pg_stat_activity and see it btw
[08:06:44] thanks
[08:06:57] it's being called in each and every transaction
[08:07:02] so it might be the reason
[08:07:13] conn = self.get_good_connection()
[08:07:41] one interesting thing that I see there are 30 connections more than 3 days old
[08:07:56] some idling, some idle in transaction
[08:08:35] maybe they are connections that got harakiri
[08:08:55] doesn't harakiri kill the process ?
[08:09:17] I can probably trace it down to the TCP connection
[08:09:20] lemme see
[08:10:01] postgres 19694 postgres 12u IPv4 141854917 0t0 TCP labsdb1004.eqiad.wmnet:postgresql->10.68.17.198:60950 (ESTABLISHED)
[08:10:04] there you go
[08:10:31] VM with IP 10.68.17.198 probably has a process using ephemeral port 60950 connecting to postgres
[08:11:52] and here's the culprit on the VM's side
[08:11:54] www-data 16514 0.0 0.9 160860 37756 ? Ss Apr30 0:51 /usr/bin/uwsgi --die-on-term --autoload --ini /etc/uwsgi/apps-enabled/wikilabels-web.ini
[08:12:49] btw, I see multiple processes all using port 60950 to connect to postgres
[08:13:09] how is wikilabels connection handling structured ?
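A minimal sketch of the pg_stat_activity check suggested above, for anyone reproducing this: it lists connections that have been open for more than a day, with their state and last query. The psycopg2 connection parameters are placeholders, not the real labsdb settings.

    import psycopg2

    # Placeholder connection settings; point these at the postgres host being inspected.
    conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres")
    conn.autocommit = True  # keep the monitoring query itself out of a transaction

    with conn.cursor() as cur:
        cur.execute("""
            SELECT pid, usename, state, backend_start, query
            FROM pg_stat_activity
            WHERE state LIKE 'idle%'
              AND backend_start < now() - interval '1 day'
            ORDER BY backend_start
        """)
        for pid, user, state, since, query in cur.fetchall():
            print(pid, user, state, since, query)

    conn.close()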
[08:13:20] let me show
[08:13:35] https://github.com/wiki-ai/wikilabels/blob/master/wikilabels/database/db.py
[08:13:43] that's the db
[08:14:10] https://github.com/wiki-ai/wikilabels/blob/master/wikilabels/database/labels.py#L25
[08:14:16] that's an example
[08:15:01] akosiaris: Each and every worker has their own connection to the server
[08:15:21] is it okay?
[08:15:30] yes it is the good way to design this
[08:15:51] but that does not correlate with the reuse of port 60950 by all the uwsgi processes
[08:15:59] root@wikilabels-01:~# lsof -i -n -P |grep 60950
[08:15:59] uwsgi 7747 www-data 23u IPv4 11091417 0t0 TCP 10.68.17.198:60950->10.64.37.8:5432 (ESTABLISHED)
[08:15:59] uwsgi 11049 www-data 23u IPv4 11091417 0t0 TCP 10.68.17.198:60950->10.64.37.8:5432 (ESTABLISHED)
[08:15:59] uwsgi 14234 www-data 23u IPv4 11091417 0t0 TCP 10.68.17.198:60950->10.64.37.8:5432 (ESTABLISHED)
[08:15:59] uwsgi 16514 www-data 23u IPv4 11091417 0t0 TCP 10.68.17.198:60950->10.64.37.8:5432 (ESTABLISHED)
[08:16:01] for example...
[08:16:33] so I am a bit confused by that
[08:18:16] akosiaris: so you are saying each process should use a different port
[08:18:17] ?
[08:18:48] if every process was starting the connection it would happen on its own, that is what I am saying
[08:19:28] cause the Operating System would be assigning an ephemeral port automatically to every process
[08:19:40] this looks like there is connection pooling instead
[08:19:46] yup
[08:19:51] they use a connection pool
[08:20:08] https://github.com/wiki-ai/wikilabels/blob/master/wikilabels/database/db.py#L63
[08:20:38] ah, so they share connections instead of every worker having its own connection to the server
[08:21:05] which is not necessarily bad, but has these caveats we just noticed
[08:21:10] yeah
[08:21:13] my bad
[08:21:48] I am not really sure how uwsgi works here to express an informed opinion
[08:22:03] I 'll have to read up a bit first to understand what's going on
[08:22:18] and all in all, it might not even be the reason for whatever problems we see
[08:22:43] excluding the 6 long alive db connections that is
[08:22:51] 6 days*
[08:23:46] hmm
[08:23:53] I'm not sure
[08:24:28] but I think the system that designed to get a good connection
[08:24:35] is not very optimum
[08:24:59] https://github.com/wiki-ai/wikilabels/blob/master/wikilabels/database/db.py#L43
[08:25:04] I don't even understand why you needed that
[08:25:23] was it so often that a DB connection would fail ?
[08:26:30] I haven't seen anything about failed connections in logs
[08:26:34] and I read lots of logs
[08:38:11] let me disable it for a while
[08:39:17] akosiaris: btw. https://gerrit.wikimedia.org/r/#/c/280403/
[08:39:33] i had to change it so it would be compatible with https://gerrit.wikimedia.org/r/#/c/284418/
[08:39:52] (which is not merged yet but cherry-picked into the beta cluster)
[08:48:50] akosiaris: I would appreciate if you comment on this: https://github.com/wiki-ai/wikilabels/pull/119
[08:48:52] thanks
[09:00:39] hmm
[09:01:42] so connection in postgres are in autocommit by default... I am wondering whether that get_good_connection had some weird side effect causing that log error ... not sure though
[09:03:39] akosiaris I want to try something then check the status of the log
[09:05:48] Can you check if the db long connections just increased by 1 ?
[09:11:03] GhassanMas: nope. they are the same
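One plausible mechanism for those "idle in transaction" sessions, tied to the autocommit remark above: the postgres server defaults to autocommit, but psycopg2 does not, so the first statement on a connection implicitly opens a transaction that stays open until commit() or rollback(). A liveness ping like the "SELECT 1" in get_good_connection can therefore pin a transaction if nothing ever ends it. A minimal sketch of the pitfall and two common ways around it; the connection parameters are placeholders, not the actual wikilabels configuration:

    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="wikilabels", user="wikilabels")  # placeholders

    # Pitfall: with psycopg2's defaults this implicitly issues BEGIN, and if nothing
    # commits or rolls back afterwards the backend sits in "idle in transaction".
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        cur.fetchone()

    # Two ways to keep a ping from pinning a transaction:
    conn.rollback()          # 1) explicitly end the implicit transaction, or
    conn.autocommit = True   # 2) run pings outside transactions entirely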
[09:13:29] thanks akosiaris
[09:13:30] :)
[09:13:45] Amir1: halfak_: btw, I 'd like to reboot the box to get the new kernel updates applied
[09:14:02] for wikilabels?
[09:14:15] halfak is asleep now
[09:14:27] no I mean the database box
[09:14:41] it will mean wikilabels will not be able to connect to the database for about 2-3 mins
[09:14:47] okay
[09:14:59] let me send an email to ai-l
[09:19:45] akosiaris: did the long started in 28 April
[09:20:57] GhassanMas: the long standing queries ? I don't know that. The oldest one I see is dated Apr 30 if that helps
[09:21:27] akosiaris: sent the email
[09:21:30] :)
[09:21:43] (I cc'd you too)
[09:22:47] akosiaris: tell me when it's down, so I restart the uwsgi service
[09:23:22] Amir1: ok doing it now
[09:25:15] alright, I thought about that wikilabels may be the cause, if no new long standing db requests that may not be the cause
[09:27:54] Amir1: we are up and running
[09:28:00] thanks
[09:28:01] er, the db box I mean
[09:28:18] let restart the uwsgi
[09:29:43] it's restarted
[09:29:53] but db connections are super slow
[09:29:56] yeah I can see the connections
[09:30:07] https://labels.wmflabs.org/campaigns/fawiki/?campaigns=stats
[09:30:12] this one was fast
[09:30:21] and now it's much faster
[09:30:27] okay. We are up
[09:30:30] thanks
[09:30:32] probably the connection pooling code
[09:33:05] akosiaris: psycopg2.DatabaseError: error with no message from the libpq
[09:33:18] and that's fixed now
[09:34:30] thanks
[13:39:04] o/
[13:39:10] Hey Amir1
[14:44:37] * halfak finished emails in 2 hours this morning!
[14:44:38] woo
[14:44:44] OK now on to hacking.
[14:44:53] mwparserfromhell is going to get a bunch of tests.
[14:45:11] * YuviPanda has halfak's review scheduled for today
[14:45:46] :D I just finished all of mine yesterday in a mad rush.
[16:31:22] halfak: o/
[16:31:51] Yo!
[16:31:54] In meeting more.
[16:31:56] *mode
[16:32:05] I suddenly fell asleep
[16:32:06] Got a little bit of work done with mwparserfromhell
[16:32:08] :(
[16:32:11] anyway
[16:32:18] halfak: it's awesome
[16:32:22] Should sleep if you need it!
[16:32:47] Slept enough, I sleep more when I'm under stress for visa
[16:33:26] or I'm under stress in general
[16:34:11] halfak: anyway, we talked with alex about the intermittent issue
[16:34:33] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-ai/20160505.txt
[16:34:44] and made some changes with wikilabels
[16:43:19] Great!
[16:43:23] * YuviPanda shall suggest moving to mysql as the unfortunately realistic solution
[16:43:28] Here's what I learned re. mwparserfromhell
[16:43:29] https://gist.github.com/halfak/723af5446332b29e09144646dd1a20f2
[16:43:45] YuviPanda, not totally crazy. probably a direction we should go eventually.
[16:43:59] Assuming that this will eventually become a mediawiki extension
[16:44:07] * YuviPanda nods
[16:44:11] * halfak is sad because postgresql > mysql
[16:44:23] But support >> non-support
[16:44:25] indeed. such is life.
[16:44:58] Amir1, expect a PR soon that'll make most of our interactions with mwparserfromhell a bit faster.
[16:45:08] \o/
[16:45:18] halfak: have you read ai-l for today?
[16:45:37] I saw the note about short downtime
[16:45:45] Nice work on making sure people know about that
[16:45:46] we restarted the db for wikilabels so it was down
[16:45:54] and another thing
[16:46:14] https://github.com/wiki-ai/wikilabels/pull/119
[16:46:17] halfak: ^
[16:46:30] tell me when you have a PR ready :)
[16:47:35] btw. tell me if you need the script for feature importance and regex performance test
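For the record, the shared ephemeral port 60950 and the "error with no message from the libpq" earlier in the log are consistent with a psycopg2 connection (or pool) being created before uwsgi forks its workers: every worker then inherits the same socket and their traffic interleaves on it. A minimal sketch of the usual workaround, creating the pool lazily in each worker process; the names and parameters here are illustrative, not the actual wikilabels db.py:

    import os
    from psycopg2.pool import ThreadedConnectionPool

    _pool = None
    _pool_pid = None

    def get_pool():
        """Create the pool lazily, and re-create it after a fork so that
        each worker process talks to postgres over its own sockets."""
        global _pool, _pool_pid
        if _pool is None or _pool_pid != os.getpid():
            _pool = ThreadedConnectionPool(
                minconn=1, maxconn=10,
                host="localhost", dbname="wikilabels", user="wikilabels")  # placeholders
            _pool_pid = os.getpid()
        return _pool

    def run_query(sql, params=None):
        pool = get_pool()
        conn = pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                rows = cur.fetchall()
            conn.commit()  # end the implicit transaction before returning the connection
            return rows
        except Exception:
            conn.rollback()
            raise
        finally:
            pool.putconn(conn)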
[16:49:48] get_good_connection resulted from us losing DB connection and not being able to get it back.
[16:50:00] It was taking down wikilabels every other day.
[16:51:26] Amir1, ^
[16:51:48] maybe it was an issue in other front
[16:51:53] *another
[16:52:02] let's talk to akosiaris
[16:52:05] :)
[16:52:49] https://github.com/wiki-ai/wikilabels/commit/0e81b112549095cdfc7fbb5e49a141d8091ac149
[16:53:02] +1 for not using my hacky solution, but it would be good to know why it was happening.
[17:19:27] halfak can I check the logs by myself ?
[18:48:48] wiki-ai/revscoring#708 (parse_filter_faster - bdd5977 : halfak): The build passed. https://travis-ci.org/wiki-ai/revscoring/builds/128118383
[18:56:21] GhassanMas, which logs are you looking for?
[18:56:58] Which comes from db.py
[18:57:09] For wikilabels. Sure.
[18:57:18] Do you have an account on wikitech.wikimedia.org?
[19:03:04] I will set one
[19:04:37] I forgot i had one ! :)
[19:05:34] Where to go from wiki tech
[19:05:50] Amir1, https://github.com/wiki-ai/revscoring/pull/271
[19:06:09] Both our mean and median extraction timings are below 0.5 seconds!
[19:06:10] \o/
[19:06:10] awesome halfak
[19:06:31] This dropped a *half second* from our longest extraction timings.
[19:06:37] {{done}}
[19:06:37] You rule, Amir1!
[19:06:43] thanks
[19:06:46] :D
[19:15:29] GhassanMas, Amir1 will help you get set up to look at the logs.
[19:15:32] I have to run right now.
[19:15:38] I'll be back in 1-2 hours.
[19:17:51] Alright
[19:18:49] I'll b online in five min.
[19:35:37] GhassanMas: Can you tell me your wikitech account name?
[19:39:32] Sorry amir1 i was afk, its ghassanmas
[19:42:02] okay
[19:42:26] GhassanMas: do you want logs of wikilabels?
[19:42:31] is that so?
[19:42:42] Yup
[19:44:44] it's super slow
[19:45:56] GhassanMas: You are a member now
[19:45:57] https://wikitech.wikimedia.org/wiki/Help:Access
[19:45:59] read this
[19:46:10] connect to wikilabels-01.wikilabels.eqiad.wmflabs
[19:47:14] Alright, thank you Amir1
[21:40:32] halfak: https://phabricator.wikimedia.org/T132624#2266752
[21:40:39] (for when you're around
[21:40:48] I go try to get some sleep
[21:40:53] or watch a movie
[22:35:02] halfak: ^
[22:36:28] Awesome. I can help make sure the materials are all in place.
[22:36:37] Maybe we should try to recruit Helder to help out.
[22:36:40] I wonder if he is coming.
[22:36:49] DarTar might be able to help too
[22:38:06] awesome
[22:42:37] I'll bring that up with Dario
[23:06:58] Halfak, is any user in Wikipedia able to request a workset?
[23:07:39] If the labeling version of wiki label is available in their language
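On the get_good_connection history above: since the helper existed only because a dropped connection used to take wikilabels down, a common alternative is to skip the preemptive ping and instead reconnect when an operation actually fails. A minimal sketch of that pattern, purely illustrative rather than the code in the commit linked above:

    import psycopg2

    def run_with_retry(connect, sql, params=None):
        """Run a read query, reconnecting once if the connection turns out to be dead.
        `connect` is any zero-argument callable returning a fresh psycopg2 connection."""
        conn = connect()
        for attempt in (1, 2):
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    rows = cur.fetchall()
                conn.commit()  # end the implicit transaction right away
                return rows
            except psycopg2.OperationalError:
                if attempt == 2:
                    raise
                conn = connect()  # the old connection was dead; open a new one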