[00:00:56] springle, nuria, milimetric, so let me start [00:01:23] I've been looking at the data and trying to match imported data with original data [00:01:50] I updated some code in the loading scripts and also in the verification scripts [00:02:07] the final code is here: https://gerrit.wikimedia.org/r/#/c/189532/ [00:02:13] (not merged yet) [00:02:37] (that's next on my list to review it) [00:02:57] oh, thanks :] [00:03:11] * springle reads gerrit 189532 [00:03:35] so, after importing the data with these adapted scripts, the checks work [00:03:39] no differences found [00:03:42] EXCEPT [00:04:19] the edit table changes with time, it seems edits get added in the past [00:04:39] so if I execute the verification script right after importing the data, it matches [00:05:02] but the more time I let pass between importing and verification, the more the two sets differ [00:05:18] and always the original table contains more edits [00:05:47] so, I deduce that edits get regularly added with timestamps in the past [00:06:10] I suppose this will be a difficulty in updating the warehouse [00:06:20] mforns: does this occur far in the past, or current month, or..? [00:06:28] current month [00:06:39] and it happens quite commonly [00:06:42] specially in dewiki [00:06:53] (I only testes dewiki and enwiki) [00:06:55] heh [00:06:57] *tested [00:07:04] well i'll be damned :) [00:07:05] idk what causes that. MW somehow [00:07:53] mforns: days in the past? hours? [00:08:01] let me execute the verification right now while we talk, and see what happens (I imported the data yesterday) [00:08:19] days, if I recall... [00:08:33] I'll give you details in a sec [00:09:17] something to do with bots or jobrunners possibly [00:10:16] but from what I remember, I executed the test for december 2014 and after 2 days from the import, several days in the same month had different values [00:10:23] also, both revision and archive changing? or just archive? [00:10:32] they're allowed to edit history? I didn't think we had a "rebase" in mediawiki :) [00:10:51] look, I executed the check right now: [00:10:52] +--------+---------------------+--------------------+----------------------+ [00:10:53] | wiki | date_with_bad_match | value_in_warehouse | value_in_original_db | [00:10:53] +--------+---------------------+--------------------+----------------------+ [00:10:53] | dewiki | 20141209 | 30970 | 30972 | [00:10:53] | dewiki | 20141210 | 32770 | 32777 | [00:10:54] | dewiki | 20141211 | 31602 | 31604 | [00:10:56] | dewiki | 20141212 | 30266 | 30267 | [00:10:58] | dewiki | 20141219 | 26229 | 26230 | [00:11:00] | dewiki | 20141225 | 20263 | 20264 | [00:11:02] | dewiki | 20141227 | 26755 | 26756 | [00:11:04] | dewiki | 20141231 | 27590 | 27608 | [00:11:06] +--------+---------------------+--------------------+----------------------+ [00:11:13] 8 days differ already in dewiki [00:11:27] right after importing, the values matched [00:11:41] * milimetric looks at the code with a double magnifying glass [00:11:48] milimetric: :) but maybe allowed to specify timestamps than get batch applied by an async job after N hours maybe [00:12:18] right, but months in the past seems weird [00:12:36] months? i thought we're talking days [00:12:56] that's december 2014 [00:13:38] mforns: any idea if it is revision or archive, or both, showing the extra records? [00:13:43] man .. 
live and learn...when i thought i knew something about the datamodel [00:13:49] this is this script, right: https://gerrit.wikimedia.org/r/#/c/189532/1/scripts/test-data-loading.sql [00:14:54] springle, the verification query gets revision and archive together, so I can not tell yet [00:15:07] but I can query them separately with some modifications [00:15:13] mforns: that's the script we're looking at here, right? ^ [00:15:27] milimetric, yes exactly [00:16:27] if we could make joins to "page" fast, we could exclude the "archived" pages and remove the "union all" to the source archive [00:17:19] springle: on the subject of making that join fast, have you played with the idea you mentioned in SF? Adding an index to each column, right? [00:17:32] milimetric: not yet [00:18:02] anecdotally it worked to get the pentaho cube visualizer thing faster queries (and it does similar dimensional stuff) [00:19:33] so, if we can identify expactly where the extra records are appearing, we can dump the binary logs and find the relevant statements, and trace it back to MW code [00:19:58] aha [00:20:15] I'm trying to execute the query only for archive [00:22:27] well, it may take some time... [00:24:26] springle: let me make sure i understand [00:24:38] we can talk about EL in the meantime [00:24:57] springle: so we do not think that the chnages appearing is "incorrect" [00:25:22] springle: rather those changes are not propagated also to the warehouse database [00:25:52] no, none of us know of any wizardry that would make edits months in the past [00:26:07] nuria: i sure don't know yet :) timestamps appearing the past is crazy, but... mediawiki [00:26:07] but anything is "possible" in the land of M and W [00:27:42] hm, if this was happening it should be in recent changes... [00:32:47] the query is running, just for archive [00:36:19] it is taking longer, because I had to add a join with the warehouse's page table, which is 'real size' [00:37:14] yeah, that's the join that we need to be fast [00:37:24] I see [00:37:26] but springle and nuria, let's talk about the EL problems as this works in the background [00:38:22] mforns: might be faster to add index to staging.mforns_warehouse_page.page_id first [00:38:37] oh! sure [00:39:23] milimetric: np [00:39:53] ok, I can maybe explain [00:40:06] so what we're seeing is that 2006 "mysql has gone away" error that you know of [00:40:24] we looked through all the code deploys and are very skeptical that any code change caused this to start happening [00:40:49] our previous experience from wikimetrics says that this error can happen when using SQL Alchemy and allowing one of the connections in the pool to go stale [00:41:17] springle, the index is already there [00:41:43] there's a (weak) hypothesis that when EL throughput ramps down for the night some of those connections get stale, and when they're used again as throughput ramps up, we see this exception [00:41:45] mforns: this is staging.mforns_warehouse_page ? [00:41:53] yes [00:42:03] but this is probably not true because our worker that inserts stuff is single threaded, so it would only ever really use one connection [00:42:18] mforns: i see only two indexes on staging.mforns_warehouse_page: [00:42:19] KEY `i1` (`wiki`,`page_id`), [00:42:19] KEY `i2` (`wiki`,`namespace`,`page_id`) [00:42:49] unless... sqlalchemy pools randomly grab connections out of the pool instead of always going in the same order... 
I'd have to read the implementation [00:42:55] mforns: wiki is first in both, but you have a range clause involving both de and en, which won't allow the page_id component of i1 to be used [00:43:00] anyway, there's a second hypothesis, way weirder: [00:43:43] mforns: try an index on (page_id,wiki) or just (page_id). there might be something else odd happening, but theoretically.. [00:44:01] insert 1 takes 2.5 seconds, meanwhile insert 2 is queueing up [00:44:12] insert 2 is now bigger, so it takes 3 seconds, meanwhile insert 3 is queueing up [00:44:16] springle, I see, so, one of the existing indexes is page_id, I'm not understanding [00:44:18] insert 3 takes 3.5 seconds, and so on [00:44:34] and eventually we just get a giant insert that causes that error [00:44:43] but how it causes that error, we don't know [00:44:55] (because we've only ever seen it when connections go stale) [00:45:24] springle, ok, I got it now [00:46:02] milimetric: the pattern on the master is: slow INSERT, then sleeping EL client connection, then socket abort, then *sometimes* an immediate second connect/abort around 20s later [00:46:42] how slow is slow springle ? [00:46:57] roughly 100-200 seconds [00:47:07] can a connection get stale in that time? [00:47:09] I thought it was hours [00:47:32] it _is_ hours. the master says "unknown error" [00:47:44] i guess we need to start packet sniffing [00:48:15] we reverted the code, btw, to the version that was running before we saw this problem [00:48:29] but there are a few problems with that - it's hard to tell exactly when this was since we don't have old logs [00:48:44] we only guessed based on the throughput of events in schemas we think should be relatively constant [00:48:58] milimetric: however if any slowdown could theoretically cause this for EL, we need to set a maximum batch size to avoid the potential for very slow inserts [00:49:12] so now if someone with sudo on vanadium can check the logs since this morning, we would see if the revert fixed the problem [00:50:02] yeah, springle but then we have to rethink the whole thing since this would mean that eventually the "buffer" would fill up and we would drop events [00:50:32] yep [00:51:07] (or make it multi-threaded) [00:51:11] mforns wisely suggested breaking up into multiple workers, one per schema, and those could be each their own thread. That would give us some parallelism [00:51:13] exactly [00:52:13] milimetric: actually it is easy to tell when it started happening by counting db events [00:52:31] milimetric: it is clear when the dropout happens [00:52:37] milimetric: so we know when it started [00:53:33] on Feb 4th the INSERT rate dropped like a stone on m2-master, and times went up. Yesterday sometime, it recovered, dipped again, then recovered [00:54:00] INSERT rate is around pre-Feb4th levels now [00:54:25] springle, milimetric which kind of completely refutes the buffer filling up theory [00:54:36] as teh inflow of events is identical [00:54:36] https://tendril.wikimedia.org/host/view/db1020.eqiad.wmnet/3306 [00:54:48] springle & nuria: sorry I'm a hardcore skeptic, I agree that it's very likely true that the 4th is when the problem started. But it's not something certain because we don't have logs to see whether the error showed up before [00:55:00] milimetric: i did looked at those [00:55:01] "Query Write Traffic" graphs [00:55:24] milimetric: fair enough [00:56:17] oh, so we know the 2006 error code was not showing up before the 4th? 
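To make springle's index suggestion above concrete: both existing keys on staging.mforns_warehouse_page lead with `wiki`, and because the verification query constrains `wiki` with an IN/range clause covering dewiki and enwiki, the optimizer cannot use the trailing `page_id` column of i1 for the join. An index that leads with `page_id` fixes that. A minimal sketch with a placeholder connection string, using the i3 name that shows up later in the EXPLAIN (whether the real i3 is (page_id, wiki) or just (page_id) isn't shown in the log):

    from sqlalchemy import create_engine, text

    # Placeholder DSN; the real staging host and credentials differ.
    engine = create_engine("mysql://user:pass@analytics-store/staging")
    conn = engine.connect()

    # Leading with page_id lets the join seek on page_id even though
    # `wiki` is constrained by an IN/range clause (springle's point above).
    conn.execute(text(
        "ALTER TABLE mforns_warehouse_page ADD INDEX i3 (page_id, wiki)"
    ))

    # Spot-check that the per-row lookup the join performs can now use i3.
    for row in conn.execute(text(
        "EXPLAIN SELECT * FROM mforns_warehouse_page "
        "WHERE page_id = 12345 AND wiki IN ('dewiki', 'enwiki')"
    )):
        print(row)

    conn.close()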
[00:56:22] milimetric: sorry , send it too soon [00:56:38] milimetric: it did not appear to such a great extent before no [00:56:53] milimetric: now we had 20/30 instances of that in the log [00:57:05] milimetric: before i found 6 [00:57:19] milimetric: and before before none [00:58:11] index created, running query [00:59:03] nuria: oh, cool, that's good to know [00:59:15] milimetric: that is not to say that huge inserts will not cause problem, they will [00:59:26] mforns: EXPLAIN looks a bit better. i3 index is in use [00:59:41] nice [00:59:44] the "buffer filling up" idea is not totally dead though, it is possible that the burstiness of the data changed such that it started causing this bug to surface [00:59:50] but then that would be "really" weird [01:00:15] milimetric: yes, it is is also true that buffering without a cap (upper bound) is not so hot [01:00:31] why would that be wierd, if it's a connection timing/timeout issue? there are many layers there [01:01:03] springle: because with the same inflow events [01:01:16] springle: we shoudl it it happening now regardless of the code [01:01:22] it'd be weird that all of a sudden we start getting smooth continuous data if before we were getting bursty data [01:01:42] ah that too.. yes [01:01:56] and it's not supported by looking at the monitoring, though it's possible the resolution is not high enough there (too much room for error in the timing of the metrics being updated) [01:02:24] nuria: but EL does not own the db master or the network. it shares former with OTRS and others, and letter with production traffic [01:02:35] there are other possible factors here [01:03:11] springle: yes there are but the environmental factors have not changed with the code revert [01:03:40] i don't think we know that :) [01:04:11] we can possibly confidently say so, but we don't have enough data imo [01:04:50] springle: aham, ok, let's say "we did not change anything else but the code we are running" on our side [01:05:00] from this morning to now... [01:05:41] nuria: so are you seeing the error since this morning btw? [01:06:24] so while i agree with milimetric that connection management could be done better, i have a hard time believing that would start on the 4th at the same time we deploy code (could be but not likely) [01:06:35] milimetric: lemme look, i looked last at 2pm [01:07:08] the master last saw an abort from eventlog at 2015-02-10 20:28:11 UTC [01:07:22] hm... interesting :) [01:07:26] that's about when we reverted [01:07:58] as nuria said this morning, if that revert fixes it, we've all gotta get fitted for glasses because I can't see why that would be :) [01:08:03] circumstantially convincing, i'll grant you :) [01:08:07] jaja [01:08:14] ay ay [01:08:55] but we do have some theories that say it would only start happening again tomorrow morning, after it starts ramping up again [01:09:03] milimetric, nuria: how about you give it a week and see [01:09:36] sure, if it works i'm not touching it [01:09:45] s/week/$interval/ [01:09:47] so in the last period of time since the log rotated there is only 1 "gone away" [01:10:02] hm, ok, so definitely "better" [01:10:11] so we'll check again tomorrow [01:10:28] and if it holds, we'll take $interval to look at the code and monitor it [01:10:39] because yea, primary thing is to make sure we don't lose more data we'll have to then backfill [01:11:29] actually, one more data point... i recal ori changing something about the consumer reconnection behavior during the cafeteria meeting we had in SF. 
i think it was to make it retry on failure [01:11:29] springle: I think we shall see the difference tomorrow [01:11:51] !!! yes, when nuria and I had palpitations, I remember [01:11:52] could that have been the last config change before the Feb4th deployment [01:12:08] mmm, i think he changed it live [01:12:16] springle: ya, when i say, don't do that RIGHT NOW but hey .... [01:12:36] the db master seeing the double-abort behavior is kind of odd, like a race condition [01:13:02] * springle musing [01:13:20] idk much about sqlalchemy under the hood [01:13:55] all dbas love ORMs eh? [01:14:48] guys, the query finished, but with wrong results, I'll have to figure out what happened [01:15:03] ORMs are fine. connection pools and the underlying stacks can be fun though [01:15:09] but I should go to sleep now, it's 2:15am for me now [01:15:24] tomorrow I can have a look [01:15:32] springle, milimetric will monitor logs tomorrow again and will report [01:15:39] ok [01:15:49] mforns: yess please, bedtime [01:16:07] see you tomorrow! [01:16:09] good night [01:16:14] night [01:16:28] springle: I missed how can see this problem in the graphs you sent though [01:16:30] ok all, i should wrap up for the night, I'm not 0.5 years younger like Marcel anymore [01:16:31] :P [01:16:45] hahah ciao [01:16:54] springle: what is the good graph to look at? [01:16:57] mforns: thanks - I tried that, and it was killed: http://quarry.wmflabs.org/query/1949 [01:17:21] nuria: for what? [01:17:31] huh: [oops - missed mforns.....] [01:17:46] springle: to see the dropout on transactions from EL fom the 4th [01:18:20] springle: ah , this one" "Query Write Traffic" graphs! [01:19:11] nuria: it's like one or two per hour. nothing much shows on graphs. i've been watching the master error log with extra warnings turned on, but inly since you reported the issue yesterday [01:19:59] springle: ok, thank you. will keep a eye on it myself from the vanadium side and will try to reach ori to see if he has any ideas [01:21:18] springle: one more thing, i was trying to see on the information-schema if any tables were created from the 4th [01:21:34] springle: but queries were too slow, is that something you can find out? [01:22:53] nealmcb: mforms is gone to sleep but note that quarry might have stricter limits when it comes to query times, maybe higher than what a user can run on its own [01:23:26] nuria: yes - 10 minute limit. I'll try smaller queries, and be in touch... thanks! [01:23:39] nealmcb: k [01:24:23] nuria: http://aerosuidae.net/paste/e/54daae53 [01:24:42] springle: excellent! thank you [01:25:12] also, information_schema queries running slow is usually people doing SELECT *. most of those views must be generated on the fly, so only select fields you want [01:30:35] nuria: did you mean the channel? [01:30:41] i thought you were inviting me to chime in on the mailing list [01:31:09] ori: no, teh channel, but you can read the backlog a bit to see springle comments [01:31:18] yeah, doing so now. sorry for interjecting. [01:31:18] ori: sorry [01:32:50] [00:49:12] so now if someone with sudo on vanadium can check the logs since this morning, we would see if the revert fixed the problem [01:32:58] what would you like me to check for? "server has gone away"? [01:33:02] [00:51:07] (or make it multi-threaded) [01:33:11] it is already; it is easy to put a cap on batched inserts [01:36:48] nuria: there's a "server has gone away" error in the current log [01:37:16] not timestamped, sadly. 
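On the point above that the start of the dropout can be located by counting rows per day in the database: a sketch of that check against a schema expected to be roughly constant, such as ServerSideAccountCreation mentioned later. The table name's revision id is hypothetical, and the EventLogging `timestamp` column is assumed here to be the usual 14-character MediaWiki format:

    from sqlalchemy import create_engine, text

    # Placeholder DSN; EventLogging tables are named SchemaName_<revid>.
    engine = create_engine("mysql://user:pass@analytics-store/log")
    conn = engine.connect()

    rows = conn.execute(text(
        "SELECT LEFT(timestamp, 8) AS day, COUNT(*) AS events "
        "FROM ServerSideAccountCreation_5487345 "   # hypothetical revid
        "WHERE timestamp >= '20150201000000' "
        "GROUP BY day ORDER BY day"
    ))
    for day, events in rows:
        print(day, events)   # a sudden drop marks the start of the outage

    conn.close()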
[01:37:17] https://dpaste.de/mE2H/raw [01:38:04] nuria: I'm having a bit of a hard time following the investigation. Could you possibly recap what we know as of now? [01:38:42] ori: yes, saw that "gone away", in the prior logs -when we were having problems we saw that 30 times on a given day [01:39:03] ori: remains to be seen if we will see more of those tomorrow. Recap: [01:39:23] ori: from the 4th (5th pst) EL starts dropping events big time [01:39:40] (i'll let nuria explain, but no worries about running stuff on vanadium, nuria did it) [01:39:57] well, we know why that is -- it's because events were failing validation [01:39:59] ori: some of it were the capsule issues but those appear as clear errors and were easy to fix [01:40:14] could you be more precise? what is "dropping events"? [01:40:18] ori: but the dropout did not affected validation of events [01:40:36] what do you mean by "dropout"? [01:40:47] ori: meaning that a events validated >>> events inserted on db [01:40:59] ori: in this case [01:41:04] ori: makes sense? [01:41:08] for example: [01:41:15] so there are definitely events that have passed validation which are not seen in the database? [01:41:17] about how many? [01:41:35] https://www.irccloud.com/pastebin/Apdch0eZ [01:41:52] ori: alot [01:42:47] in the case of one of the schemas hundreds of thousands [01:42:49] navigation timing has had a code change in that period, so a drop in navigation timing by itself is not sufficient [01:42:56] which schema? [01:43:02] ori: no, it's all of them [01:43:22] https://www.irccloud.com/pastebin/920Tlx1W [01:44:07] nod, ok [01:44:25] ServerSideAccountCreation is a good example because it is quite constant and, being server-side, not expected to be affected by the validation changes [01:44:37] ori: and , at the same time, we see connection droputs in the db [01:45:00] ori: in the logs for the db consumer [01:45:57] selects are huge (you can unzip one log from the 9th and see it) and that in itself is kind of worrisome as it seems events are queueing up too much on teh buffer [01:46:42] ori: and * i think* that the huge selects are what is killing the connection (those take minutes to process) but this is just a hyphotesis [01:47:28] OK, so I looked at a single 'server has gone away' traceback [01:47:35] and the number of rows it was trying to insert is 71,336 [01:47:46] ori;aham, yes [01:48:20] there are any number of reasons why the batch can be that big, but they all have a single solution, which is to cap the insert size at something much, much lower [01:48:30] ori: but there is more [01:48:42] like i did with https://gerrit.wikimedia.org/r/#/c/189764/ [01:48:50] ori: yes saw that [01:48:55] more what? [01:49:14] but since we reverted tha code to what it was on the 19th of dec [01:49:33] ori: things are better which doesn't make sense [01:50:10] it makes total sense. there is a single worker thread doing the inserts. if it starts to lag a little, things will start queueing up. the queue will get bigger and bigger the longer it runs. [01:50:30] if you reverted the code, you also had to restart the daemon, which means the problem will take some time to snowball to the same severity [01:50:46] ori: in prior cases it took no time [01:50:47] this theory is supported by the fact that the one 'server has gone away' error in the current logs is recent. 
[01:51:36] ori: and the fact that i restarted the service before and saw the issue show up again within a day [01:52:40] i think it is worth investigating the cause, but again, we know that there are multiple possible causes for this, and we need to avert this problem from happening categorically [01:53:14] so i recommend going with , because it will happen again (and has happened again, at least once) with the old code. [01:53:42] +1 [01:53:56] look at springle all quite there ... [01:54:10] springle: do you have a guesstimate as to what an optimal insert size is? (in terms of either rows or kilobytes)? [01:54:47] without understanding the cause i doubt we can say that the gerrit change will fix it, it will sure mitigate one of the scenarios [01:55:59] ori: it's really a client decision. serverside timeout is many hours and 50000+ is no big deal; just slow. in each case of aborted connections the master saw a client disconnect after 100-200 seconds [01:57:00] springle, ori: sqlalchemy will kill you before hours that is for sure , but it will not kill you in couple minutes [01:57:15] i think having a maximum batch size in the hundreds is optimal, because we want to know if things are going wrong quickly, so we want to keep the time we give the server to respond low [01:57:24] unless we can have config specifically saying so [01:57:51] yeah, idk why it happened at minutes. we'd need to dump tcp [01:59:11] springle, ori: we have queries running in wikimetrics that take several minutes w/o issues with a very simplistic sqlalchemy config (now those are not inserts) [01:59:21] http://docs.sqlalchemy.org/en/rel_0_8/faq.html#mysql-server-has-gone-away suggests client code thread-safety issues could be to blame [02:01:00] http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#is-the-session-thread-safe [02:01:30] see the paragraph that starts with: [02:01:31] "Making sure the Session is only used in a single concurrent thread at a time is called a “share nothing” approach to concurrency. But actually, not sharing the Session implies a more significant pattern; it means not just the Session object itself, but also all objects that are associated with that Session, must be kept within the scope of a single concurrent thread." [02:01:57] ori: right, that is what we do in wikimetrics, we refactor that big time [02:02:11] ori: but in this case the periodic thread is only 1, right? [02:02:51] well, there are two things that could be wrong [02:03:00] so the main process starts the periodic thread that -synchronously- spawns a thread and waits [02:03:08] one is that 'engine' and 'meta' are actually created in the main thread and then handed over to the worker [02:04:17] is the worker linked to the main thread, or is there a possibility that the worker thread is still alive after the main thread crashed? 
[02:05:01] the other thing we do that is probably unsafe is: if the worker crashes, and there are events remaining in the queue, we have the main thread try to write them before it exits [02:05:17] from my prior tests i'd say that no, teh worker thread dies [02:05:49] these sort of thread-safety issues can be extremely difficult to reproduce [02:05:55] the solution is to just not have them [02:06:00] ori: yes, this second one is not optimal but note that the other thread has crashed so you are not sharing a session [02:06:04] i didn't know about sessions and related objects not being thread-safe [02:06:35] yeah, but rather than thinking that such usage across threads is possibly or probably ok, we should just not do it. [02:06:56] ori: in our case * I think* this is not the issue because a thread dying will take the parent process with [02:07:28] "There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies." -- C.A.R. Hoare [02:07:33] we want the former, not the latter :) [02:07:35] ori: and thus the event queue that it holds [02:08:05] ori:but our problem is the event queue getting too large to be handled by the thread i think [02:08:32] ori: see, something like thread starts with 2 secs worth of events, it takes 2.5 secs to insert those, [02:09:02] starts again but now it has 2.5 secs worth of events and takes 3 secs to process those, .....backs up [02:09:13] well, obviously if the rate at which the database can reliably process inserts is slower than the rate at which events are coming in, we are screwed [02:09:17] but i don't think that this is what is happening. [02:09:42] the database should be able to handle more events/sec than EL receives, which means that temporary hiccups followed by a spike should be survivable [02:09:43] ori: so eventually is dealing with 5 minute worth of events and that is 300*60* 5 = 60.000 [02:10:21] ori: then , how do you explain a select with 70.000 records when the throughput is 100 per sec? [02:10:23] i think we have some work to do before we ask springle to scrutinize the database's behavior [02:10:45] nuria: 1) main thread starts [02:10:48] 2) worker thread starts [02:10:51] 3) worker thread crashes [02:10:57] or freezes [02:11:02] 4) that crashes main thread [02:11:02] 4) main thread keeps queuing events [02:11:08] it does not [02:11:11] 5) main thread finally actually crashes [02:11:22] 6) main thread attempts to insert all the events it has accumulated in the interim [02:12:00] i don't think it's essential to reconstruct everything that happened if we found two things that we know are broken and should be fixed [02:12:22] we can fix those and see if the problem goes away [02:12:38] even if it doesn't make the problem go away, it's not wasted effort, because these are things we have to fix regardless [02:13:11] (the two things being session objects shared between threads and no cap on the insert batch size) [02:14:15] Agreed we have to fix the cap size [02:14:45] but the sharing across threads is really not happening here if you do not let the main thread clean up [02:15:41] and even in that case the "large" buffering is not explained but again, I AGREE, no upper bound for insert is no good [02:17:24] so, 1) let's look at logs tomorrow 2) let's merge the capsize [02:18:06] +1 [02:20:13] ori: ok, thanks for taking a look! 
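For scale: at a few hundred events per second, the five-minute backlog described above is on the order of 30,000-90,000 rows, which lines up with the 71,336-row insert ori found in the traceback. The cap ori proposes (gerrit 189764) boils down to never handing the database more than a small, fixed number of rows per statement. A minimal illustrative sketch, not the actual EventLogging code:

    import itertools

    MAX_BATCH = 400  # assumed cap; the value chosen in gerrit 189764 may differ

    def insert_capped(events, insert_batch):
        """Write `events` in chunks of at most MAX_BATCH rows.

        `insert_batch` stands in for whatever function performs the actual
        multi-row INSERT. Even if the in-memory queue has snowballed to
        tens of thousands of events, each statement sent to MySQL stays
        small, so a single insert can neither take minutes nor blow past
        packet limits.
        """
        it = iter(events)
        while True:
            chunk = list(itertools.islice(it, MAX_BATCH))
            if not chunk:
                break
            insert_batch(chunk)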
[02:20:23] thank you (and thanks, springle, milimetric) [02:21:43] thanks ori [02:25:49] note to self: look at this discussion tomorrow [02:26:07] Analytics-Kanban, Analytics-Cluster: Geocoding UDF should be more resilient - https://phabricator.wikimedia.org/T89204#1029775 (Nuria) NEW [04:09:45] Analytics-Wikistats: Provide total active editors for December 2014 - https://phabricator.wikimedia.org/T88403#1029883 (Eloquence) @ezachte @DarTar @ArielGlenn any update on whether we can realistically get this done by end of this week, one way or another? [07:25:54] springle: so it turns out thread-safety isn't an issue; it only becomes an issue when you start getting into ORM magic. but we don't use sqlalchemy's ORM at all in EventLogging; we use it to get a nice, consistent API over mysql and sqlite (used for unit tests). [07:26:18] the engine object doesn't really have any mutable state [07:26:32] (just sharing the results of my research) [07:27:08] ori: good to know [07:31:02] springle: any thoughts on which (if any) of the strategies described in are appropriate? [07:35:05] ori: since we'r dealing with so few connections, usually just one, pessimistic handling seems like a good plan [07:36:54] * ori nods [07:37:16] there is actually a ping method in the mysql client api, but idk if sqlalchemy can use it. even so it would be little different to SELECT 1 in the grand scheme [07:42:38] yeah, it looks like there's a way to call it [07:43:07] the difference between that and 'select 1' is that with mysql_ping there is no statement to parse [07:43:26] yep [07:51:40] thanks [11:28:29] Hi folks [13:18:38] Multimedia, Analytics: Set up varnish 204 beacon endpoint for virtual image views - https://phabricator.wikimedia.org/T89088#1030789 (Gilles) a:Gilles [13:46:00] Analytics-Tech-community-metrics, Possible-Tech-Projects: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#1030907 (Qgil) Wikimedia will [[ https://phabricator.wikimedia.org/T921 | apply to Google Summer of Code and Outreachy ]] on Tuesday, Feb... [13:46:05] Analytics-General-or-Unknown, Possible-Tech-Projects: Pageviews for Wikiprojects and Task Forces in Languages other than English - https://phabricator.wikimedia.org/T56184#1030909 (Qgil) Wikimedia will [[ https://phabricator.wikimedia.org/T921 | apply to Google Summer of Code and Outreachy ]] on Tuesday, Febr... [13:46:22] Analytics-Tech-community-metrics, Possible-Tech-Projects: Support for vizGrimoireJS-lib widgets - https://phabricator.wikimedia.org/T89132#1030925 (Qgil) Wikimedia will [[ https://phabricator.wikimedia.org/T921 | apply to Google Summer of Code and Outreachy ]] on Tuesday, February 17. If you want this task to... [13:46:25] Analytics-Tech-community-metrics, Possible-Tech-Projects: Improving MediaWikiAnalysis - https://phabricator.wikimedia.org/T89135#1030923 (Qgil) Wikimedia will [[ https://phabricator.wikimedia.org/T921 | apply to Google Summer of Code and Outreachy ]] on Tuesday, February 17. If you want this task to become a... [13:46:47] Analytics-Wikistats: Provide total active editors for December 2014 - https://phabricator.wikimedia.org/T88403#1030929 (ezachte) As per https://phabricator.wikimedia.org/T88209 we were wildly optimistic in dump processes catching up in limited time after restart. (and no extra servers were allocated). I did... 
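The pessimistic disconnect handling ori and springle settle on above is the standard SQLAlchemy recipe: test each pooled connection with a trivial round trip as it is checked out, and discard it if the test fails. A sketch with a placeholder DSN (MySQLdb connections also expose .ping(), which as noted avoids parsing a statement, but SELECT 1 is the portable variant):

    from sqlalchemy import create_engine, event, exc

    engine = create_engine("mysql://user:pass@db/log")  # placeholder DSN

    @event.listens_for(engine, "checkout")
    def ping_connection(dbapi_connection, connection_record, connection_proxy):
        # Pessimistic handling: verify the connection is alive before the
        # consumer gets it. A stale connection is thrown away and the pool
        # retries with a fresh one instead of failing the insert.
        cursor = dbapi_connection.cursor()
        try:
            cursor.execute("SELECT 1")
        except Exception:
            raise exc.DisconnectionError()
        finally:
            cursor.close()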
[13:49:11] Multimedia, MediaWiki-extensions-MultimediaViewer, Analytics: Collect more data in MediaViewer network performance logging - https://phabricator.wikimedia.org/T86609#1030939 (Gilles) [13:49:19] Multimedia, MediaWiki-extensions-MultimediaViewer, Analytics: Collect more data in MediaViewer network performance logging - https://phabricator.wikimedia.org/T86609#1030940 (Gilles) a:Gilles>None [14:50:23] Analytics-Cluster: Increase and monitor Hadoop NameNode heapsize - https://phabricator.wikimedia.org/T89245#1031110 (Ottomata) [15:00:17] milimetric: sorry, be right there... [15:00:19] 2 mins [15:14:09] ah joal, hiyaaa [15:14:38] Hey ottomata :) [15:14:48] Nice to see you there [15:26:03] operations, Analytics-Cluster: Increase and monitor Hadoop NameNode heapsize - https://phabricator.wikimedia.org/T89245#1031236 (Ottomata) [15:32:04] operations, Analytics: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#1031238 (Ottomata) [15:41:28] Analytics-Kanban: Reliable scheduler computes Visual Editor metrics - https://phabricator.wikimedia.org/T89251#1031255 (Milimetric) NEW a:Milimetric [15:43:16] joal, are you our dev in analytics? [15:45:36] *new dev [15:47:16] Analytics-Kanban, Analytics-Cluster: Geocoding UDF should be more resilient - https://phabricator.wikimedia.org/T89204#1031271 (kevinator) p:Triage>Normal [15:50:05] Analytics-Kanban: Reliable scheduler collects Visual Editor deployments - https://phabricator.wikimedia.org/T89253#1031289 (Milimetric) NEW a:Milimetric [15:54:42] Analytics-Kanban: Controls help you navigate between the Visual Editor sunburst visualizer and timeseries visualizer - https://phabricator.wikimedia.org/T89254#1031307 (Milimetric) NEW a:Milimetric [15:55:15] halfak : Yes, I am a newbie :) [15:56:03] Welcome! I'm Aaron. I'm on the Research side of Analytics. [15:56:08] Nice to meet you. :) [15:57:02] Analytics-Kanban: New host for Visual Editor visualizations - https://phabricator.wikimedia.org/T89255#1031328 (Milimetric) NEW a:Milimetric [16:02:29] Analytics-Kanban: Script adds indices to the Edit schema on analytics-store - https://phabricator.wikimedia.org/T89256#1031351 (Milimetric) NEW a:Milimetric [16:03:17] Analytics-Cluster: Improve hive partition deletion script to work with refined webrequest tables. - https://phabricator.wikimedia.org/T89257#1031362 (Ottomata) NEW a:Ottomata [16:03:22] Analytics-Kanban: New host for Visual Editor visualizations - https://phabricator.wikimedia.org/T89255#1031370 (Milimetric) [16:08:11] (PS1) QChris: Document changes of recent v0.0.5 release in changelog [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189976 [16:08:13] (PS1) QChris: Reflow pom of refinery-tools with indentation of 4 spaces [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189977 [16:08:15] (PS1) QChris: Add Percent encoder and decoder [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189978 [16:08:17] (PS1) QChris: Update Java to version 1.7 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189979 [16:08:19] (PS1) QChris: Add Referer classifier [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189980 [16:08:21] (PS1) QChris: Add parser for media file urls [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189981 [16:09:42] (CR) Ottomata: [C: 2 V: 2] "Seems nuts, but since it is overridable as a property, I'm for it!" 
[analytics/refinery] - https://gerrit.wikimedia.org/r/189218 (owner: QChris) [16:10:37] ottomata: WHAT IS NUTS ?????????????????? [16:10:38] :-P [16:11:45] (CR) Ottomata: [C: 2 V: 2] Add per webrequest_source 5xx tsvs to legacy_tsvs [analytics/refinery] - https://gerrit.wikimedia.org/r/189219 (owner: QChris) [16:12:10] ha, just crazy to get the sources from the coordinator file name, but i like the use of convention here for the default value! [16:12:33] (CR) Ottomata: [C: 2 V: 2] Document changes of recent v0.0.5 release in changelog [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189976 (owner: QChris) [16:12:47] (CR) Ottomata: [C: 2 V: 2] Reflow pom of refinery-tools with indentation of 4 spaces [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189977 (owner: QChris) [16:12:51] And it allows to reuse the same HQL file for different sources by launching them through different coordinators. [16:13:03] The 5xx-split got really simple by that. [16:13:40] (CR) Ottomata: [C: 2 V: 2] Add Percent encoder and decoder [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189978 (owner: QChris) [16:14:03] (CR) Ottomata: [C: 2 V: 2] Update Java to version 1.7 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189979 (owner: QChris) [16:14:37] qchris: this seems like it might fit better into the Webrequsts class [16:14:37] https://gerrit.wikimedia.org/r/#/c/189980/1/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/RefererClassifier.java [16:14:40] rather than its own class [16:15:38] Mhmm. Not sure. [16:16:50] That would turn the enum from RefererClassifier.Classification into Webrequest.RequestClassitication or something like that... [16:17:02] Meh. k. [16:17:07] I'll rework that change. [16:17:31] Is it ok to keep the UDF separate? [16:17:50] s/separate/as separate class/ [16:18:01] Or should that get merged somewhere else too? [16:19:10] yes UDF should be separate class [16:19:33] Hi halfak, nice to meet you too (sorry for delay ;) [16:20:00] joal, :) [16:21:13] (CR) QChris: [C: -1] "I'll rework that change to add the classifier to Webrequests" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189980 (owner: QChris) [16:39:29] (CR) Ottomata: (WIP) project class/variant extraction UDF (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/188588 (owner: OliverKeyes) [16:47:50] (PS9) Ottomata: Add new UDF to determine client IP given source IP and XFF header [analytics/refinery/source] - https://gerrit.wikimedia.org/r/187651 (owner: Ananthrk) [16:48:01] (CR) Ottomata: [C: 2 V: 2] "Done!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/187651 (owner: Ananthrk) [16:48:45] nuria: even just one invocation of the "inner loop" of the election eligibility query takes more than 10 minutes, assuming I got the SQL right: http://quarry.wmflabs.org/query/1950 [16:48:58] mforns: ^ [16:50:26] I got user 24902 out of a query for the really big users (i.e. a small number of users). so probably lots of edits there.... 
[17:03:47] nealmcb: let me see [17:04:33] nealmcb: right, labs tables are views over prod tables and indexing is just the same that query might take forever for enwiki [17:04:49] *indexing is just NOT the same [17:05:35] nealmcb: and not only that, labs is probably most used right at this time [17:06:50] nealmcb: I think you might be better of asking halfak if he has any table in labs with that data pre-processed, or a dataset that contains that info [17:07:05] o/ [17:07:10] reading scrollback [17:07:59] nealmcb, what exactly are you looking for? [17:08:16] Also, it looks like that query should use 'revision_userindex' rather than 'revision' [17:08:42] ah, right halfak , forgot about that one [17:10:26] Looks like it is still super slow. Not sure what is up with that. Reading this index should be super-fast. [17:12:23] (PS1) Ottomata: Update changelog for 0.0.6 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190005 [17:12:42] (CR) Ottomata: [C: 2 V: 2] Update changelog for 0.0.6 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190005 (owner: Ottomata) [17:14:38] nealmcb, ping [17:15:37] nealmcb, did you see the query I sent to you yesterday? [17:15:53] I pasted a link to a pastebin on this channel [17:16:06] ottomata: You about to release 0.0.6? [17:16:15] nealmcb, here: http://pastebin.com/LXJ2Br5u [17:16:16] yes [17:16:33] merged the ip util stuff [17:16:50] Would it be ok to do anoethre release in the next few days? [17:17:11] sure [17:17:14] releasing is a bit easier now [17:17:14] Ah. Ok. [17:17:20] Cool. Thanks. [17:18:15] sorry for the noise then. [17:18:20] yeah, you should be. [17:18:22] :p [17:18:26] gonna get some lunchhhhh [17:24:34] mforns_brb: According to halfak i think that query also needs to use revision_userindex [17:35:52] nuria / ori: regarding the session sharing thing. The solution we implemented in Wikimetrics would pass the Hoare test, I think. It actually simplified the code and our understanding of it. But I also agree with nuria that the main thread should not try to clean up after the worker fails. It should just fail hard at that point. [17:36:28] I also agree that we should fix these small things before we try to think about the larger problem. [17:36:51] milimetric: agreed, I do not the session sharing is our problem. 
[17:37:11] Keep in mind though that the MySQL error we're seeing is almost certainly due to max_allowed_packet being exceeded by the large insert statements [17:37:17] sorry "i do not think the session sharing is our problem" [17:37:19] the default is 1MB and we're not changing the default [17:37:47] right, so our session sharing probably is neither a root nor a contributing cause [17:39:04] how the code we deployed since the 4th causes this snowball effect is Wacky Debug Land Adventure Time, it might be easier to figure out after we make the other fixes though [17:41:31] Analytics-Kanban, Analytics-Cluster: Implement Last Visited cookie [34 pts] {bear} - https://phabricator.wikimedia.org/T88813#1031743 (bd808) [17:51:17] (PS1) Ottomata: Add 0.0.6 refinery jars and update symlinks [analytics/refinery] - https://gerrit.wikimedia.org/r/190014 [17:51:25] (PS2) Ottomata: Add 0.0.6 refinery jars and update symlinks [analytics/refinery] - https://gerrit.wikimedia.org/r/190014 [17:51:43] (CR) Ottomata: [C: 2 V: 2] Add 0.0.6 refinery jars and update symlinks [analytics/refinery] - https://gerrit.wikimedia.org/r/190014 (owner: Ottomata) [17:55:06] !log deployed refinery 0.0.6 jars to HDFS [17:58:54] Analytics-Kanban: Reliable scheduler collects Visual Editor deployments - https://phabricator.wikimedia.org/T89253#1031809 (Milimetric) [18:32:17] ottomata: would you be so kind as to take a look at this one change for EL alarms: https://gerrit.wikimedia.org/r/#/c/189588/ [18:36:33] nuria: done :) [18:36:41] ottomata: thank yousirrr [18:44:17] nuria, milimetric: not sure if you saw my follow-up to s.pringle, but there is no thread-unsafe session sharing [18:44:22] we don't use the ORM [18:57:24] milimetric: cron dashboards working well. thanks. [18:58:44] ori: no, did not see that, is it in the channel? [18:59:21] yeah [18:59:23] ori: ah , just read it [18:59:35] also sean recommending the pessimistic approach [18:59:50] (typical DBA :P) [18:59:55] ori: but wait , we do create sessions through the sqlalchemy meta objects, do we not? lemme see [19:00:14] we don't pass them around [19:01:19] heading to cafe, back shortly [19:01:43] ori: but we pass the meta, which contains the engine, to the thread ... right? [19:02:01] the engine is thread-safe [19:02:27] https://groups.google.com/d/msg/sqlalchemy/t8i3RSKZGb0/QxWshAS3iKgJ [19:02:34] "The engine is absolutely threadsafe." [19:02:42] from michael bayer, the principal author of sqlalchemy [19:03:52] ori: ah wait, it's the session [19:06:13] ori: is our session implicitily created from the engine when running inserts? [19:06:31] no [19:10:19] (PS2) QChris: Add parser for media file urls [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189981 [19:10:21] (PS2) QChris: Add Referer classifier [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189980 [19:10:56] Analytics-Engineering, Analytics-Kanban: Backfilling EL events from 06/02 to 10/02 - https://phabricator.wikimedia.org/T89269#1032057 (Aklapper) [19:11:11] Analytics-Engineering, Analytics-Kanban: Backfilling EL events from 06/02 to 10/02 - https://phabricator.wikimedia.org/T89269#1031859 (Aklapper) [Please add projects when creating tasks. Thank you!] [19:11:31] ori: so when is the session created then? [19:13:37] wikimedia/mediawiki-extensions-EventLogging#350 (wmf/1.25wmf17 - 3da44f5 : Mukunda Modell): The build passed. 
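On the max_allowed_packet point above: a multi-row INSERT of ~70,000 events at even a couple of hundred bytes each is several megabytes of SQL, far beyond a 1 MB max_allowed_packet, and an oversized packet is one of the classic causes of the client-side 2006 "server has gone away" error. A quick way to check what the server actually allows and size batches accordingly (placeholder DSN; the 200-byte average row size is an assumption):

    from sqlalchemy import create_engine, text

    engine = create_engine("mysql://user:pass@db/log")  # placeholder DSN
    conn = engine.connect()

    _name, value = conn.execute(
        text("SHOW VARIABLES LIKE 'max_allowed_packet'")
    ).fetchone()
    max_packet = int(value)          # bytes

    APPROX_ROW_BYTES = 200           # assumed average size of one encoded event
    safe_batch = max_packet // APPROX_ROW_BYTES
    print("max_allowed_packet=%d bytes, keep batches well under %d rows"
          % (max_packet, safe_batch))

    conn.close()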
[19:13:37] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/3da44f5a15f5 [19:13:37] Build details : http://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/50388357 [19:19:08] (CR) QChris: "Is this change still needed?" [analytics/refinery/source] (otto-geo) - https://gerrit.wikimedia.org/r/164264 (owner: Ottomata) [19:23:16] nuria: the "meta" object carries the information to open the session around to store_sql_events [19:23:23] and when the call to insert is executed: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/jrm.py#L225 [19:23:31] then the session is opened and closed right after [19:23:56] store_sql_events is called from both the worker and the main thread [19:24:02] milimetric: ok [19:24:04] but the session doesn't exist before the call but inside the call [19:24:17] so ori's right, there's no threading problem related to the session [19:24:27] regardless, that's not the main problem here as we talked above [19:26:02] Analytics: Report Signups in 2014 Oct-Dec - https://phabricator.wikimedia.org/T89276#1032107 (kevinator) NEW [19:28:39] milimetric: ya, agreed, i did not think it was a problem before [19:29:02] Analytics: Report New editors per month in 2014 Oct-Dec - https://phabricator.wikimedia.org/T89277#1032122 (kevinator) NEW [19:35:37] Analytics: Report New editors per month in 2014 Oct-Dec - https://phabricator.wikimedia.org/T89277#1032167 (kevinator) [19:40:08] Analytics: Report New editors per month in 2014 Oct-Dec - https://phabricator.wikimedia.org/T89277#1032179 (kevinator) Erik described the extrapolation method here: T88403#1030929 [19:40:40] Analytics: Report New editors per month in 2014 Oct-Dec - https://phabricator.wikimedia.org/T89277#1032198 (kevinator) [19:43:39] Analytics-Kanban, Analytics: Report Total pageviews (new definition) in 2014 Oct-Dec - https://phabricator.wikimedia.org/T88844#1032209 (kevinator) [19:47:10] Analytics: Number of new registrations for Oct-Dec 2014 - https://phabricator.wikimedia.org/T88846#1032241 (kevinator) [19:47:11] Analytics: Report Signups in 2014 Oct-Dec - https://phabricator.wikimedia.org/T89276#1032242 (kevinator) [19:50:28] Analytics: Report visitors (comScore) in 2014 Oct-Dec - https://phabricator.wikimedia.org/T89281#1032262 (kevinator) NEW [19:53:35] Analytics: Report New articles for 2014 Oct-Dec - https://phabricator.wikimedia.org/T89283#1032289 (kevinator) NEW [19:55:33] Analytics: Report Edits for 2014 Oct-Dec - https://phabricator.wikimedia.org/T89284#1032297 (kevinator) NEW [19:55:57] Analytics: Report Edits for 2014 Oct-Dec - https://phabricator.wikimedia.org/T89284#1032297 (kevinator) [19:59:25] Analytics-Tech-community-metrics: "Contributors new and gone" in korma is stalled - https://phabricator.wikimedia.org/T88278#1032325 (Qgil) Thanks! The new contributrs table is refreshed, but I'm still seeing very old results in "Who seems to be on the way out or gone?" 
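The scoping described above for store_sql_events is what makes the threading question moot: the Engine (carried on the meta object) is thread-safe and shared, while the connection is opened inside the call and closed before it returns. An illustrative sketch of that pattern, not the real jrm.py code:

    from sqlalchemy import create_engine

    # Created once in the main thread and shared; the Engine is thread-safe.
    engine = create_engine("mysql://user:pass@db/log")  # placeholder DSN

    def insert_events(table, events):
        """Mirrors the scoping described for store_sql_events (illustrative
        only): the connection is acquired inside the call and released
        before it returns, so nothing connection- or session-like is ever
        shared across threads."""
        conn = engine.connect()
        try:
            conn.execute(table.insert(), events)  # events: list of dicts
        finally:
            conn.close()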
[20:05:41] Analytics: Report metrics for Quarterly Report 2014 Oct-Dec - https://phabricator.wikimedia.org/T89024#1032335 (kevinator) [20:09:20] Analytics: Report New editors per month in 2014 Oct-Dec - https://phabricator.wikimedia.org/T89277#1032345 (kevinator) a:ezachte [20:09:30] Analytics-Wikistats: Provide total active editors for December 2014 - https://phabricator.wikimedia.org/T88403#1032346 (kevinator) a:ezachte [20:09:40] Analytics: Report New articles for 2014 Oct-Dec - https://phabricator.wikimedia.org/T89283#1032348 (kevinator) a:ezachte [20:10:10] Analytics: Report Edits for 2014 Oct-Dec - https://phabricator.wikimedia.org/T89284#1032350 (kevinator) a:ezachte [20:10:40] Analytics-Kanban, Analytics: Report Total pageviews (new definition) in 2014 Oct-Dec - https://phabricator.wikimedia.org/T88844#1032367 (kevinator) a:Tbayer [20:21:39] qchris, yt? [20:21:45] yup [20:23:47] nuria: What's up? [20:24:19] qchris: i was trying the backfilling on beta labs (so ahem, as to keep things nice in vanadium) [20:24:28] k [20:24:42] my command looks something like: [20:25:06] qchris: cat some-file | /usr/bin/python -OO /usr/local/bin/eventlogging-consumer @/home/nuria/backfilling/mysql-m2-master-BACKFILLING > log-backfill.txt 2>&1 [20:26:05] hola nuria. do you know if we have already a ua_parser for https://phabricator.wikimedia.org/T88504 in place? [20:26:06] qchris: I am replying logs that are already stored [20:26:34] https://www.irccloud.com/pastebin/orGwMjJ3 [20:26:56] * qchris looks [20:27:14] https://www.irccloud.com/pastebin/TqG9am2O [20:27:18] Looks good, doesn't it? [20:27:56] qchris: but .. what about pre-existing events... did you do any changes so they are handled nicely? [20:28:24] Not sure I understand. [20:28:37] leila: that report will use the user_agent udf, which is been in place for a while [20:28:47] The script bails out if it finds an event that is already in the database. [20:29:01] So you cannot get duplicate entries. That's good. [20:29:13] I see. thanks, nuria. [20:29:28] Or on second thought ... it looks like it's bailing out. [20:29:37] qchris: ah, sorry, I totally missunderstood [20:29:44] Was that log line the only output, or [20:29:49] was it just an excert? [20:29:58] s/excert/excerpt/ [20:30:11] qchris: a piece of it [20:30:35] That might make things harder. [20:30:50] Were all log lines for a run for different schemas? [20:31:05] qchris: a more complete excerpt: [20:31:12] https://www.irccloud.com/pastebin/Ld0bJTSL [20:33:27] The values looks strange ... mane %s and only a single valid entry. [20:33:32] s/mane/many/ [20:34:15] qchris: sorry, it was shortened, re-pasting [20:35:31] qchris: http://pastebin.com/tU98JGNw [20:35:36] * qchris looks [20:35:54] qchris: expected behaviour is that consumer logs duplicates but it doesn't halt, correct? [20:36:38] I only backfilled with an EventLogging version before batching was added. [20:36:55] There, expected behaviour for me was that the script exits upon the first duplicate. [20:37:08] So I could start many backfilling jobs. [20:37:16] Each with a separate 64K events block. [20:37:44] Blocks from before the outake, one could ignore. [20:38:02] The block that contains the start needs more attention. [20:38:23] Blocks during the outage should get backfilled without causing duplicates. [20:38:31] The block that contains the end of the outage needs more attention. [20:38:39] Blocks from after the outake, one could ignore. [20:38:40] qchris: but that works for a "total blackout" right? 
, for a "dropping events" situation [20:38:59] you have mixed events: some were inserted but others were not [20:39:17] Mhmm... [20:39:35] qchris: And I think you backfilled some of these right? [20:40:18] Oh right. I did. [20:40:33] qchris: i saw a "replace" flag on jrm.py [20:40:34] I used a patched EventLogging for that. [20:40:51] Hey ... it seems I even upstreamed that :-) [20:40:56] Right. [20:41:03] Let me read that code again. [20:41:19] qchris: ahhh, ok, did you do anything beside changing the "replace" flag on store_sql_events method? [20:41:45] operations, Analytics-Cluster: Install hadoop-lzo on cluster - https://phabricator.wikimedia.org/T89290#1032492 (Ottomata) NEW a:Ottomata [20:43:08] nuria: that EL password you pasted above is ok to be public, right? Because it's just labs? [20:43:09] https://www.irccloud.com/pastebin/Ld0bJTSL [20:43:22] milimetric: yes, it's on a wiki actually [20:43:30] milimetric: these are all tests in beta labs [20:43:59] milimetric: so the db is just the testing one that everytime you reboot the instance it empties out [20:44:40] just making sure it wasn't using the same pw in prod or anything :) [20:45:08] milimetric: good good, qchris pointed it out too [20:45:38] I'm not sure I followed what you guys were saying above, but: is it possible the whole batch is failing because of duplicate events instead of just partial failures before the batching code? [20:45:39] nuria: Ja, I think I used that to overcome the duplicate issues. [20:45:43] because there's no exception handling right? [20:46:18] qchris: ok, sounds good, triple checking here to make sure there was no other way [20:46:41] so batch up 2 seconds of events -> try to insert -> one failure kills the worker -> lots of events do not get inserted [20:47:47] milimetric: but these is backfilling and thus, it runs into event duplication [20:47:58] yup. [20:48:08] milimetric: in the real scenario the worker being killed kills the parent [20:48:22] no, I get that [20:48:28] milimetric: ah sorry [20:48:36] my point is, a "failure" after batching was implemented brings down all the events that were in the batch [20:48:42] so it's not enough to just keep the parent alive [20:49:07] we have to force it to not use insert_multi [20:49:15] nuria: You know how to pass the replace parameter? [20:49:25] milimetric: for the backfilling you mean? [20:49:29] yes [20:49:37] milimetric: ah yes, SORRY, now me comprendo [20:50:01] qchris: yes, I will do some local modifications and retry in beta labs again [20:50:07] mysql://user:pass@db?replace=1 [20:50:12] Should do the trick too. [20:50:22] https://gerrit.wikimedia.org/r/#/c/175339/ [20:50:30] oh! so that makes the insert not fail in the first place [20:50:31] qchris: ahhhh no, AS ALWAYS my method was a lot more low tech [20:50:32] oh that's better [20:50:32] sorry [20:51:29] nuria_lowtech should be my irc nick [20:51:45] nuria: I used a low tech variant too. [20:52:06] And while it was running, I rewrote the code a bit and discussed the code with o-ri. [20:52:35] Just compare Patch Set 1 to Patch Set 4 :-) [20:53:36] * qchris is glad that nuria remembered that I ran into this problem before. Because I had totally forgotten about it. [20:53:43] * qchris hugs nuria [20:53:59] ja ja [20:59:14] DarTar or anyone: question about the pageviews in pentaho (which is awesome ;) - what's the cutoff date there, i.e. until when are pageviews available? 
the last row is labeled "2015-01-01", and seems to be in progress: [20:59:19] 2014-11-01 16731250000 2014-12-01 16324187000 2015-01-01 6343599000 [21:00:11] ...just want to make sure that these 6.3billion aren't actually the february numbers (i.e. that the row labels are shifted by one month) [21:00:30] total is around 20B -- I'd cut off the first month and the last month [21:00:44] they are partial months [21:00:51] it would be better if they were not in the dataset [21:02:08] yes, i assume we had more than 6.3B ... but that is labeled january ("2015-01-01"), so either the data lags by about a month, or the label is off by a month [21:03:14] HaeB: January is truncated [21:03:14] milimetric: i will document all stuff to do with backfilling once i have filled a couple days so all these nuisances do not get lost, we will just forget them otherwise [21:03:20] only use data from 2014 [21:03:30] and exclude the first month of the series in 2013 [21:04:01] DarTar: ok, so the row labels are correct and the data lags, right? thanks [21:04:06] just wanted to double-check [21:04:32] that's ok, i only need oct-dec from 2013 [21:04:40] nuria: ok, but you mean nuances right? I hope we forget all the nuisances :) [21:05:43] milimetric: yessss ains ... [21:10:27] Analytics: Report metrics for Quarterly Report 2014 Oct-Dec - https://phabricator.wikimedia.org/T89024#1032553 (Tbayer) [21:10:28] Analytics-Kanban, Analytics: Report Total pageviews (new definition) in 2014 Oct-Dec - https://phabricator.wikimedia.org/T88844#1032551 (Tbayer) Open>Resolved [[https://docs.google.com/a/wikimedia.org/spreadsheets/d/1HmwBYpcqTUEsTE7e35Rf1TGZcsUwjG1F1OG4jZyJqno/edit#gid=313881876 | done]] using Pentaho (... [21:32:04] (PS1) QChris: Fix NPE in GeocodedDataUDF for countries with iso code but no name [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190092 (https://phabricator.wikimedia.org/T89204) [21:32:06] (PS1) QChris: Fix NPE in GeocodedDataUDF for location without lon/lat [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190093 (https://phabricator.wikimedia.org/T89204) [21:32:08] (PS1) QChris: Fix potential NPE in Geocode's subdivision extraction [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190094 (https://phabricator.wikimedia.org/T89204) [21:32:10] (PS1) QChris: Fix potential NPE stopping to abuse getters in Geocode data extraction [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190095 (https://phabricator.wikimedia.org/T89204) [21:32:12] (PS1) QChris: Harden GeocodeDataUDF's extraction of values against NPEs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190096 (https://phabricator.wikimedia.org/T89204) [21:34:49] (PS2) QChris: Harden GeocodeDataUDF's extraction of values against NPEs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190096 (https://phabricator.wikimedia.org/T89204) [21:36:01] ottomata: Feel like reviewing patches for ewulczyn geocoding issues? ^ [21:36:19] so many patches! :) [21:36:21] (Evil plot: I put them on top of the needed media file consumption changes) [21:36:23] operations, Analytics-Kanban, Analytics-Cluster: Increase and monitor Hadoop NameNode heapsize - https://phabricator.wikimedia.org/T89245#1032675 (kevinator) [21:36:38] But each of them really is atomic. 
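Returning to the backfilling discussion above (around 20:50): the only change needed is the ?replace=1 query parameter on the consumer's MySQL URI (gerrit 175339), which switches the writes to REPLACE semantics, so an event already present — same uuid under the unique key — overwrites itself instead of raising a duplicate-key error that, with batching, would abort the whole batch. A sketch of what those writes amount to at the SQL level, with a hypothetical table name and only the capsule columns assumed:

    from sqlalchemy import create_engine, text

    # Placeholder DSN; the replace=1 flag itself lives on the eventlogging
    # consumer URI, not on the SQLAlchemy engine URL.
    engine = create_engine("mysql://user:pass@db/log")
    conn = engine.connect()

    # With plain INSERT, re-playing an already-inserted event fails the
    # statement; REPLACE absorbs the duplicate instead.
    conn.execute(
        text("REPLACE INTO `SomeSchema_12345` (uuid, timestamp, wiki) "
             "VALUES (:uuid, :timestamp, :wiki)"),
        {"uuid": "abc123", "timestamp": "20150210120000", "wiki": "dewiki"},
    )

    conn.close()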
[21:39:33] operations, Analytics-Kanban, Analytics: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#1032686 (ggellerman) [21:40:13] (CR) Ottomata: "This would classify any URL, not just referers, right? Even though it will mostly be used with referers, maybe we should call it WMFUrlCl" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189980 (owner: QChris) [21:42:36] (CR) QChris: "I think that this is only useful for Referers, as "-" and the" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189980 (owner: QChris) [21:46:36] (CR) QChris: [WIP] Add Hive UDF for media url identification (7 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/169346 (owner: QChris) [21:46:38] (CR) Ottomata: Add parser for media file urls (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189981 (owner: QChris) [21:47:28] (Abandoned) QChris: [WIP] Add Hive UDF for media url identification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/169346 (owner: QChris) [21:49:24] Analytics-Engineering, Analytics-Kanban, Analytics-EventLogging: Spike on requirements to prune EL data - https://phabricator.wikimedia.org/T89293#1032714 (kevinator) NEW [21:49:38] (CR) Ottomata: [C: 2 V: 2] Fix NPE in GeocodedDataUDF for countries with iso code but no name [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190092 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [21:50:02] (CR) Ottomata: [C: 2] Fix NPE in GeocodedDataUDF for location without lon/lat [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190093 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [21:50:08] (CR) Ottomata: [V: 2] Fix NPE in GeocodedDataUDF for location without lon/lat [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190093 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [21:50:26] ottomata: Look at my "evil plot" above. [21:50:31] That's why they won't merge. 
[21:50:42] oh i know [21:50:48] i added some comments to the media file ones [21:50:52] reviewed those first [21:51:01] Just adding documentation to that one :-) [21:51:17] (CR) Ottomata: [C: 2 V: 2] Fix potential NPE in Geocode's subdivision extraction [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190094 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [21:51:41] (CR) Ottomata: [C: 2 V: 2] Fix potential NPE stopping to abuse getters in Geocode data extraction [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190095 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [21:52:18] (CR) Ottomata: [C: 2 V: 2] Harden GeocodeDataUDF's extraction of values against NPEs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190096 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [21:57:57] (PS2) QChris: Fix NPE in GeocodedDataUDF for location without lon/lat [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190093 (https://phabricator.wikimedia.org/T89204) [21:57:59] (PS2) QChris: Fix NPE in GeocodedDataUDF for countries with iso code but no name [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190092 (https://phabricator.wikimedia.org/T89204) [21:58:01] (PS2) QChris: Fix potential NPE stopping to abuse getters in Geocode data extraction [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190095 (https://phabricator.wikimedia.org/T89204) [21:58:03] (PS2) QChris: Fix potential NPE in Geocode's subdivision extraction [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190094 (https://phabricator.wikimedia.org/T89204) [21:58:05] (PS3) QChris: Harden GeocodeDataUDF's extraction of values against NPEs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190096 (https://phabricator.wikimedia.org/T89204) [21:58:07] (PS3) QChris: Add parser for media file urls [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189981 [21:59:05] (CR) QChris: Add parser for media file urls (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189981 (owner: QChris) [22:03:16] (CR) Ottomata: [C: 2 V: 2] Add parser for media file urls [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189981 (owner: QChris) [22:04:12] (CR) Ottomata: [C: 2 V: 2] "Cool w me." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/189980 (owner: QChris) [22:04:20] cascade merge now! ? [22:04:28] Sure! [22:04:36] But I think V+2 got lost :-( [22:04:46] oh rebase [22:04:55] They are rebased. 
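The five GeocodedDataUDF patches being merged above are Java changes in analytics/refinery/source; as a rough illustration only, not the refinery code, here is the same "harden extraction against missing values" idea sketched in Python. The record layout and field names are assumptions made for the example:

```python
# Illustration only: the real GeocodedDataUDF fixes are Java code in
# analytics/refinery/source. This Python sketch just shows the defensive
# "extract with defaults instead of raising" idea; the record layout and
# field names are assumptions made for the example.
UNKNOWN = "Unknown"

def extract_geocoded_data(record):
    """Pull geocode fields out of a possibly incomplete lookup result,
    returning placeholders instead of failing when a value is missing."""
    record = record or {}
    country = record.get("country") or {}
    location = record.get("location") or {}
    return {
        "country_code": country.get("iso_code") or UNKNOWN,
        "country_name": country.get("name") or UNKNOWN,  # ISO code can exist without a name
        "latitude": location.get("latitude"),            # may legitimately be absent
        "longitude": location.get("longitude"),
    }

# A country with an ISO code but no name no longer causes a failure:
print(extract_geocoded_data({"country": {"iso_code": "BQ"}}))
```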
[22:04:56] (CR) Ottomata: [V: 2] Fix NPE in GeocodedDataUDF for countries with iso code but no name [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190092 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [22:05:07] (CR) Ottomata: [V: 2] Fix NPE in GeocodedDataUDF for location without lon/lat [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190093 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [22:05:18] (CR) Ottomata: [V: 2] Fix potential NPE in Geocode's subdivision extraction [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190094 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [22:05:34] (CR) Ottomata: [V: 2] Fix potential NPE stopping to abuse getters in Geocode data extraction [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190095 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [22:05:45] (CR) Ottomata: [V: 2] Harden GeocodeDataUDF's extraction of values against NPEs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/190096 (https://phabricator.wikimedia.org/T89204) (owner: QChris) [22:05:51] phew, done. [22:05:52] thanks! [22:06:00] Woohooo! Thanks! [22:06:25] Now I only need to find someone with archiva credentials to do the release .. [22:06:29] * qchris looks around. [22:07:37] I guess I've killed enough of your time; I'll ask tomorrow. [22:07:43] Thanks again! [22:10:27] haha [22:10:38] you want a new version? [22:11:24] It's ok. [22:11:30] I'll ask tomorrow. [22:11:46] hah, ok [22:16:06] operations, Analytics: Increase HADOOP_HEAPSIZE (-Xmx) for hive-server2 - https://phabricator.wikimedia.org/T76343#1032804 (Ottomata) Oof, we might need a new machine for hive server. http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_hive_install.html analytics1027 only has... [22:28:12] Analytics-Kanban, Analytics: Report metrics for Quarterly Report 2014 Oct-Dec - https://phabricator.wikimedia.org/T89024#1032854 (kevinator) a:kevinator [22:29:05] Analytics-Kanban, Analytics: Report metrics for Quarterly Report 2014 Oct-Dec - https://phabricator.wikimedia.org/T89024#1025696 (kevinator) p:Triage>High [22:38:15] milimetric, yt? [22:38:21] hi mforns [22:38:32] hi, one question about sqlalchemy [22:38:54] EventLogging uses SQLite for Python unit testing [22:39:34] but SQLite is not compatible with the new MariaDB field changes that I'm making... [22:39:42] right [22:39:57] it's ok to not test in that scenario [22:40:09] the framework doesn't support it so we'll have to test manually on beta labs [22:40:35] ok, but... 12 of the current tests fail actually... [22:40:40] because of that [22:43:42] ha [22:44:02] uh... [22:44:22] yeah, this is why I moved away from this when I did wikimetrics; SQLite is nice in theory [22:45:20] I don't have a good answer, sorry, I would love to spend time on it but if you can't think of any hack to get around it I suggest you drop it and move on to the scheduler stuff [22:45:24] which is higher priority [22:48:15] oh, I see [22:48:54] ok, makes sense [22:49:05] thanks milimetric [22:49:54] * milimetric is all about focus: just say "no" [22:49:55] :) [22:54:12] xD [22:55:05] mforns / nuria: remember how I said initially that the filter / map bad performance was not so noticeable in the starburst? [22:55:14] aha [22:55:31] and also remember how I said something mysterious was causing dev tools to choke whenever I opened them to look at the starburst?
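Referring back to the SQLite-vs-MariaDB test discussion a few lines up: the decision there was to skip that path in unit tests and verify manually on beta labs. For completeness, one pattern that can sometimes paper over dialect-specific column types is SQLAlchemy's with_variant(), sketched below with an invented table; this is not EventLogging's actual schema, nor what was implemented:

```python
# Hedged sketch, not EventLogging's actual schema or code: SQLAlchemy's
# with_variant() lets a column use a MySQL/MariaDB-specific type in
# production while falling back to a generic type on other dialects, such
# as the SQLite engine used by the unit tests. Table and column names are
# invented for the example.
from sqlalchemy import Column, Integer, MetaData, Table, Text, create_engine
from sqlalchemy.dialects import mysql

metadata = MetaData()
events = Table(
    "events_sketch",
    metadata,
    Column("id", Integer, primary_key=True),
    # MEDIUMTEXT on MySQL/MariaDB, plain TEXT everywhere else (e.g. SQLite).
    Column("payload", Text().with_variant(mysql.MEDIUMTEXT(), "mysql")),
)

if __name__ == "__main__":
    engine = create_engine("sqlite://")  # in-memory database, as in the tests
    metadata.create_all(engine)          # works even though MEDIUMTEXT is MySQL-only
    print(engine.dialect.name)           # "sqlite"
```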
[22:55:33] :D [22:56:05] with filter, my laptop turns into a jet engine and dev tools dies [22:56:11] :o [22:56:19] without filter, my laptop is quiet as a mouse and dev tools works great [22:56:25] lesson learned man, filter SUCKS [22:56:34] :O [22:57:14] is starburst data processing so intensive? [22:57:21] the raw data I'm loading is, yes [22:57:28] I see [22:57:31] so it's the amount of stuff it has to iterate through in order to filter in the first place [22:58:23] milimetric, good to know [22:59:11] and this will also apply outside the browser, on the server side [22:59:14] I suppose [23:22:25] operations, Analytics-Kanban, Analytics: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#1032967 (Ottomata) Today I practiced this in Vagrant and in Labs. I'd like to do it one more time in labs. My preliminary procedure will be this: http://www.cloudera.c... [23:23:43] (PS1) Milimetric: [WIP] Add timeseries graph of key metrics [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/190113 [23:26:56] milimetric: wait, again, [23:27:14] milimetric: with the two loop approach things work *faster* right? [23:27:43] it turns out the speed is not very different [23:27:59] it is if you're doing like BAJILLIONS of requests [23:28:03] but for us, speed isn't the issue [23:28:19] I think the call stack overhead kills the memory to the point that dev tools stops working though [23:28:33] (call stack overhead of doing a function call for each item in the array) [23:28:59] milimetric: ah, ok [23:29:08] gotta run pick up Steph from the train, but I can tell you more tomorrow [23:29:11] ok [23:57:52] halfak: nuria: Background: https://en.wikipedia.org/wiki/Wikipedia_talk:Wikipedians#How_many_eligible_to_vote_here.3F [23:59:08] nealmcb, Cool. How far did you get? I might like to take a try at a couple of those queries. [23:59:43] nealmcb: I do not think you will be able to gather that data globally from the labs db in less than 10 minutes. You can try halfak's suggestion of using revision_userindex, but halfak likely has that data pre-compiled already, so you could benefit from that.
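On that last point, a hedged sketch of the revision_userindex approach against a single wiki's Labs replica. The eligibility rule used here (at least 300 edits before 2015-02-01) is a placeholder, not the criterion from the linked discussion, and the connection details assume the usual Tool Labs replica credentials file and *_p database naming; running something like this over every wiki is exactly the "globally in under 10 minutes" problem nuria warns about:

```python
# Hedged sketch of the revision_userindex suggestion, for one wiki only.
# The threshold (>= 300 edits before 2015-02-01) is a placeholder, not the
# eligibility rule from the linked discussion; connection details assume the
# usual Tool Labs replica credentials file and *_p database naming.
import os
import pymysql

QUERY = """
SELECT COUNT(*) FROM (
    SELECT rev_user
    FROM revision_userindex          -- index-friendly view of `revision`
    WHERE rev_user > 0               -- skip anonymous edits
      AND rev_timestamp < '20150201000000'
    GROUP BY rev_user
    HAVING COUNT(*) >= 300
) AS eligible
"""

def count_eligible(db="enwiki_p", host="enwiki.labsdb"):
    conn = pymysql.connect(
        host=host,
        db=db,
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            (count,) = cur.fetchone()
            return count
    finally:
        conn.close()

if __name__ == "__main__":
    print(count_eligible())
```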