[00:10:33] springle: you said that you did the m2 failover to db1046. But m2-master.eqiad.wmnet is still pointing to db1020 for me. [00:10:48] Also I could not find a relevant commit in operations/dns repository. [00:11:15] qchris: I said I redirected connections to db1046. it's using a tcp forwarder for the moment [00:11:23] Ah. [00:11:42] Ok. So it's expected to connect to db1020 right now. Right? [00:11:56] correct; ip hasn't changed [00:12:03] Ok. That explains things. Thanks. [00:13:01] qchris: did we manage to fork bomb vanadium somehow? [00:13:20] A recent EventLogging commit is backfiring a bit. [00:13:31] Python hitting recursion depth :-( [00:13:47] But I just arrived home again ... still investigating. [00:14:18] ah ok. a screen session I had left open on vanadium last night, failed with 'fork: cannot allocate memory' [00:14:52] Not sure if Python issues explain that. No clue. [00:15:31] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=vanadium.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=mem_report&c=Miscellaneous+eqiad [00:15:40] has the python consumer changed recently except for the db issue last night? [00:15:41] ^ explains it a bit. [00:15:57] Yes. Sadly enough. Db inserts are now batched. [00:16:04] wow. heh [00:16:06] It seems this is not working smoothly. [00:16:32] Why u no like single inserts? ;-) [00:16:41] haha [00:19:28] qchris: given this is all foobar, mind if I go ahead with the master CNAME change? it has to be done now, regardless, to complete the failover, and we may as well use the proxy IP since the firewall is sorted. [00:19:47] No. I do not mind. [00:20:05] Sounds like a good plan to do it now. [00:29:53] ori: It seems the multi-inserts are causing issues, and make python hit recursion limit. All memory is grabbed, OOM activates. [00:30:19] Stack traces show repeat utils.py:28 a lot. [00:30:42] Is it ok to roll-back those two commits for now, or would you suggest a quick fix? 
[00:30:51] (Reverts are prepared in: [00:30:56] https://gerrit.wikimedia.org/r/#/c/174328/ [00:31:00] https://gerrit.wikimedia.org/r/#/c/174329/ [00:31:04] ) [00:33:52] ori ^ [00:34:30] qchris: ah, good catch [00:34:36] no need to revert, we can just deploy the earlier revision [00:34:47] i can do that [00:34:52] I never did an EventLogging deploy. [00:34:54] Oh. Thanks. [00:35:01] That'd be great. [00:49:07] bye folks! [00:52:14] qchris, but wait... how did the code get to vanadium if we are testing it in beta labs? [00:52:43] it's deployed to both places at once? cc ori [00:52:51] boy, that makes no sense [00:53:00] I just noticed that same issue on beta labs [00:53:29] nuria__: Not sure what you mean. [00:54:16] qchris: let me see, ori merged this morning, I was going to test in beta labs and i found out things were not working [00:54:17] I did not check beta, but the multi-insert changes have been merged and were deployed in vanadium. [00:54:30] qchris: so... who deployed to vanadium? [00:55:19] qchris, ori: i was waiting to have tested in betalabs before deploying [00:55:20] Not sure. Today during standup, I thought you said ori deployed. I probably misunderstood that then. [00:55:51] But regardless. vanadium is running the old code again and is doing fine. [00:56:27] (as far as I can tell) [00:56:28] qchris: Ok, but it is important to have a workflow in which we do not deploy there before doing the testing [00:57:24] I am glad that ori is around and is still helping us :-) [00:57:54] I like his work a lot. [01:02:15] qchris, nuria__: I deployed, but not deliberately. I'm trying to reconstruct what happened. [01:02:35] ori, qchris : np, as long as it was a mistake is fine [01:03:02] ori, qchris: no big deal, i just thought some ghost thing was deploying as soon as stuff is merged [01:03:14] which would be bad [01:03:15] Yes, it's my error. I probably thought I was operating on Labs when I was on production. [01:03:33] qchris: feeling's mutual, by the way!
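[Editor's note] The regression rolled back above — batched EventLogging inserts exhausting vanadium's memory and hitting Python's recursion limit — is the classic failure mode of recursive or unbounded batching. A minimal sketch of the safe pattern, bounded iterative batching via `executemany` (hypothetical code, not the actual EventLogging consumer; table and column names are made up, and sqlite3 stands in for the real MySQL backend):

```python
import sqlite3
from itertools import islice

def batches(iterable, size):
    """Yield lists of at most `size` items, iteratively (no recursion,
    so deep event streams cannot blow the recursion limit)."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def insert_events(conn, events, batch_size=500):
    """Insert events in bounded batches so memory use stays flat
    regardless of how many events are queued."""
    with conn:  # one transaction; rolls back on error
        for chunk in batches(events, batch_size):
            conn.executemany(
                "INSERT INTO event (schema_name, payload) VALUES (?, ?)",
                chunk,
            )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (schema_name TEXT, payload TEXT)")
# A generator, so events are never all held in memory at once.
insert_events(conn, ((f"Schema{i % 3}", f'{{"n": {i}}}') for i in range(1234)))
print(conn.execute("SELECT COUNT(*) FROM event").fetchone()[0])  # → 1234
```

The key point is that each chunk is materialized, written, and discarded before the next one is built, which is exactly what an OOMing consumer is not doing.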
[01:03:48] ori: no worries, i will work some more on this today, it's great to be able to test in beta labs now [01:17:24] springle: About the m2-master change. I saw that the change is merged. Is it ok to try using it and spare us the meeting in 8 hours, or are other preparations needed too? [01:21:01] springle: ^ [01:24:04] qchris: yes, we can do that [01:24:17] cool. Then I'll give it a shot. [01:24:34] all other m2 clients have switched, so only eventlog to worry about [01:25:29] qchris: looks like that worked from db perspective [01:25:46] Same for the EventLogging perspective. [01:25:58] :-D [01:26:11] ok, now we walk away and don't touch all this for a while :) [01:26:51] That's a great plan! [06:11:37] Good morning, all :) [09:05:17] YuviPanda, Hi [09:07:12] (CR) Yuvipanda: [C: 2] Show author info a query in query details page. [analytics/quarry/web] - https://gerrit.wikimedia.org/r/174152 (https://bugzilla.wikimedia.org/69544) (owner: Rtnpro) [09:07:24] (Merged) jenkins-bot: Show author info a query in query details page. [analytics/quarry/web] - https://gerrit.wikimedia.org/r/174152 (https://bugzilla.wikimedia.org/69544) (owner: Rtnpro) [09:07:25] rtnpro: hi [09:07:30] Merged your patch [09:07:37] Thank you for the contribution! [09:07:38] YuviPanda, cool :) [09:07:59] YuviPanda, I am hungry, give me more issues to work on :D [09:08:28] YuviPanda, which you need fixed on a priority basis [09:08:52] YuviPanda, also, I will add a README for quarry and send it [09:09:10] Yeah [09:09:19] That would be nice ;) [09:09:37] Can you add pagination to the all queries list? [09:09:43] Currently it is not paginated [09:10:11] YuviPanda, okies, I will do it :) [12:34:38] Analytics / Refinery: Raw mobile webrequest partition for 2014-11-18T13/1H not marked successful - https://bugzilla.wikimedia.org/73607 (christian) NEW p:Unprio s:normal a:None The mobile webrequest partition [1] for 2014-11-18T13/1H has not been marked successful. What happened? [1] ___...
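[Editor's note] The "tcp forwarder" springle used as a stopgap (00:11:15) — keep the old m2-master IP answering while relaying every connection to db1046 until the CNAME change lands — is a standard trick. An illustrative toy version (not the actual tool used in production, which would more likely be something like a socat or haproxy relay):

```python
import socket
import threading

def _pipe(src, dst):
    """Copy bytes one way until EOF, then close the write side."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    finally:
        dst.close()

def start_forwarder(target_host, target_port):
    """Listen on an ephemeral localhost port and relay each accepted
    connection to (target_host, target_port). Returns the local port."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    port = srv.getsockname()[1]

    def serve():
        while True:
            client, _ = srv.accept()
            upstream = socket.create_connection((target_host, target_port))
            # One thread per direction; daemon threads die with the process.
            threading.Thread(target=_pipe, args=(client, upstream), daemon=True).start()
            threading.Thread(target=_pipe, args=(upstream, client), daemon=True).start()

    threading.Thread(target=serve, daemon=True).start()
    return port
```

Clients keep connecting to the old address unchanged, which is why qchris still saw db1020's IP: nothing in DNS moves until the CNAME is switched.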
[12:35:25] Analytics / Refinery: Raw webrequest partitions that were not marked successful due to configuration updates - https://bugzilla.wikimedia.org/72300 (christian) [12:35:25] Analytics / Refinery: Raw mobile webrequest partition for 2014-11-18T13/1H not marked successful - https://bugzilla.wikimedia.org/73607#c1 (christian) NEW>RESO/FIX The job doing the sequence number analysis for the partition failed to start properly. The error message was [...] Caused by: ja... [12:37:21] Analytics / Refinery: Raw mobile webrequest partition for 2014-11-18T13/1H not marked successful - https://bugzilla.wikimedia.org/73607#c2 (christian) (In reply to christian from comment #1) > which matches the time of the upgrade of analytics1015 [1]. And here goes the missing footnote: [1] https://... [13:02:55] Analytics / Refinery: Raw webrequest partitions for 2014-11-18T19/2H not marked successful - https://bugzilla.wikimedia.org/73608 (christian) NEW p:Unprio s:normal a:None 6 webrequest partitions [1] for 2014-11-18T19/2H has not been marked successful. What happened? [1] _________________... [13:04:14] Analytics / Refinery: Raw webrequest partitions for 2014-11-18T19/2H not marked successful - https://bugzilla.wikimedia.org/73608#c1 (christian) NEW>RESO/FIX Merging of the commits cdd19aed010ae5100feb907ddd49b44126465b9f e40cfe942461f78b602c972e75dee9e654d120b2 238f3e1bac5616c17a594efce6f... [13:04:14] Analytics / Refinery: Raw webrequest partitions that were not marked successful due to configuration updates - https://bugzilla.wikimedia.org/72300 (christian) [13:05:38] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [13:05:40] Analytics / Refinery: cp1064.eqiad.wmnet lost a kafka message on 2014-11-18T20:05:24 - https://bugzilla.wikimedia.org/73609 (christian) NEW p:Unprio s:normal a:None cp1064.eqiad.wmnet lost message with sequence number 2400268789. 
Neighboring messages had timestamp 2014-11-18T20:05:24. Not... [13:06:22] Analytics / Refinery: Raw webrequest partitions for 2014-11-18T19/2H not marked successful - https://bugzilla.wikimedia.org/73608#c2 (christian) (In reply to christian from comment #1) > [...], let's split that out to a separate bug. It's tracked in bug 73609. [13:08:33] !log Marked raw bits webrequest partition for 2014-11-18T19/2H ok (See {{bug|73608}}) [13:08:44] !log Marked raw mobile webrequest partition for 2014-11-18T20/1H ok (See {{bug|73608}}) [13:08:53] !log Marked raw text webrequest partition for 2014-11-18T19/2H ok (See {{bug|73608}}) [13:09:26] !log Marked raw upload webrequest partition for 2014-11-18T20/1H ok (See {{bug|73608}}, and {{bug|73609}}) [15:15:20] mforns: do you know mysqldump? [15:15:32] it might be more efficient than tsvs in this case, not sure [15:15:32] aha [15:15:45] if you are just doing a mysql > file > other mysql instance [15:15:54] but I could not use mysqldump because of permission problems [15:15:57] hm. [15:16:01] that's strange [15:16:15] not select into outfile [15:16:24] can we have a look at this after the meeting? 
[15:16:25] mysqldump should allow you to access the same data as mysql [15:16:26] sure [15:16:50] i mean, what you have will work i'm sure, [15:36:22] Analytics: Upgrade Analytics Cluster to Trusty, and then to CDH 5.2 - https://phabricator.wikimedia.org/T1200#23456 (Gage) I noticed 1033's console problem too, and created an RT ticket: https://rt.wikimedia.org/Ticket/Display.html?id=8858 [16:20:05] Analytics / Quarry: Show the author of a query in its page - https://bugzilla.wikimedia.org/69544 (Yuvi Panda) PATC>RESO/FIX [17:29:27] YuviPanda, Hey [17:30:26] YuviPanda, for implementing pagination in all queries page in quarry, do we want to write something custom, or do we have the liberty to use a flask app [17:30:43] YuviPanda, I found this: https://pythonhosted.org/Flask-paginate/, seems to be pretty good [17:31:14] YuviPanda, I am not in favor of reinventing the wheel ;) [17:32:03] YuviPanda, also, there are some sample snippets available for pagination in Flask here: http://flask.pocoo.org/snippets/44/ [17:32:13] YuviPanda, what do you suggest? [17:33:26] rtnpro: isn [17:33:43] rtnpro: isn't that retrieving all results every time though? [17:33:59] rtnpro: glanced at it briefly but seems that is the case [17:34:59] nuria__, which one are you referring to? the snippet or Flask-paginate [17:35:09] rtnpro: flask-paginate [17:35:37] nuria__, yes, I agree [17:35:59] nuria__, the docs seemed to be good, but I do not like its code :\ [17:36:05] rtnpro: not such a good option for "unbounded" queries (as in queries w/o determined limits) [17:36:23] nuria__, I agree +1 [17:43:44] rtnpro: what nuria__ said, I think you've to use something custom.
pagination is super hard to get right in a generic way, without running the full query all the time [17:43:53] rtnpro: limit and offset also doesn't work, since that still runs a full scan [17:47:52] YuviPanda, yes, I agree [17:50:26] YuviPanda, I had similar issues with search in MongoDB as well, offset (skip in MongoDB) is very costly [17:50:35] indeed, and it usually is [17:50:54] the usual trick is to paginate on id. save the last id from current resultset, and generate next set as > id [17:51:10] YuviPanda, yes :D [17:52:08] YuviPanda, but I am at a loss that in the above implementation, implementing total page counts and going to a particular page is a pain [17:52:17] that's fine [17:52:26] YuviPanda, cool, then :) [17:52:26] going to a particular page is painful, and that's ok [17:54:44] YuviPanda, I will work on some code on this tomorrow, early morning, and send you for review [17:54:50] thanks :D [19:51:24] Analytics / Dashiki: Remove the "confusing" underported data for Edits and Pages Created - https://bugzilla.wikimedia.org/73617 (Kevin Leduc) NEW p:Unprio s:normal a:None changes in metric definition to include all namespaces for Edits and Pages Created causes a visible step up when viewed... [19:51:35] Analytics / Dashiki: Remove the "confusing" underported data for Edits and Pages Created - https://bugzilla.wikimedia.org/73617 (Kevin Leduc) p:Unprio>High [19:53:06] Analytics / Dashiki: Remove the "confusing" unde reported data for Edits and Pages Created in Vital Signs - https://bugzilla.wikimedia.org/73617 (Kevin Leduc) [19:53:21] Analytics / Dashiki: Remove the "confusing" under reported data for Edits and Pages Created in Vital Signs - https://bugzilla.wikimedia.org/73617 (Kevin Leduc) [20:19:10] milimetric, nuria__: have you seen Jonathan's response?
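[Editor's note] The trick YuviPanda spells out at 17:50:54 is keyset (seek) pagination: remember the last id served and fetch `WHERE id > last_id`, which is an index seek, whereas `LIMIT ... OFFSET n` still scans and discards the n skipped rows. A sketch against sqlite3 (table and column names are illustrative, not Quarry's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE query (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO query (title) VALUES (?)",
                 [(f"query {i}",) for i in range(1, 26)])

def next_page(conn, last_id, page_size=10):
    """Keyset pagination: fetch rows strictly after last_id.
    The primary-key index is seeked, never scanned from the start."""
    rows = conn.execute(
        "SELECT id, title FROM query WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_id
    return rows, new_last

page1, last = next_page(conn, 0)
page2, last = next_page(conn, last)
print(page1[0][0], page1[-1][0], page2[0][0], page2[-1][0])  # → 1 10 11 20
```

As noted in the discussion, the trade-off is real: "jump to page N" and exact total page counts are awkward with this scheme, but "next page" stays cheap no matter how deep you go.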
[20:20:06] mforns, milimetric sounds like what we talked about this morning will be enough then [20:20:35] yes, I think the implementation I did the day before yesterday will do [20:20:43] cool mforns, that's the easiest solution [20:21:00] fine, I'll respond to his mal [20:21:01] mail [20:21:43] kevinator, ^ is it ok with you, too? [20:24:28] nuria__: kaldari asked us in scrum of scrums if we're going to backfill the EL data that went missing [20:24:41] (I don't know the details but I do remember seeing an email from christian about it) [20:25:16] they ran an A/B test during the time we lost data there, and they need to know whether or not we'll restore data from the logs into the DB [20:32:42] milimetric: we need to find out what is wrong with EL db before doing any backfilling [20:33:13] milimetric: christian backfilled earlier a different period where the db host had issues [20:33:22] nuria__: totally agree, I think they just want to know *whether* we will get around to it at some point [20:33:35] that way they know whether to rerun the experiment or not, for example [20:34:07] milimetric: if it's easy I would re-run it, there are different issues by which the db might not store events [20:34:18] milimetric: if logs are good we can backfill [20:34:31] k, cool, i'll tell them that [20:34:39] milimetric: with the issue yesterday likely logs are not so good as machine was OOMing [20:35:33] milimetric: I can check the logs real fast but given that the machine was suffering it is safer to re-run [20:38:38] milimetric: logs for 18th -size wise- don't look so bad but still, i think my recommendation for this particular problem will be to re-run the test [20:41:33] i sent the email to that effect nuria__ [20:44:52] milimetric: nuria__ another option, if re-running test is harder, is to just grep through the events stored in the file for the interesting events, and then just do whatever computations they want on that.
With modest shell / python skills, this is going to be faster than re-running the test. [20:45:14] YuviPanda: if the logs are clean [20:45:27] invalid events shouldn't be that big a problem, I think? [20:45:27] but that's not easy to assume right now, considering that when a disk fills up all bets are off [20:45:36] YuviPanda: but with the machine OOMing you have no guarantee that data is any good for that timeperiod [20:45:38] oooh, the disks filled up? yeah, then all bets are off. [20:45:41] that's true. [20:45:42] yep :) [20:45:50] I thought it was just a db crash [20:46:04] no, sadly we are experiencing #realworldproblems [20:46:13] heh [20:46:14] YuviPanda: backfilling in normal circumstances is easy, as EL arch allows for that easily [20:46:19] yeah [20:46:27] didn't realize current circumstances, so do ignore me :) [20:46:30] YuviPanda: we had two issues 1) disk filling up (13th) [20:46:52] YuviPanda: 2) unfortunate deployment that caused OOms (18th) [20:47:31] ah, right. [20:50:11] nuria__: Hi, how can we rebuild the ui-daily graph that's related to this patch? https://gerrit.wikimedia.org/r/#/c/174055/ [20:51:06] bmansurov: there is an e-mail thread about that already, db is having issues so graph doesn't look good [20:51:22] bmansurov: our dba springle is on the thread and will need to take a look [20:52:19] nuria__: I'll check the mail thread, but isn't that a separate issue? I'd like to see new columns on the left even though their values are NaN [20:53:13] bmansurov: basically queries are returning bad data now looks like so i do not think you can do anything about the graph [20:53:26] nuria__: ok thanks [21:27:47] ottomata, you doing anything to the cluster? [21:27:51] "java.sql.SQLException: Could not establish connection to jdbc:hive2://analytics1027.eqiad.wmnet:10000/wmf_raw: Internal error processing OpenSession" is a new error [21:34:14] hm, i am, but not anything that should do that [21:34:54] looks ok to me [21:35:12] hrrmr [21:35:15] okay. 
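[Editor's note] YuviPanda's grep-the-logs suggestion from 20:44:52 looks roughly like this, assuming the raw event log stores one JSON object per line with a schema field (a sketch under that assumption; the schema and field names below are illustrative, and since lines written while the box was OOMing may be truncated, malformed lines are skipped rather than trusted):

```python
import json

def backfill_events(lines, schema_name):
    """Yield parsed events for one schema from raw log lines.
    Truncated or otherwise malformed lines are silently skipped,
    which matters when the logs were written by a struggling host."""
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:  # covers json.JSONDecodeError
            continue
        if event.get("schema") == schema_name:
            yield event

# Hypothetical sample lines, including one truncated mid-write:
sample = [
    '{"schema": "MobileWebUIClickTracking", "event": {"name": "back"}}',
    '{"schema": "OtherSchema", "event": {}}',
    '{"schema": "MobileWebUIClickTra',
    '{"schema": "MobileWebUIClickTracking", "event": {"name": "menu"}}',
]
hits = list(backfill_events(sample, "MobileWebUIClickTracking"))
print(len(hits))  # → 2
```

This recovers whatever the logs actually captured, which is exactly nuria__'s caveat: it only helps if the logs for the affected window are trustworthy in the first place.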
Thanks! [21:35:27] btw, your lib googly woogly is installed [21:36:15] also, Ironholds, i'm almost done with hadoop worker upgrade to trusty [21:36:21] wheee [21:36:25] that means that newer R and python should be on those by default now [21:36:27] yep, I saw, replied to email with "yay!" [21:36:30] awright [21:36:32] * Ironholds pops knuckles [21:36:36] you know what THIS means. [21:36:40] i don't actually! [21:36:44] i don't have cdh 5.2 up yet [21:36:48] that will come after [21:36:49] :) [21:37:14] it means that I can fork out code over the entire cluster [22:01:32] bmansurov: I want to make sure I know exactly what you mean since both you and Jon talked about this graph [22:01:38] what I see: [22:01:52] graph has two missing days in March (around the 19th) [22:02:11] graph has some relatively very low values around July [22:02:25] graph has a huge spike starting November 14th [22:02:33] which of those issues are you talking about? [22:03:51] milimetric: I think Jon was talking about the spike starting November 14th. I'm interested in the new events that were added recently, but not showing up in the graph. [22:04:06] ah, I see [22:04:09] milimetric: If you look at http://stat1001.wikimedia.org/limn-public-data/mobile/datafiles/ui-daily.csv all new values are 0 [22:06:30] bmansurov: hm, I'm not seeing what you mean [22:06:45] those values look to be 0 when those events were not being generated [22:06:59] but start being non-zero at different points (I assume when they start being generated) [22:07:06] (make sure you don't have a cached version of that file) [22:07:24] milimetric: all good then, just Jon's issue [22:07:26] on November 19th, for example, I see all columns with values [22:07:51] k, cool, let me know if there's anything else [22:08:12] hey milimetric ! [22:08:16] milimetric: oh, one more thing [22:08:21] want to talk about the bugs in this sprint? [22:08:27] sure kevinator [22:08:32] (i'm still listening bmansurov) [22:08:36] batcave or IRC?
[22:08:39] milimetric: those new values don't show up on the left of the graph along with other existing values http://mobile-reportcard.wmflabs.org/graphs/ui-daily [22:08:45] whichever you prefer kevinator [22:08:52] let’s do the batcave [22:09:15] milimetric: I mean the labels [22:09:43] gotcha, in the legend bmansurov [22:09:53] yes [22:10:17] one sec, i'll look at the graph definition [22:14:39] Analytics / Dashiki: Story: User selects breakdown in Vital Signs - https://bugzilla.wikimedia.org/72739#c2 (Dan Andreescu) For this story, we're going to make the DailyPageviews metric work with the breakdown, since the mobile data is already available in the files we download. [22:14:41] Analytics / Dashiki: Story: User selects breakdown in Vital Signs - https://bugzilla.wikimedia.org/72739 (Dan Andreescu) a:Dan Andreescu [22:15:29] Analytics / Dashiki: Story: User selects breakdown in Vital Signs - https://bugzilla.wikimedia.org/72739 (Dan Andreescu) [22:15:29] Analytics / Wikimetrics: Story: AnalyticsEng has editor_day table in labsdb - https://bugzilla.wikimedia.org/69145 (Dan Andreescu) [22:15:29] Analytics / Dashiki: Story: Vital Signs User selects the Daily Pageviews metrics - https://bugzilla.wikimedia.org/72740 (Dan Andreescu) [22:15:30] Analytics / EventLogging: database consumer could batch inserts (sometimes) - https://bugzilla.wikimedia.org/67450 (Dan Andreescu) [22:15:35] Analytics / Wikimetrics: Story: WikimetricsUser reports pages edited by cohort - https://bugzilla.wikimedia.org/73072 (Dan Andreescu) [22:15:58] milimetric: thanks [22:24:18] Analytics: Upgrade Analytics Cluster to Trusty, and then to CDH 5.2 - https://phabricator.wikimedia.org/T1200#23611 (Ottomata) Completed today: ``` 1034 1035 1036 1037 1038 1039 1040 1041 ``` I need to figure out what's up with 1033's console. Then I can proceed with analytics1027. Then I need to think ab... 
[22:27:00] bmansurov: this definition is not updated to use the CSV datafile properly: https://raw.githubusercontent.com/wikimedia/analytics-limn-mobile-data/master/graphs/ui-daily.json [22:27:06] and hang on, probably the datasource is not either [22:27:51] milimetric: this is the related patch https://gerrit.wikimedia.org/r/#/c/174055/ [22:27:54] bmansurov: yeah, the datasource is not: https://github.com/wikimedia/analytics-limn-mobile-data/blob/master/datasources/ui-daily.json [22:28:05] bmansurov: yeah, that only changes the SQL [22:28:19] milimetric: could you give me some info about how to add the remaining parts [22:28:27] milimetric: a link to the documentation maybe? [22:28:27] the datasource and graph, unfortunately (very unfortunately) have to be edited manually [22:28:35] milimetric: i see [22:28:38] :) I'm, sadly, the only documentation [22:29:02] one sec, i gotta jump in a meeting [22:29:10] milimetric: ok thanks [22:29:14] rather - half hour - i'll ping after [22:29:18] sure [22:32:00] bmansurov: i'll write an email to mobile-tech about it? [22:32:07] milimetric: sounds good [22:58:16] ggellerman: I'm milimetric [22:58:24] doh... lol, I mean - Dan == milimetric [23:49:31] see you tomorrow folks! [23:50:19] RAAAAR out of memory error! okay, i re-run hql [23:50:52] ciao